liampshaw / Pathogen-host-range

MIT License

your indexed records for Guan et al. 2003 #4

Open jhpoelen opened 3 years ago

jhpoelen commented 3 years ago

Hi @liampshaw -

Thanks for helping to make existing biotic interaction data easier to find and access!

I was just looking at your GloBI indexed record at https://www.globalbioticinteractions.org/?interactionType=interactsWith&sourceTaxon=Rhinolophus%20pusillus&targetTaxon=Coronaviridae and I was wondering how you are handling your PubMed references.

For instance, on 2021-10-09T05:33:05.476Z , I accessed the attached file with hash://sha256/1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf at https://github.com/liampshaw/Pathogen-host-range/raw/master/data/PathogenVsHostDB-2019-05-30.csv .

I noticed that PubMed reference id 17451830 is associated with the (short) reference Guan et al. 2003, but also with Wang et al. 2005 and [blank].

I produced the table below using:

$ cat 1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf | tr '\r' '\n' | head -n1 > shaw2020Notes17451830.tsv
$ cat 1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf | tr '\r' '\n' | grep "17451830" >> shaw2020Notes17451830.tsv

However, when looking up https://pubmed.ncbi.nlm.nih.gov/17451830/ , it appears that the authors of the review article are Zhengli Shi and Zhihong Hu.

Can you please help me understand how the short reference and the PMID are related?

| | Type | Phylum | Class | Order | Family | Genus | Species | Human | Zoonotic | Domestic | Wildlife | HostGroup | HostOrder | HostFamily | HostSpecies | HostName | Association | Method | Reference | PMID | Additional | Synonym | Disease | Taylor | Emerging | Emerging.Source | VectorBorne | GramStain | Motility | Spore | Oxygen | Cell | Infection | Abbreviation | Strain | Genome | Gsize | ProteinCount | PercentGC | HostSpeciesPHB | HostSpecies.new | N.genomes | Genome.size | Genome.GC | Genome.genes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7604 | Virus | NA | NA | Nidovirales | Coronaviridae | Betacoronavirus | Severe acute respiratory syndrome-related coronavirus | Yes | Yes | No | Yes | Carnivora | Carnivora | Mustelidae | Melogale moschata | Chinese ferret-badger | na | Antibodies | Guan et al. 2003 | 17451830 | | | | No | Yes | Jones | No | NA | NA | NA | NA | NA | NA | SARSr-CoV | | (+) ssRNA | 0.029751 | 14 | 40.8 | Melogalemoschata | Melogale moschata | 1 | 0.029751 | 40.8 | 13 |
| 7606 | Virus | NA | NA | Nidovirales | Coronaviridae | Betacoronavirus | Severe acute respiratory syndrome-related coronavirus | Yes | Yes | No | Yes | Chiroptera | Chiroptera | Pteropodidae | Rousettus leschenaultii | Leschenault's rousette | na | Antibodies | | 17451830 | | | | No | Yes | Jones | No | NA | NA | NA | NA | NA | NA | SARSr-CoV | | (+) ssRNA | 0.029751 | 14 | 40.8 | Rousettusleschenaultii | Rousettus leschenaultii | 1 | 0.029751 | 40.8 | 13 |
| 7610 | Virus | NA | NA | Nidovirales | Coronaviridae | Betacoronavirus | Severe acute respiratory syndrome-related coronavirus | Yes | Yes | No | Yes | Chiroptera | Chiroptera | Rhinolophidae | Rhinolophus macrotis | Big-eared horseshoe bat | na | PCR | | 17451830 | ViPR | | | No | Yes | Jones | No | NA | NA | NA | NA | NA | NA | SARSr-CoV | | (+) ssRNA | 0.029751 | 14 | 40.8 | Rhinolophusmacrotis | Rhinolophus macrotis | 1 | 0.029751 | 40.8 | 13 |
| 7612 | Virus | NA | NA | Nidovirales | Coronaviridae | Betacoronavirus | Severe acute respiratory syndrome-related coronavirus | Yes | Yes | No | Yes | Chiroptera | Chiroptera | Rhinolophidae | Rhinolophus pusillus | Least horseshoe bat | na | PCR | | 17451830 | ViPR | | | No | Yes | Jones | No | NA | NA | NA | NA | NA | NA | SARSr-CoV | | (+) ssRNA | 0.029751 | 14 | 40.8 | Rhinolophuspusillus | Rhinolophus pusillus | 1 | 0.029751 | 40.8 | 13 |
| 7613 | Virus | NA | NA | Nidovirales | Coronaviridae | Betacoronavirus | Severe acute respiratory syndrome-related coronavirus | Yes | Yes | No | Yes | Chiroptera | Chiroptera | Rhinolophidae | Rhinolophus sinicus | Chinese rufous horseshoe bat | na | PCR | | 17451830 | ViPR | | | No | Yes | Jones | No | NA | NA | NA | NA | NA | NA | SARSr-CoV | | (+) ssRNA | 0.029751 | 14 | 40.8 | Rhinolophussinicus | Rhinolophus sinicus | 1 | 0.029751 | 40.8 | 13 |
| 7615 | Virus | NA | NA | Nidovirales | Coronaviridae | Betacoronavirus | Severe acute respiratory syndrome-related coronavirus | Yes | Yes | No | Yes | Ungulates | Artiodactyla | Suidae | Sus scrofa | Wild boar | na | PCR | Wang et al. 2005 | 17451830 | | | | No | Yes | Jones | No | NA | NA | NA | NA | NA | NA | SARSr-CoV | | (+) ssRNA | 0.029751 | 14 | 40.8 | Susscrofa | Sus scrofa | 1 | 0.029751 | 40.8 | 13 |

shaw2020Notes17451830.tsv.txt

PathogenVsHostDB-2019-05-30.csv
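
For readers who prefer to repeat the check above without shell tools, here is a minimal Python sketch that lists the distinct short references recorded against each PMID. It assumes the file is comma-delimited and has `Reference` and `PMID` columns as in the table above. The function name `short_references_by_pmid` and the four-column sample are illustrative only; the sample is a cut-down excerpt of the rows shown in the thread, not the real file.

```python
import csv
from collections import defaultdict


def short_references_by_pmid(csv_text):
    """Map each PMID to the set of distinct 'Reference' values seen with it."""
    # splitlines() treats "\r", "\n" and "\r\n" alike, which stands in for
    # the `tr '\r' '\n'` step in the shell commands. Assumption: no quoted
    # field contains an embedded line break.
    reader = csv.DictReader(csv_text.splitlines())
    refs = defaultdict(set)
    for row in reader:
        pmid = (row.get("PMID") or "").strip()
        if pmid:
            refs[pmid].add((row.get("Reference") or "").strip())
    return refs


# Cut-down excerpt with CR-only line endings (illustrative, not the real file).
sample = (
    "HostSpecies,Method,Reference,PMID\r"
    "Melogale moschata,Antibodies,Guan et al. 2003,17451830\r"
    "Rhinolophus pusillus,PCR,,17451830\r"
    "Sus scrofa,PCR,Wang et al. 2005,17451830\r"
)

refs = short_references_by_pmid(sample)
print(sorted(refs["17451830"]))
# prints ['', 'Guan et al. 2003', 'Wang et al. 2005']
```

The empty string in the result corresponds to the rows where the short reference is blank.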

jhpoelen commented 3 years ago

fyi @kephelps @n8upham

liampshaw commented 3 years ago

Thanks @jhpoelen for flagging this. It's a good question. I believe that where there is a "short reference", it points to another paper that is cited within the paper referred to by the PMID. Here, the paper referred to by the PMID is the review by Shi & Hu (PMID: 17451830). For example, the first row of the table you include means that Shi & Hu cite Guan et al. 2003 as evidence of SARS-like coronavirus infection of Melogale moschata. (Guan et al. 2003 itself has PMID: 12958366.)

I appreciate this might not be as helpful as it could be. I'm double-checking this with my colleagues who completed the literature review, and I agree we should make this clear!

jhpoelen commented 3 years ago

@liampshaw Thanks for taking the time to respond and for explaining that the reference and the PMID are in the same chain of evidence provenance, but not necessarily the same. As far as I understand, your "reference" column refers to primary (or at least more primary) literature that documents evidence of the claim.

I imagine that some review articles cite other review articles that . . . etc. I hope to work with you and your colleagues to figure out a pragmatic way to capture these chains of evidence. I bet some claims cited in numerous papers hinge on a single specimen that was collected in some cave in the 60s. That would be good to know, right?

Curious to hear your thoughts.
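
One way to picture the "chain of evidence" discussed above is as a small linked structure in which each paper records the sources it relies on for a claim. The sketch below is only an illustration: the `Source` class, its fields, and `primary_sources` are hypothetical names, and the links between Shi & Hu and the two short references follow the interpretation given earlier in this thread rather than a check of the papers themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """One paper in a chain of evidence (all field names are hypothetical)."""
    label: str                                  # e.g. a short reference
    pmid: str = ""                              # empty when unknown
    cites: list = field(default_factory=list)   # sources relied on for the claim


def primary_sources(source, _seen=None):
    """Return the labels of the sources at the end of every citation path."""
    seen = set() if _seen is None else _seen
    if id(source) in seen:                      # guard against citation loops
        return []
    seen.add(id(source))
    if not source.cites:                        # nothing further to follow
        return [source.label]
    found = []
    for cited in source.cites:
        for label in primary_sources(cited, seen):
            if label not in found:
                found.append(label)
    return found


guan = Source("Guan et al. 2003", pmid="12958366")
wang = Source("Wang et al. 2005")
review = Source("Shi & Hu", pmid="17451830", cites=[guan, wang])

print(primary_sources(review))
# prints ['Guan et al. 2003', 'Wang et al. 2005']
```

A review citing another review would simply add one more level of `cites`, and the walk would still end at the papers that cite nothing further for the claim.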

liampshaw commented 3 years ago

@jhpoelen - yes that's my understanding. I'll update this with thoughts from my colleagues once I have them.

I think you are right about the "single specimen in a cave" problem. We aimed to be as comprehensive as possible so included such cases from manual literature review, but it does feel uneasy to treat all associations equally.

I agree that working out a realistic and reliable way to represent those chains would be extremely valuable. It would be great if this could become automatic rather than manual. Thinking out loud...: I'm not an expert on citation-tracing but suspect generating a temporal network of citing articles is relatively OK. But then I'm not sure how much it helps in itself, because one would expect all studies of a certain association to cite the earliest study - even if they themselves provide new evidence. There might be a way of quantifying the network diversity that in some way reflects the strength of evidence chain: citation network summary statistics could be computed for all associations and then linked to manual assessment of the evidence strength (e.g. for n=100 associations) to give some heuristics for what 'strong' and 'weak' chains look like in network terms. I haven't paid attention to recent studies that don't use manual literature review so not sure what the state-of-the-art is.

Would be interested in your ideas on this!
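
As a rough illustration of the "citation network summary statistics" idea floated above, the sketch below computes three simple numbers for one association's citation graph: how many papers are involved, how many of them cite nothing else in the set ("roots"), and the longest citation path. The function name `chain_summary`, the input format, and the choice of statistics are all assumptions made for this example; nobody in the thread proposes these particular measures, and the paper labels are invented.

```python
def chain_summary(citations):
    """Summarise a citation graph given as {paper: [papers it cites]}.

    Papers that appear only as cited items are treated as citing nothing.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)

    # Roots: papers that cite no other paper in this set.
    roots = {p for p in papers if not citations.get(p)}

    def depth(paper, trail=()):
        # Length of the longest citation path starting at `paper`.
        if paper in trail:                      # citation loop: stop here
            return 0
        cited = citations.get(paper, [])
        if not cited:
            return 0
        return 1 + max(depth(c, trail + (paper,)) for c in cited)

    return {
        "n_papers": len(papers),
        "n_roots": len(roots),
        "max_depth": max((depth(p) for p in papers), default=0),
    }


# Invented example: every later paper traces back to a single early report.
single_root = {"B": ["A"], "C": ["B"], "D": ["A", "C"]}
# Invented example: one review resting on three independent reports.
three_roots = {"R": ["P1", "P2", "P3"]}

print(chain_summary(single_root))
# prints {'n_papers': 4, 'n_roots': 1, 'max_depth': 3}
print(chain_summary(three_roots))
# prints {'n_papers': 4, 'n_roots': 3, 'max_depth': 1}
```

The caveat raised above still applies: a study that contributes new evidence but also cites the earliest report would not count as a root here, so a low root count would need manual checking before being read as weak evidence.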

jhpoelen commented 3 years ago

also cc-ing @arw36 Anna W. who did a ton of work on the HP3 dataset (aka Olival2017).

Aehtela commented 3 years ago

@jhpoelen

Thanks for bringing up this issue. As @liampshaw has already noted, "PMID"!="Reference", although there are certainly entries for which the primary literature listed in "Reference" DOES correspond to the primary literature listed in "PMID".

As you can see, there are three Primary Literature (pathogen-host association source information) related columns in the database - “Reference”, “PMID” and “Additional”. Of these, the most consistent would be the “PMID” column - almost all pathogen-host associations in the database have a PMID entry and if there isn’t one, then the identifier for a Primary Literature source will be listed in the “Additional” column (these are generally JSTOR links, or doi, etc, for which the source literature cited does not have a PMID). In some cases, the "Additional" column may also have another PMID identifier listed - this is the case for any pathogen-host association in which the primary literature listed in the "PMID" entry only gave evidence of isolation of the pathogen from a host animal but did not have information on the pathogenicity.

The "Reference" column is somewhat of a vestigial information dump that I kept from earlier versions of the database. When I first started compiling the database, there was only a single “Source” column, into which I would enter a short reference (e.g., Taylor et al. 2001) for the Primary Literature through which I identified a pathogen-host association.

I eventually realized that this was a major oversight as short references are not nearly sufficient to look up and check the source of information accurately. Hence the "PMID" and "Additional" columns were added to the database to ensure that anyone could trace the source information easily. However, I decided to retain the original “Source” column, and renamed it "Reference". For any pathogen-host associations that were added to the database after I started recording PMIDs (or alternative standard identifiers), I would just enter these identifiers rather than adding a short reference to the "Reference" column, which is why many rows have a blank "Reference".
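
The column roles described above could be applied when indexing a row roughly as follows. This is a sketch under the assumptions stated in this comment (use "PMID" when present, otherwise fall back to "Additional", and treat "Reference" as an optional short citation); the function name `resolve_source` and the output keys are hypothetical, and the second example row uses a made-up placeholder link.

```python
def resolve_source(row):
    """Pick the traceable identifier for one database row (dict of column -> value)."""
    pmid = (row.get("PMID") or "").strip()
    additional = (row.get("Additional") or "").strip()
    reference = (row.get("Reference") or "").strip()

    if pmid:
        identifier = "PMID:" + pmid
    elif additional:
        # No PMID: "Additional" is expected to carry a DOI, JSTOR link, etc.
        identifier = additional
    else:
        identifier = None

    return {
        "identifier": identifier,
        "short_reference": reference or None,
        # When both columns are filled, "Additional" holds extra information.
        "extra": additional if pmid and additional else None,
    }


# Row values taken from the table earlier in the thread.
print(resolve_source({"Reference": "", "PMID": "17451830", "Additional": "ViPR"}))
# prints {'identifier': 'PMID:17451830', 'short_reference': None, 'extra': 'ViPR'}

# Hypothetical row without a PMID; the link is a placeholder.
print(resolve_source({"Reference": "", "PMID": "", "Additional": "https://example.org/source"}))
# prints {'identifier': 'https://example.org/source', 'short_reference': None, 'extra': None}
```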

You are also absolutely correct that the strength of evidence for all pathogen-host associations is certainly uneven. At one point we considered adding a column that counted the total number of citation hits for a pathogen-host association (e.g., Severe acute respiratory syndrome-related coronavirus AND Rhinolophus pusillus) but ended up abandoning this idea. Unfortunately, we never did zero in on a pragmatic way of capturing this information. We did, however, try to capture the "strength of information" provided in the Primary Literature cited, using the "Association" (clear evidence of pathogenicity vs. not mentioned) and "Method" (antibodies vs. PCR) columns.

It would be great to figure something out eventually!

Please let me know if anything else comes up or remains unclear.
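
The abandoned "citation hits" column would have needed a literature search per association. A much cruder proxy that stays inside the database is sketched below: group rows by pathogen-host pair and collect the distinct PMIDs and detection methods recorded for each. This is a different technique from the citation count described above, the function name `evidence_by_association` is hypothetical, and the third sample row (PMID "00000000") is invented purely to show a pair with two sources.

```python
from collections import defaultdict


def evidence_by_association(rows):
    """Group rows by (pathogen species, host species) and collect PMIDs and methods."""
    summary = defaultdict(lambda: {"pmids": set(), "methods": set()})
    for row in rows:
        key = (row["Species"], row["HostSpecies"])
        if row.get("PMID"):
            summary[key]["pmids"].add(row["PMID"])
        if row.get("Method"):
            summary[key]["methods"].add(row["Method"])
    return dict(summary)


sars = "Severe acute respiratory syndrome-related coronavirus"
rows = [
    {"Species": sars, "HostSpecies": "Melogale moschata", "Method": "Antibodies", "PMID": "17451830"},
    {"Species": sars, "HostSpecies": "Rhinolophus pusillus", "Method": "PCR", "PMID": "17451830"},
    # Invented row: a second, made-up source for the same pair.
    {"Species": sars, "HostSpecies": "Rhinolophus pusillus", "Method": "Antibodies", "PMID": "00000000"},
]

summary = evidence_by_association(rows)
for (pathogen, host), info in summary.items():
    print(host, len(info["pmids"]), sorted(info["methods"]))
# prints:
# Melogale moschata 1 ['Antibodies']
# Rhinolophus pusillus 2 ['Antibodies', 'PCR']
```

Counting distinct PMIDs says nothing about whether those papers rest on the same underlying observation, which is the chain-of-evidence problem raised earlier in the thread.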

jhpoelen commented 3 years ago

@Aehtela thanks for taking the time to explain the history behind the reference, pmid and additional columns.

Thanks for having the foresight to include the "PMID" and "Additional" columns.

I'll have to think a little more about how to make it easier to index your dataset now that I better understand your table schema/design.

@Aehtela @liampshaw Do you mind keeping this issue open for now?

liampshaw commented 3 years ago

@jhpoelen - fine to keep it open, and worth doing so for others to be made aware of how this works. It seems unlikely from my end that I'll change anything about the stored dataset in response to the issue, but I think it is definitely an 'open' problem how best to deal with it for future analyses.