your indexed records for Guan et al. 2003

jhpoelen commented 3 years ago

Hi @liampshaw -

Thanks for helping to make existing biotic interaction data easier to find and access!

I was just looking at your GloBI indexed record at https://www.globalbioticinteractions.org/?interactionType=interactsWith&sourceTaxon=Rhinolophus%20pusillus&targetTaxon=Coronaviridae and I was wondering how you are handling your PubMed references.

For instance, on 2021-10-09T05:33:05.476Z, I accessed the attached file with content hash hash://sha256/1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf at https://github.com/liampshaw/Pathogen-host-range/raw/master/data/PathogenVsHostDB-2019-05-30.csv .

I noticed that PubMed reference ID 17451830 is associated with the (short) reference Guan et al. 2003, but also with Wang et al. 2005 and [blank].

I produced the table below using:

$ cat 1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf | tr '\r' '\n' | head -n1 > shaw2020Notes17451830.tsv
$ cat 1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf | tr '\r' '\n' | grep "17451830" >> shaw2020Notes17451830.tsv

However, when looking up https://pubmed.ncbi.nlm.nih.gov/17451830/ , it appears that the authors of the review article are Zhengli Shi and Zhihong Hu.

Can you please help me understand how the short reference and the PMID are related?

	Type	Phylum	Class	Order	Family	Genus	Species	Human	Zoonotic	Domestic	Wildlife	HostGroup	HostOrder	HostFamily	HostSpecies	HostName	Association	Method	Reference	PMID	Additional	Taylor	Emerging	Emerging.Source	VectorBorne	GramStain	Motility	Spore	Oxygen	Cell	Infection	Abbreviation	Genome	Gsize	ProteinCount	PercentGC	HostSpeciesPHB	HostSpecies.new	N.genomes	Genome.size	Genome.GC	Genome.genes
7604	Virus	NA	NA	Nidovirales	Coronaviridae	Betacoronavirus	Severe acute respiratory syndrome-related coronavirus	Yes	Yes	No	Yes	Carnivora	Carnivora	Mustelidae	Melogale moschata	Chinese ferret-badger	na	Antibodies	Guan et al. 2003	17451830		No	Yes	Jones	No	NA	NA	NA	NA	NA	NA	SARSr-CoV	(+) ssRNA	0.029751	14	40.8	Melogalemoschata	Melogale moschata	1	0.029751	40.8	13
7606	Virus	NA	NA	Nidovirales	Coronaviridae	Betacoronavirus	Severe acute respiratory syndrome-related coronavirus	Yes	Yes	No	Yes	Chiroptera	Chiroptera	Pteropodidae	Rousettus leschenaultii	Leschenault's rousette	na	Antibodies		17451830		No	Yes	Jones	No	NA	NA	NA	NA	NA	NA	SARSr-CoV	(+) ssRNA	0.029751	14	40.8	Rousettusleschenaultii	Rousettus leschenaultii	1	0.029751	40.8	13
7610	Virus	NA	NA	Nidovirales	Coronaviridae	Betacoronavirus	Severe acute respiratory syndrome-related coronavirus	Yes	Yes	No	Yes	Chiroptera	Chiroptera	Rhinolophidae	Rhinolophus macrotis	Big-eared horseshoe bat	na	PCR		17451830	ViPR	No	Yes	Jones	No	NA	NA	NA	NA	NA	NA	SARSr-CoV	(+) ssRNA	0.029751	14	40.8	Rhinolophusmacrotis	Rhinolophus macrotis	1	0.029751	40.8	13
7612	Virus	NA	NA	Nidovirales	Coronaviridae	Betacoronavirus	Severe acute respiratory syndrome-related coronavirus	Yes	Yes	No	Yes	Chiroptera	Chiroptera	Rhinolophidae	Rhinolophus pusillus	Least horseshoe bat	na	PCR		17451830	ViPR	No	Yes	Jones	No	NA	NA	NA	NA	NA	NA	SARSr-CoV	(+) ssRNA	0.029751	14	40.8	Rhinolophuspusillus	Rhinolophus pusillus	1	0.029751	40.8	13
7613	Virus	NA	NA	Nidovirales	Coronaviridae	Betacoronavirus	Severe acute respiratory syndrome-related coronavirus	Yes	Yes	No	Yes	Chiroptera	Chiroptera	Rhinolophidae	Rhinolophus sinicus	Chinese rufous horseshoe bat	na	PCR		17451830	ViPR	No	Yes	Jones	No	NA	NA	NA	NA	NA	NA	SARSr-CoV	(+) ssRNA	0.029751	14	40.8	Rhinolophussinicus	Rhinolophus sinicus	1	0.029751	40.8	13
7615	Virus	NA	NA	Nidovirales	Coronaviridae	Betacoronavirus	Severe acute respiratory syndrome-related coronavirus	Yes	Yes	No	Yes	Ungulates	Artiodactyla	Suidae	Sus scrofa	Wild boar	na	PCR	Wang et al. 2005	17451830		No	Yes	Jones	No	NA	NA	NA	NA	NA	NA	SARSr-CoV	(+) ssRNA	0.029751	14	40.8	Susscrofa	Sus scrofa	1	0.029751	40.8	13

shaw2020Notes17451830.tsv.txt

PathogenVsHostDB-2019-05-30.csv

jhpoelen commented 3 years ago

fyi @kephelps @n8upham

liampshaw commented 3 years ago

Thanks @jhpoelen for flagging this. It's a good question. I believe that where there is a "short reference", it points to another paper that is cited within the paper identified by the PMID. Here, that is the review paper of Shi & Hu (PMID: 17451830). For example, the first row of the table you include means that Shi & Hu cite Guan et al. 2003 as evidence of SARS-like coronavirus infection of Melogale moschata. (Guan et al. 2003 itself has PMID: 12958366.)
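To make the relationship concrete, here is a minimal sketch that lists each distinct PMID / short reference pair once. It assumes a tab-separated file with a header row; the file names `sample.tsv` and `refs_by_pmid.tsv` are made up for illustration, and the sample keeps only four of the table's columns, with values copied from the rows above.

```shell
# Small tab-separated sample holding only the relevant columns from the table above.
printf 'HostSpecies\tReference\tPMID\tAdditional\n' > sample.tsv
printf 'Melogale moschata\tGuan et al. 2003\t17451830\t\n' >> sample.tsv
printf 'Rhinolophus pusillus\t\t17451830\tViPR\n' >> sample.tsv
printf 'Sus scrofa\tWang et al. 2005\t17451830\t\n' >> sample.tsv

# Find the columns by header name, then print every distinct PMID / Reference pair once.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  {
    ref = $(col["Reference"])
    if (ref == "") ref = "[blank]"          # make empty short references visible
    key = $(col["PMID"]) "\t" ref
    if (!(key in seen)) { seen[key] = 1; print key }
  }
' sample.tsv > refs_by_pmid.tsv

cat refs_by_pmid.tsv
```

For this sample the output has three lines, all with PMID 17451830: one for Guan et al. 2003, one for [blank], and one for Wang et al. 2005. Each line is a source that the review paper stands in for.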

I appreciate this might not be as helpful as it could be. I'm double-checking this with my colleagues who completed the literature review - I appreciate we should make this clear!

jhpoelen commented 3 years ago

@liampshaw Thanks for taking the time to respond and for explaining that the reference and the PMID are in the same chain of evidence provenance, but not necessarily the same. As far as I understand, your "reference" column refers to primary (or at least more primary) literature that documents evidence of the claim.

I imagine that some review articles cite other review articles that . . . etc. I hope to work with you and your colleagues to figure out a pragmatic way to capture these chains of evidence. I bet some claims cited in numerous papers hinge on a single specimen that was collected in some cave in the 60s. That would be good to know, right?

Curious to hear your thoughts.

liampshaw commented 3 years ago

@jhpoelen - yes that's my understanding. I'll update this with thoughts from my colleagues once I have them.

I think you are right about the "single specimen in a cave" problem. We aimed to be as comprehensive as possible so included such cases from manual literature review, but it does feel uneasy to treat all associations equally.

I agree that working out a realistic and reliable way to represent those chains would be extremely valuable. It would be great if this could become automatic rather than manual. Thinking out loud...: I'm not an expert on citation-tracing but suspect generating a temporal network of citing articles is relatively OK. But then I'm not sure how much it helps in itself, because one would expect all studies of a certain association to cite the earliest study - even if they themselves provide new evidence. There might be a way of quantifying the network diversity that in some way reflects the strength of evidence chain: citation network summary statistics could be computed for all associations and then linked to manual assessment of the evidence strength (e.g. for n=100 associations) to give some heuristics for what 'strong' and 'weak' chains look like in network terms. I haven't paid attention to recent studies that don't use manual literature review so not sure what the state-of-the-art is.
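As a rough illustration of the kind of summary statistic I mean, here is a sketch that counts, per association, the distinct papers in the citation chain and the "root" papers that cite nothing else in the chain. Everything in it is hypothetical: the edge-list format (association, citing paper, cited paper), the file names, and the paper labels. As noted above, a single root does not by itself mean weak evidence, so this would only be an input to a manual assessment.

```shell
# Hypothetical edge list: association <TAB> citing paper <TAB> cited paper.
printf 'assocA\tpaper2\tpaper1\n'  > citations.tsv
printf 'assocA\tpaper3\tpaper1\n' >> citations.tsv
printf 'assocA\tpaper3\tpaper2\n' >> citations.tsv
printf 'assocB\tpaper6\tpaper4\n' >> citations.tsv
printf 'assocB\tpaper6\tpaper5\n' >> citations.tsv

# Per association: number of distinct papers, and number of papers that cite nothing (roots).
awk -F'\t' '
  {
    assoc[$1] = 1
    paper[$1 SUBSEP $2] = 1        # citing paper belongs to the chain
    paper[$1 SUBSEP $3] = 1        # cited paper belongs to the chain
    cites[$1 SUBSEP $2] = 1        # citing paper has at least one outgoing citation
  }
  END {
    for (k in paper) {
      split(k, part, SUBSEP)
      papers[part[1]]++
      if (!(k in cites)) roots[part[1]]++
    }
    for (a in assoc) print a "\t" papers[a] "\t" (roots[a] + 0)
  }
' citations.tsv | sort > chain_stats.tsv

cat chain_stats.tsv
```

With this made-up input, assocA has 3 papers tracing back to 1 root, while assocB has 3 papers and 2 independent roots.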

Would be interested in your ideas on this!

jhpoelen commented 3 years ago

also cc-ing @arw36 Anna W. who did a ton of work on the HP3 dataset (aka Olival2017).

Aehtela commented 3 years ago

@jhpoelen

Thanks for bringing up this issue. As @liampshaw has already noted, "PMID"!="Reference", although there are certainly entries for which the primary literature listed in "Reference" DOES correspond to the primary literature listed in "PMID".

As you can see, there are three Primary Literature (pathogen-host association source information) related columns in the database - “Reference”, “PMID” and “Additional”. Of these, the most consistent is the “PMID” column - almost all pathogen-host associations in the database have a PMID entry, and if there isn’t one, the identifier for a Primary Literature source is listed in the “Additional” column (these are generally JSTOR links, DOIs, etc., used when the cited source literature does not have a PMID). In some cases, the "Additional" column may also list another PMID - this is the case for any pathogen-host association in which the primary literature listed in the "PMID" entry only gave evidence of isolation of the pathogen from a host animal but did not have information on the pathogenicity.
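For anyone indexing the table, the lookup order described above could be sketched as follows: use "PMID" when it is filled in, otherwise fall back to "Additional". This is only an illustration under stated assumptions: the file names and the `SourceId` output column are made up, the second sample row is invented, and treating `NA` as a missing PMID is a guess based on how other columns mark missing values.

```shell
# Sample rows: the first is taken from the table earlier in this thread, the second is invented.
printf 'HostSpecies\tPMID\tAdditional\n' > sources.tsv
printf 'Rhinolophus pusillus\t17451830\tViPR\n' >> sources.tsv
printf 'Hypothetical host\t\thttps://example.org/some-article\n' >> sources.tsv

# Prefer the PMID; fall back to the "Additional" identifier when no PMID is recorded.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print "HostSpecies\tSourceId"; next }
  {
    pmid  = $(col["PMID"])
    extra = $(col["Additional"])
    if (pmid != "" && pmid != "NA") id = "PMID:" pmid   # assumption: NA means missing
    else if (extra != "")           id = extra
    else                            id = "unknown"
    print $(col["HostSpecies"]) "\t" id
  }
' sources.tsv > resolved_sources.tsv

cat resolved_sources.tsv
```

Note that this deliberately ignores "Additional" whenever a PMID exists, so a second PMID stored there (the pathogenicity case mentioned above) would need separate handling.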

The "Reference" column is somewhat of a vestigial information dump that I kept from earlier versions of the database. When I first started compiling the database, there was only a single “Source” column, into which I would enter a short reference (e.g., Taylor et al. 2001) for the Primary Literature through which I identified a pathogen-host association.

I eventually realized that this was a major oversight, as short references are not nearly sufficient to look up and check the source of information accurately. Hence the "PMID" and "Additional" columns were added to the database to ensure that anyone could trace the source information easily. However, I decided to retain the original “Source” column, and renamed it "Reference". For any pathogen-host associations that were added to the database after I started recording PMIDs (or alternative standard identifiers), I would just enter these identifiers instead of adding a short reference to the "Reference" column - which is why many rows have a blank "Reference".

You are also absolutely correct that the strength of evidence is uneven across pathogen-host associations. At one point we considered adding a column that counted the total number of citation hits for a pathogen-host association (e.g., Severe acute respiratory syndrome-related coronavirus AND Rhinolophus pusillus) but ended up abandoning this idea. Unfortunately, we never did zero in on a pragmatic way of capturing this information. We did, however, try to capture the "strength of information" provided in the cited Primary Literature in the "Association" (clear evidence of pathogenicity vs. not mentioned) and "Method" (antibodies vs. PCR) columns.

It would be great to figure something out eventually!

Please let me know if anything else comes up or remains unclear.

jhpoelen commented 3 years ago

@Aehtela thanks for taking the time to explain the history behind the reference, pmid and additional columns.

Thanks for having the foresight to include the "PMID" and "Additional" columns.

I'll have to think a little more about how to make it easier to index your dataset now that I better understand your table schema/design.

@Aehtela @liampshaw Do you mind keeping this issue open for now?

liampshaw commented 3 years ago

@jhpoelen - fine to keep it open, and worth doing so that others are made aware of how this works. It seems unlikely from my end that I'll change anything about the stored dataset in response to the issue, but I think it is definitely an 'open' problem how best to deal with it for future analyses.

liampshaw / Pathogen-host-range

your indexed records for Guan et al. 2003 #4