Open jhpoelen opened 3 years ago
fyi @kephelps @n8upham
Thanks @jhpoelen for flagging this. It's a good question. I believe that where there is a "short reference", it points to another paper that is cited within the paper referred to by the PMID. Here, the PMID paper is the review by Shi & Hu (PMID: 17451830). For example, the first row of the table you include means that Shi & Hu cite Guan et al. 2003 as evidence of SARS-like coronavirus infection of Melogale moschata. (Guan et al. 2003 itself has PMID: 12958366.)
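To make the relationship concrete, here is a minimal sketch that lists which short references appear alongside each PMID. It is not the command used to produce the attached table. It assumes the CSV headers are spelled exactly "Reference" and "PMID", and the sample rows only mirror the example from this thread.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: header names are assumed, and the values mirror
# the PMID 17451830 example discussed in this thread.
SAMPLE = """Reference,PMID
Guan et al. 2003,17451830
Wang et al. 2005,17451830
,17451830
"""

def references_by_pmid(handle):
    """Map each PMID to the sorted short references listed alongside it."""
    grouped = defaultdict(set)
    for row in csv.DictReader(handle):
        pmid = (row.get("PMID") or "").strip()
        if not pmid:
            continue  # rows without a PMID are out of scope for this sketch
        ref = (row.get("Reference") or "").strip() or "[blank]"
        grouped[pmid].add(ref)
    return {pmid: sorted(refs) for pmid, refs in grouped.items()}

result = references_by_pmid(io.StringIO(SAMPLE))
print(result)
```

A PMID that maps to several short references is probably a review citing several primary sources, which is the case described above.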
I appreciate this might not be as helpful as it could be. I'm double-checking this with my colleagues who completed the literature review - I appreciate we should make this clear!
@liampshaw Thanks for taking the time to respond and for explaining that the reference and the PMID are in the same chain of evidence provenance, but not necessarily the same. As far as I understand, your "reference" column refers to primary (or at least more primary) literature that documents evidence of the claim.
I imagine that some review articles cite other review articles that . . . etc. I hope to work with you and your colleagues to figure out a pragmatic way to capture these chains of evidence. I bet some claims cited in numerous papers hinge on a single specimen that was collected in some cave in the 60s. That would be good to know right?
Curious to hear your thoughts.
@jhpoelen - yes that's my understanding. I'll update this with thoughts from my colleagues once I have them.
I think you are right about the "single specimen in a cave" problem. We aimed to be as comprehensive as possible so included such cases from manual literature review, but it does feel uneasy to treat all associations equally.
I agree that working out a realistic and reliable way to represent those chains would be extremely valuable. It would be great if this could become automatic rather than manual. Thinking out loud...: I'm not an expert on citation-tracing but suspect generating a temporal network of citing articles is relatively OK. But then I'm not sure how much it helps in itself, because one would expect all studies of a certain association to cite the earliest study - even if they themselves provide new evidence. There might be a way of quantifying the network diversity that in some way reflects the strength of evidence chain: citation network summary statistics could be computed for all associations and then linked to manual assessment of the evidence strength (e.g. for n=100 associations) to give some heuristics for what 'strong' and 'weak' chains look like in network terms. I haven't paid attention to recent studies that don't use manual literature review so not sure what the state-of-the-art is.
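As a rough illustration of what such summary statistics could look like, the sketch below counts how many papers in an evidence set ultimately trace back to a single "root" source. Everything in it is hypothetical: the function name, the paper labels, and the assumption that each edge means "cites as evidence for this association" (which would itself need manual curation, given the caveat above about papers citing the earliest study while adding new evidence).

```python
def chain_summary(cites):
    """Summarise a citation graph for one pathogen-host association.

    `cites` maps each paper to the papers it cites as evidence, restricted
    to papers inside the same evidence set (an assumption of this sketch).
    """
    papers = set(cites) | {p for cited in cites.values() for p in cited}
    # A root is a paper that cites nothing else in the set for this claim.
    roots = {p for p in papers if not cites.get(p)}

    def roots_reached(paper, seen=None):
        seen = set() if seen is None else seen
        if paper in seen:
            return set()  # already visited; also guards against cycles
        seen.add(paper)
        targets = cites.get(paper, [])
        if not targets:
            return {paper}
        found = set()
        for target in targets:
            found |= roots_reached(target, seen)
        return found

    single_root = sum(1 for p in papers if len(roots_reached(p)) == 1)
    return {
        "papers": len(papers),
        "roots": len(roots),
        "single_root_papers": single_root,
    }

# Hypothetical chain: three reviews that all hinge on one field report.
example = {
    "review_a": ["review_b"],
    "review_b": ["field_report"],
    "review_c": ["field_report"],
}
print(chain_summary(example))
```

A set with many papers but only one root would be the "single specimen in a cave" pattern. Whether such counts track manually assessed evidence strength is exactly the open question.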
Would be interested in your ideas on this!
also cc-ing @arw36 Anna W. who did a ton of work on the HP3 dataset (aka Olival2017).
@jhpoelen
Thanks for bringing up this issue. As @liampshaw has already noted, "PMID"!="Reference", although there are certainly entries for which the primary literature listed in "Reference" DOES correspond to the primary literature listed in "PMID".
As you can see, there are three Primary Literature (pathogen-host association source information) related columns in the database - “Reference”, “PMID” and “Additional”. Of these, the most consistent would be the “PMID” column - almost all pathogen-host associations in the database have a PMID entry and if there isn’t one, then the identifier for a Primary Literature source will be listed in the “Additional” column (these are generally JSTOR links, or doi, etc, for which the source literature cited does not have a PMID). In some cases, the "Additional" column may also have another PMID identifier listed - this is the case for any pathogen-host association in which the primary literature listed in the "PMID" entry only gave evidence of isolation of the pathogen from a host animal but did not have information on the pathogenicity.
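The lookup rule described above could be sketched roughly as follows. This is an illustration, not code from the database pipeline: the function name is made up, the rows are hypothetical placeholders, and treating a purely numeric "Additional" value as a PMID is an assumption about how that column is formatted.

```python
def source_identifiers(row):
    """Return (kind, identifier) pairs that document one association row.

    Takes "PMID" first, then anything listed in "Additional" (a DOI, a
    JSTOR link, or a further PMID).
    """
    ids = []
    pmid = (row.get("PMID") or "").strip()
    if pmid:
        ids.append(("pmid", pmid))
    extra = (row.get("Additional") or "").strip()
    if extra:
        # Assumption: a bare number in "Additional" is another PMID.
        kind = "pmid" if extra.isdigit() else "other"
        ids.append((kind, extra))
    return ids

# Hypothetical rows with placeholder identifiers, not real database entries.
rows = [
    {"Reference": "", "PMID": "00000001", "Additional": ""},
    {"Reference": "", "PMID": "", "Additional": "doi:10.0000/placeholder"},
    {"Reference": "", "PMID": "00000001", "Additional": "00000002"},
]
for row in rows:
    print(source_identifiers(row))
```

The "Reference" column is deliberately ignored here, since a short reference alone is not enough to look up a source.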
The "Reference" column is somewhat of a vestigial information dump that I kept from earlier versions of the database. When I first started compiling the database, there was only a single “Source” column, into which I would enter a short reference (e.g., Taylor et al. 2001) for the Primary Literature through which I identified a pathogen-host association.
I eventually realized that this was a major oversight as short references are not nearly sufficient to look up and check the source of information accurately. Hence the "PMID" and "Additional" columns were added to the database to ensure that anyone could trace the source information easily. However, I decided to retain the original “Source” column, and renamed it as "Reference". For any pathogen-host associations that were added to the database after I started recording PMIDs (or alternative standard identifiers), I would then just enter these rather than a short reference in the "Reference" column - which is why many rows have "Reference" as a blank.
You are also absolutely correct that the strength of evidence for all pathogen-host associations is certainly uneven. At one point we considered adding a column that counted the total number of citation hits for a pathogen-host association (e.g., Severe acute respiratory syndrome-related coronavirus AND Rhinolophus pusillus) but ended up abandoning this idea. Unfortunately, we never did zero in on a pragmatic way of capturing this information. We did, however, try to capture the "strength of information" provided in the Primary Literature cited in the "Association" (clear evidence of pathogenicity vs. not mentioned) and "Method" (antibodies vs. PCR) columns.
It would be great to figure something out eventually!
Please let me know if anything else comes up or remains unclear.
@Aehtela thanks for taking the time to explain the history behind the reference, pmid and additional columns.
Thanks for having the foresight to include the "PMID" and "Additional" columns.
I'll have to think a little more about how to make it easier to index your dataset now that I better understand your table schema/design.
@Aehtela @liampshaw Do you mind keeping this issue open for now?
@jhpoelen - fine to keep it open, and worth doing so for others to be made aware of how this works. It seems unlikely from my end that I'll change anything about the stored dataset in response to the issue, but I think it is definitely an 'open' problem how best to deal with it for future analyses.
Hi @liampshaw -
Thanks for helping to make existing biotic interaction data easier to find and access!
I was just looking at your GloBI indexed record at https://www.globalbioticinteractions.org/?interactionType=interactsWith&sourceTaxon=Rhinolophus%20pusillus&targetTaxon=Coronaviridae and I was wondering how you are handling your PubMed references.
For instance, on 2021-10-09T05:33:05.476Z, I accessed the attached file with hash://sha256/1176267c7468d009a552e38ae9c0c247f723a65b67b68c72f5366af2e882e8cf at https://github.com/liampshaw/Pathogen-host-range/raw/master/data/PathogenVsHostDB-2019-05-30.csv .
I noticed that PubMed reference id 17451830 is associated with the (short) reference "Guan et al. 2003", but also with "Wang et al. 2005" and [blank].
I produced the table below using:
However, when looking up https://pubmed.ncbi.nlm.nih.gov/17451830/ , it appears that the authors of the review article are Zhengli Shi and Zhihong Hu.
Can you please help me understand how the short reference and the PMID are related?
shaw2020Notes17451830.tsv.txt
PathogenVsHostDB-2019-05-30.csv