sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Investigate any extra data that is available via Pubmed API #1668

Open peetucket opened 6 months ago

peetucket commented 6 months ago

Question from Tina in Slack:

I am looking into what information is available to us from PubMed from the import.  
When reviewing the PubMed website, for example https://pubmed.ncbi.nlm.nih.gov/26422724/, 
I can see information such as Cited by, Associated Data, Related Information, and Grants and funding.  
Is any of this information available in the feed we get from PubMed?  Thank you.  
(note:  the PubMed link above is not from a SoM profile, so they are not part of Stanford.)

Investigate what comes back from the Pubmed API, and whether there is a way we can request extra data (such as "Cited by", "Related information", "Funding", etc.). See the web page results view for the fields shown.

peetucket commented 6 months ago

I just took a quick look at what the Pubmed API sends us back for the example record above. I'll attach the full XML response below, but on a quick scan I didn't see the extra data (Cited by, Related pubs, Grants, Funding, etc.). I did see the reference list in a simple text citation format. At the moment, when we parse that XML we only store what is needed for the current data model response we send back to Profiles; the other parts of the XML are ignored.

rec = Pubmed::Client.new.fetch_records_for_pmid_list('26422724')
puts rec

I briefly looked at the Pubmed API documentation (https://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Searching_a_Database) and didn't see any obvious extra params we could send to increase the amount of data returned. This could use more investigation, so I suggest a bit of further reading of the docs.

pubmed_26422724.xml.zip

edsu commented 6 months ago

The Cited By results in https://pubmed.ncbi.nlm.nih.gov/26422724/ appear to all be citations from other PubMed articles. It looks like there is a separate API endpoint (elink) for those. It seems to return only IDs, so getting the metadata would require another lookup by ID:

$ curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&linkname=pubmed_pmc_refs&id=26422724" | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD elink 20101123//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20101123/elink.dtd">
<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList>
      <Id>26422724</Id>
    </IdList>
    <LinkSetDb>
      <DbTo>pmc</DbTo>
      <LinkName>pubmed_pmc_refs</LinkName>
      <Link>
        <Id>10676511</Id>
      </Link>
      <Link>
        <Id>10650975</Id>
      </Link>
      <Link>
        <Id>10439485</Id>
      </Link>
      <Link>
        <Id>10275576</Id>
      </Link>
      <Link>
        <Id>10118745</Id>
      </Link>
      <Link>
        <Id>10106992</Id>
      </Link>
      <Link>
        <Id>10080461</Id>
      </Link>
      ...
    </LinkSetDb>
  </LinkSet>
</eLinkResult>
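
If we wanted to do the same lookup from Ruby, a rough sketch might look like the following. It only builds the URL used in the curl command above and pulls the linked IDs out of a response body; `cited_by_url` and `linked_ids` are made-up helper names, not anything in sul_pub, and REXML stands in for whatever XML parser the app actually uses. Since the response's `DbTo` is `pmc`, the IDs are presumably PMC identifiers rather than PMIDs.

```ruby
require 'uri'
require 'rexml/document'

EUTILS_ELINK = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi'

# Hypothetical helper: build the same elink URL as the curl command above.
def cited_by_url(pmid)
  query = URI.encode_www_form(dbfrom: 'pubmed', linkname: 'pubmed_pmc_refs', id: pmid)
  "#{EUTILS_ELINK}?#{query}"
end

# Hypothetical helper: collect the <Id> values under LinkSetDb/Link,
# skipping the source ID that is echoed back in <IdList>.
def linked_ids(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//LinkSetDb/Link/Id').map(&:text)
end

# Trimmed copy of the response shown above, so no network call is needed here.
sample_elink = <<~XML
  <eLinkResult>
    <LinkSet>
      <DbFrom>pubmed</DbFrom>
      <IdList><Id>26422724</Id></IdList>
      <LinkSetDb>
        <DbTo>pmc</DbTo>
        <LinkName>pubmed_pmc_refs</LinkName>
        <Link><Id>10676511</Id></Link>
        <Link><Id>10650975</Id></Link>
      </LinkSetDb>
    </LinkSet>
  </eLinkResult>
XML

puts cited_by_url('26422724')
p linked_ids(sample_elink)
```

Each returned ID would still need its own metadata lookup, so this would add at least one more request per publication.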

For the Associated Data, it looks like the item that was mentioned is in the XML we already get, but it would require some kind of lookup against the external databank to get the details:

<DataBankList CompleteYN="Y">
  <DataBank>
    <DataBankName>ClinicalTrials.gov</DataBankName>
    <AccessionNumberList>
      <AccessionNumber>NCT01681875</AccessionNumber>
    </AccessionNumberList>
  </DataBank>
  ...
</DataBankList>
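
As a sketch of how the accession numbers could be pulled out of that stanza, assuming REXML and a made-up helper name `data_banks` (sul_pub's real parser may differ):

```ruby
require 'rexml/document'

# Hypothetical helper: map each DataBankName to its list of accession numbers.
def data_banks(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//DataBankList/DataBank').to_h do |bank|
    name = bank.elements['DataBankName']&.text
    accessions = REXML::XPath.match(bank, 'AccessionNumberList/AccessionNumber').map(&:text)
    [name, accessions]
  end
end

# Only the one databank visible in the stanza above; the elided ones are left out.
sample_banks = <<~XML
  <DataBankList CompleteYN="Y">
    <DataBank>
      <DataBankName>ClinicalTrials.gov</DataBankName>
      <AccessionNumberList>
        <AccessionNumber>NCT01681875</AccessionNumber>
      </AccessionNumberList>
    </DataBank>
  </DataBankList>
XML

p data_banks(sample_banks)
```

This only gets the identifiers. Turning `NCT01681875` into something displayable would still mean querying the databank itself.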

The grants are in another XML stanza:

<GrantList CompleteYN="Y">
  <Grant>
    <GrantID>P30 CA077598</GrantID>
    <Acronym>CA</Acronym>
    <Agency>NCI NIH HHS</Agency>
    <Country>United States</Country>
  </Grant>
  <Grant>
    <GrantID>P30 ES013508</GrantID>
    <Acronym>ES</Acronym>
    <Agency>NIEHS NIH HHS</Agency>
    <Country>United States</Country>
  </Grant>
  <Grant>
    <GrantID>U54 DA031659</GrantID>
    <Acronym>DA</Acronym>
    <Agency>NIDA NIH HHS</Agency>
    <Country>United States</Country>
  </Grant>
</GrantList>
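
Since the grants are already in the XML we fetch, extracting them would not need any extra API call. A minimal sketch, again assuming REXML and a hypothetical `grants_from` helper that is not part of the current data model:

```ruby
require 'rexml/document'

# Hypothetical helper: turn each <Grant> element into a plain hash.
def grants_from(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//GrantList/Grant').map do |grant|
    {
      id: grant.elements['GrantID']&.text,
      acronym: grant.elements['Acronym']&.text,
      agency: grant.elements['Agency']&.text,
      country: grant.elements['Country']&.text
    }
  end
end

# Two of the grants from the stanza above.
sample_grants = <<~XML
  <GrantList CompleteYN="Y">
    <Grant>
      <GrantID>P30 CA077598</GrantID>
      <Acronym>CA</Acronym>
      <Agency>NCI NIH HHS</Agency>
      <Country>United States</Country>
    </Grant>
    <Grant>
      <GrantID>P30 ES013508</GrantID>
      <Acronym>ES</Acronym>
      <Agency>NIEHS NIH HHS</Agency>
      <Country>United States</Country>
    </Grant>
  </GrantList>
XML

grants_from(sample_grants).each { |g| puts "#{g[:id]} (#{g[:agency]})" }
```

The safe navigation (`&.text`) is there because I haven't checked whether every child element is always present in a `Grant`.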

The Related Information appears to use the article ID to link out to various services?

Perhaps some of these have APIs that could be queried if they have valuable information.

Hopefully this helps a bit?