jaeddy closed 2 years ago
I like this!! It's so fast too.
What sort of features were you thinking?
Good question. I need to take a closer look at what the current PubMed crawler code is doing (and how the results are getting stored). At first glance, it seems potentially useful to grab and cross-reference "disease" annotations from PubTator, especially in cases where the disease isn't obvious from the abstract alone. Capturing "gene" references might also be interesting, though we don't currently have any relevant features to present that info in the portal.
Before proceeding, I wanted to review what the current tools will offer.
Data to Collect | Web scraping/Entrez (current) | PubTator API | Europe PMC Articles API |
---|---|---|---|
Grant number | ✓ | ✓ | |
DOI | ✓ | ✓ | ✓ |
Journal | ✓ | ✓ | ✓ |
PMID | ✓ | ✓ | ✓ |
Title | ✓ | ✓ | ✓ |
Pub. year | ✓ | ✓ | ✓ |
Keywords | ✓ | ✓ | |
Authors | ✓ | ✓ | ✓ |
Abstract | ✓ | ✓ | ✓ |
Datasets | ✓ | | |
MeSH Terms | ✓ | ✓ | ✓ |
Is open-access? | ✓ |
Given this comparison, I think I will move forward with Europe PMC's API, in conjunction with Entrez to retrieve the related datasets. With web scraping, if the article page is long enough that it needs JavaScript to load fully, then using `requests` alone will not be enough. This results in incomplete rows in the manifest, e.g. missing grant information, since grants appear at the bottom of the page.
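For anyone following along, here is a rough sketch of what pulling one record from the Europe PMC search endpoint and flattening it into a manifest row might look like. The function names (`fetch_record`, `to_manifest_row`) and the row keys are made up for illustration and are not from the existing crawler. The JSON field names (`authorString`, `journalInfo`, `meshHeadingList`, `isOpenAccess`, etc.) are what I'd expect from a `resultType=core` response, but they should be checked against a real response before relying on them.

```python
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def fetch_record(pmid, timeout=30):
    """Query Europe PMC for a single PMID; return the first hit or None."""
    # Imported here so the parsing helper below works without the dependency.
    import requests

    params = {
        "query": f"EXT_ID:{pmid} AND SRC:MED",  # restrict to the PubMed/MEDLINE source
        "resultType": "core",                   # "core" asks for the fuller metadata record
        "format": "json",
    }
    resp = requests.get(EUROPE_PMC_SEARCH, params=params, timeout=timeout)
    resp.raise_for_status()
    results = resp.json().get("resultList", {}).get("result", [])
    return results[0] if results else None


def to_manifest_row(record):
    """Flatten one Europe PMC result into a manifest-style dict.

    Field names are assumptions about the "core" response; missing
    fields come back as None (or an empty list for MeSH terms).
    """
    journal = record.get("journalInfo", {}).get("journal", {})
    mesh = record.get("meshHeadingList", {}).get("meshHeading", [])
    return {
        "pmid": record.get("pmid"),
        "doi": record.get("doi"),
        "title": record.get("title"),
        "journal": journal.get("title"),
        "year": record.get("pubYear"),
        "authors": record.get("authorString"),
        "abstract": record.get("abstractText"),
        "mesh_terms": [m.get("descriptorName") for m in mesh],
        "is_open_access": record.get("isOpenAccess") == "Y",
    }
```

Keeping the network call separate from the flattening step means the row logic can be tested against saved responses, and the dataset lookup via Entrez can be merged into the row afterwards.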
Closing this ticket now.
@bswhite, @Tumpsh, @vpchung — I saw "PubTator" presented at a conference last year, but it seemed a bit... janky. However, there's apparently a newer version called PubTator Central that looks more polished and fully featured. In addition to providing a REST API, the tool also now provides annotations based on the full article text (for anything that's in PMC).
Here's an example output for a random PMID from CSBC/PS-ON:
Anyway, might be worth thinking about how to integrate these results with the current PubMed data we analyze.
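As a starting point for that integration, here is a minimal sketch of collecting disease and gene annotations from a PubTator Central BioC JSON document. The helper names are hypothetical. The export URL is the one I understand PubTator Central to have exposed, and it may have moved since. I'm also assuming the BioC JSON layout is `passages` → `annotations` → `infons` with `type` and `identifier` keys, which should be confirmed against a real response.

```python
from collections import defaultdict

# Assumed PubTator Central export endpoint; verify before use.
PUBTATOR_EXPORT = (
    "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"
)


def fetch_biocjson(pmid, timeout=30):
    """Download the BioC JSON document for one PMID."""
    # Imported here so the parsing helper below works without the dependency.
    import requests

    resp = requests.get(PUBTATOR_EXPORT, params={"pmids": pmid}, timeout=timeout)
    resp.raise_for_status()
    # Assumes a single JSON object comes back for a single PMID.
    return resp.json()


def collect_annotations(document, wanted=("Disease", "Gene")):
    """Return {annotation type: sorted unique (identifier, text) pairs}."""
    found = defaultdict(set)
    for passage in document.get("passages", []):
        for ann in passage.get("annotations", []):
            infons = ann.get("infons", {})
            ann_type = infons.get("type")
            if ann_type in wanted:
                # Fall back to "" so the pairs stay sortable if a field is missing.
                identifier = infons.get("identifier") or ""
                found[ann_type].add((identifier, ann.get("text") or ""))
    return {ann_type: sorted(pairs) for ann_type, pairs in found.items()}
```

The deduplicated identifiers (e.g. MeSH IDs for diseases) could then be joined to the existing PubMed rows by PMID.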