Check out PubTator Central for augmenting crawler features

jaeddy commented 4 years ago

@bswhite, @Tumpsh, @vpchung: I saw "PubTator" presented at a conference last year, but it seemed a bit... janky. However, there's apparently a newer version called PubTator Central that looks more polished and fully featured. In addition to providing a REST API, the tool now also provides annotations based on the full article text (for anything that's in PMC).

Here's an example output for a random PMID from CSBC/PS-ON:

Anyway, might be worth thinking about how to integrate these results with the current PubMed data we analyze.
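
As a rough starting point for the integration, here is a minimal sketch of pulling annotations for one PMID and grouping them by type. The endpoint URL, the `pmids` parameter, and the BioC-JSON field names (`passages`, `annotations`, `infons`, `type`, `identifier`) are assumptions about the PubTator Central export and should be checked against the current API docs. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed PubTator Central BioC-JSON export endpoint; verify before relying on it.
PUBTATOR_EXPORT_URL = (
    "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"
)


def build_export_url(pmids):
    """Build the export URL for one or more PMIDs (comma-separated)."""
    query = urlencode({"pmids": ",".join(str(p) for p in pmids)})
    return f"{PUBTATOR_EXPORT_URL}?{query}"


def fetch_biocjson(pmid, timeout=30):
    """Fetch annotations for a single PMID.

    Assumes the response body for one PMID is a single JSON document.
    """
    with urlopen(build_export_url([pmid]), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect_annotations(doc, wanted_types=("Disease", "Gene")):
    """Group unique (identifier, text) pairs by annotation type."""
    found = {t: set() for t in wanted_types}
    for passage in doc.get("passages", []):
        for ann in passage.get("annotations", []):
            infons = ann.get("infons", {})
            ann_type = infons.get("type")
            if ann_type in found:
                found[ann_type].add((infons.get("identifier"), ann.get("text")))
    # Sort on string forms so a missing identifier (None) does not break ordering.
    return {
        t: sorted(pairs, key=lambda p: (str(p[0]), str(p[1])))
        for t, pairs in found.items()
    }
```

Something like `collect_annotations(fetch_biocjson(pmid))` would then give the crawler a small dict per publication that could be joined onto the existing PubMed rows by PMID.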

vpchung commented 4 years ago

I like this!! It's so fast too.

What sort of features were you thinking?

jaeddy commented 4 years ago

Good question. I need to take a closer look at what the current PubMed crawler code is doing (and how the results are getting stored). At first glance, it seems potentially useful to grab and cross-reference "disease" annotations from PubTator, especially in cases where the disease isn't obvious from the abstract alone. Capturing "gene" references might also be interesting, though we don't currently have any relevant features to present that info in the portal.
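
To make the cross-referencing idea a bit more concrete, here is a small sketch that flags disease mentions found by an annotator but absent from the abstract text. It assumes the disease mentions have already been extracted as plain strings. The function name and the simple whole-phrase match are illustrative choices, not what the crawler currently does.

```python
import re


def diseases_not_in_abstract(abstract, disease_mentions):
    """Return annotated disease mentions that never appear in the abstract.

    Matching is case-insensitive and on whole phrases; duplicates are
    dropped while preserving first-seen order.
    """
    abstract_lower = abstract.lower()
    missing = []
    seen = set()
    for mention in disease_mentions:
        key = mention.lower()
        if key in seen:
            continue
        seen.add(key)
        pattern = r"\b" + re.escape(key) + r"\b"
        if not re.search(pattern, abstract_lower):
            missing.append(mention)
    return missing


if __name__ == "__main__":
    abstract = "We profiled tumors from patients with breast cancer."
    mentions = ["Breast cancer", "glioblastoma", "Glioblastoma"]
    print(diseases_not_in_abstract(abstract, mentions))
```

Anything this returns would be a candidate "disease" value that only the full-text annotations surface, which is the case where the extra lookup adds the most.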

vpchung commented 2 years ago

Before proceeding, I wanted to review what each of the available tools offers.

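| Data to Collect | Web scraping/Entrez (current) | PubTator API | Europe PMC Articles API |
| --- | :---: | :---: | :---: |
| Grant number | ✓ | | ✓ |
| DOI | ✓ | ✓ | ✓ |
| Journal | ✓ | ✓ | ✓ |
| PMID | ✓ | ✓ | ✓ |
| Title | ✓ | ✓ | ✓ |
| Pub. year | ✓ | ✓ | ✓ |
| Keywords | ✓ | | ✓ |
| Authors | ✓ | ✓ | ✓ |
| Abstract | ✓ | ✓ | ✓ |
| Datasets | ✓ | | |
| MESH Terms | ✓ | ✓ | ✓ |
| Is open-access? | | | ✓ |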

Given this information, I think I will move forward with the Europe PMC API, in conjunction with Entrez to retrieve the related datasets. With web scraping, if the article page is long enough that it requires JavaScript to load in full, then `requests` alone will not be enough. This results in incomplete rows in the manifest, e.g. missing grant information, since grants appear at the bottom of the page.
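
For reference, a minimal sketch of how one Europe PMC search result might be flattened into a manifest row. The search URL, the `GRANT_ID` query field, `resultType=core`, and the JSON field names used below are assumptions about the Europe PMC REST API and should be confirmed against its documentation. The manifest column names and function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed Europe PMC REST search endpoint; verify before relying on it.
EUROPE_PMC_SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(grant_number, page_size=100):
    """Build a search URL for publications that acknowledge a grant."""
    params = {
        "query": f'GRANT_ID:"{grant_number}"',
        "format": "json",
        "resultType": "core",  # assumed to return full metadata
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH_URL}?{urlencode(params)}"


def to_manifest_row(result):
    """Flatten one search result (a dict) into a manifest row.

    Missing fields become empty strings or empty lists, so a sparse
    record still yields a complete row.
    """
    journal = result.get("journalInfo", {}).get("journal", {}).get("title", "")
    grants = result.get("grantsList", {}).get("grant", [])
    mesh = result.get("meshHeadingList", {}).get("meshHeading", [])
    return {
        "pmid": result.get("pmid", ""),
        "doi": result.get("doi", ""),
        "title": result.get("title", ""),
        "journal": journal,
        "year": result.get("pubYear", ""),
        "authors": result.get("authorString", ""),
        "abstract": result.get("abstractText", ""),
        "keywords": result.get("keywordList", {}).get("keyword", []),
        "mesh_terms": [m.get("descriptorName", "") for m in mesh],
        "grants": [g.get("grantId", "") for g in grants],
        "is_open_access": result.get("isOpenAccess") == "Y",
    }
```

Datasets are not covered here, so the Entrez lookup would still run separately and be joined on PMID.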

vpchung commented 2 years ago

Closing this ticket now.

mc2-center / pubmed-crawler

Check out PubTator Central for augmenting crawler features #3