mc2-center / pubmed-crawler

PubMed Crawler for CCKP publication manifest
Creative Commons Zero v1.0 Universal

Check out PubTator Commons for augmenting crawler features #3

Closed jaeddy closed 2 years ago

jaeddy commented 4 years ago

@bswhite, @Tumpsh, @vpchung — I saw "PubTator" presented at a conference last year, but it seemed a bit... janky. However, there's apparently a newer version called PubTator Central that looks more polished and fully featured. In addition to providing a REST API, the tool also now provides annotations based on the full article text (for anything that's in PMC).

Here's an example output for a random PMID from CSBC/PS-ON: [image]

Anyway, might be worth thinking about how to integrate these results with the current PubMed data we analyze.

vpchung commented 4 years ago

I like this!! It's so fast too.

What sort of features were you thinking?

jaeddy commented 4 years ago

Good question. I need to take a closer look at what the current PubMed crawler code is doing (and how the results are getting stored). At first glance, it seems potentially useful to grab and cross-reference "disease" annotations from PubTator, especially in cases where the disease isn't obvious from the abstract alone. Capturing "gene" references might also be interesting, though we don't currently have any relevant features to present that info in the portal.
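To make the idea concrete, here is a minimal sketch of pulling "Disease" and "Gene" mentions out of a single annotated document. It assumes the annotations arrive as BioC-style JSON (a list of `passages`, each holding `annotations` with a `type` and `identifier` under `infons`); that layout should be checked against the actual API output. The `SAMPLE_DOC` record and the `extract_annotations` helper are hypothetical and hand-written for illustration, not real PubTator output or existing crawler code.

```python
from collections import defaultdict

# Hand-written stand-in for one annotated document (assumed BioC-style JSON layout).
SAMPLE_DOC = {
    "id": "00000000",
    "passages": [
        {
            "infons": {"type": "title"},
            "annotations": [
                {"text": "breast cancer",
                 "infons": {"type": "Disease", "identifier": "MESH:D001943"}},
            ],
        },
        {
            "infons": {"type": "abstract"},
            "annotations": [
                {"text": "BRCA1",
                 "infons": {"type": "Gene", "identifier": "672"}},
                {"text": "breast cancer",
                 "infons": {"type": "Disease", "identifier": "MESH:D001943"}},
                {"text": "tamoxifen", "infons": {"type": "Chemical"}},
            ],
        },
    ],
}


def extract_annotations(doc, wanted_types=("Disease", "Gene")):
    """Return unique (identifier, text) pairs per annotation type for one document."""
    found = defaultdict(set)
    for passage in doc.get("passages", []):
        for ann in passage.get("annotations", []):
            infons = ann.get("infons", {})
            ann_type = infons.get("type")
            if ann_type in wanted_types:
                # A set drops repeats of the same mention across passages.
                found[ann_type].add((infons.get("identifier"), ann.get("text")))
    # Sort for stable output; missing values sort as empty strings.
    return {
        ann_type: sorted(pairs, key=lambda p: (p[0] or "", p[1] or ""))
        for ann_type, pairs in found.items()
    }


annotations = extract_annotations(SAMPLE_DOC)
print(annotations)
```

The per-type lists could then be joined onto the existing PubMed rows by PMID, with the disease identifiers used for cross-referencing and the gene ones kept aside until the portal has somewhere to show them.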

vpchung commented 2 years ago

Before proceeding, I wanted to review what the current tools will offer.

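| Data to Collect | Web scraping/Entrez (current) | PubTator API | Europe PMC Articles API |
| --- | --- | --- | --- |
| Grant number | | | |
| DOI | | | |
| Journal | | | |
| PMID | | | |
| Title | | | |
| Pub. year | | | |
| Keywords | | | |
| Authors | | | |
| Abstract | | | |
| Datasets | | | |
| MESH Terms | | | |
| Is open-access? | | | |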

Given this information, I think I will move forward with utilizing PMC's API, in conjunction with Entrez to retrieve the related datasets. With web scraping, if the article page is too long (and therefore requires JavaScript to load the entire page), then using requests alone will not be enough. This will result in incomplete rows in the manifest, e.g. missing grant information, since grants appear at the bottom of the page.
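As a rough sketch of that direction, the snippet below maps one Europe PMC search result onto a manifest row. It assumes the Europe PMC REST search endpoint with `resultType=core` and the field names shown (`grantsList`, `meshHeadingList`, `isOpenAccess`, and so on); these should be verified against the live API before relying on them. `SAMPLE_RECORD`, `build_search_url`, and `to_manifest_row` are hypothetical names with made-up values, and the datasets column is left out because it would still come from Entrez.

```python
from urllib.parse import urlencode

# Assumed Europe PMC REST search endpoint; confirm against the official docs.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(pmid):
    """Build a search URL for a single PMID, requesting the fuller 'core' result."""
    params = {
        "query": f"EXT_ID:{pmid} AND SRC:MED",
        "format": "json",
        "resultType": "core",
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def to_manifest_row(record):
    """Flatten one search result into the manifest columns; absent fields become None or []."""
    grants = record.get("grantsList", {}).get("grant", [])
    mesh = record.get("meshHeadingList", {}).get("meshHeading", [])
    return {
        "pmid": record.get("pmid"),
        "doi": record.get("doi"),
        "title": record.get("title"),
        "journal": record.get("journalInfo", {}).get("journal", {}).get("title"),
        "pub_year": record.get("pubYear"),
        "authors": record.get("authorString"),
        "abstract": record.get("abstractText"),
        "keywords": record.get("keywordList", {}).get("keyword", []),
        "grant_numbers": [g["grantId"] for g in grants if g.get("grantId")],
        "mesh_terms": [m["descriptorName"] for m in mesh if m.get("descriptorName")],
        "is_open_access": record.get("isOpenAccess") == "Y",
    }


# Made-up record shaped like the assumed response, for illustration only.
SAMPLE_RECORD = {
    "pmid": "00000000",
    "doi": "10.0000/example",
    "title": "An example title",
    "authorString": "Doe J, Roe R.",
    "journalInfo": {"journal": {"title": "Example Journal"}},
    "pubYear": "2020",
    "isOpenAccess": "Y",
    "grantsList": {"grant": [{"grantId": "U00 XX000000", "agency": "Example Agency"}]},
    "meshHeadingList": {"meshHeading": [{"descriptorName": "Neoplasms"}]},
}

row = to_manifest_row(SAMPLE_RECORD)
print(build_search_url("00000000"))
print(row)
```

Because the grant list arrives in the JSON itself, a row no longer depends on how much of the article page happens to load, which is the failure mode with plain `requests` scraping.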

vpchung commented 2 years ago

Closing this ticket now.