jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

New PMC data format (baseline + increments) #6

jakelever closed this issue 1 year ago

jakelever commented 2 years ago

As noted in #5, there is a new PMC bulk download format described at https://www.ncbi.nlm.nih.gov/pmc/about/new-in-pmc/#2021-09-21. We'll need to make a few adjustments to how PMC data is handled. Our code already implements its own baseline + increments system, so the change should mostly mean removing code that is no longer needed. The relevant part of the announcement:

PubMed Central (PMC) has made significant improvements to the bulk retrieval of two of the PMC Article Datasets from our FTP service. The improvements were made to bulk packages which include metadata and full text files of articles in XML or plain text formats for the PMC Open Access (OA) Subset and the Author Manuscript Dataset, which combined encompass more than half of the 7 million articles in PMC. To improve the usability of these two datasets, PMC has redesigned the bulk download directory structure and file packages on our FTP service. The new structure includes:

  • baseline packages that contain all articles available in PMC as of the baseline date for each respective dataset or grouping; and
  • daily incremental packages for each respective dataset or grouping that only contain articles that are new to the dataset or that have been updated since the baseline or previous incremental file was created.
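
To make the baseline + increments idea concrete, here is a minimal sketch of how a consumer might overlay daily increments on a baseline. It is not the code in this repository and not PMC's own tooling. The `Package` type, the `apply_increments` function, and the article IDs are hypothetical names chosen for illustration. It also assumes that increments dated before the baseline can be skipped, and it does not handle articles removed from a dataset.

```python
from datetime import date
from typing import Dict, Iterable, Tuple

# Hypothetical representation: (release date, {article ID: article text}).
Package = Tuple[date, Dict[str, str]]


def apply_increments(baseline: Package, increments: Iterable[Package]) -> Dict[str, str]:
    """Overlay increments on a baseline, oldest first, so the newest version wins."""
    baseline_date, baseline_articles = baseline
    merged = dict(baseline_articles)  # copy so the baseline is left untouched
    for inc_date, inc_articles in sorted(increments, key=lambda pkg: pkg[0]):
        if inc_date < baseline_date:
            # Assumption: anything older than the baseline is already included in it.
            continue
        merged.update(inc_articles)  # new articles are added, updated ones replaced
    return merged


# Illustrative data only; increments are deliberately given out of order.
baseline = (date(2022, 1, 1), {"PMC1": "v1", "PMC2": "v1"})
increments = [
    (date(2022, 1, 3), {"PMC2": "v3"}),
    (date(2022, 1, 2), {"PMC2": "v2", "PMC3": "v1"}),
]
merged = apply_increments(baseline, increments)
print(merged)
```

Sorting by date before applying matters here: an article updated in several increments ends up with the most recent version regardless of the order in which the packages were listed.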

jakelever commented 1 year ago

This has been dealt with.