sciencefair-land / datasource-requests

Repo for requesting ScienceFair datasources, and discussing how to create them
https://github.com/codeforscience/sciencefair

PubMed Central #1

Open blahah opened 7 years ago

blahah commented 7 years ago

PubMed Central provides JATS XML for 4.4 million articles. Not all of these are Open Access, so we will start with the ones that are definitely OK to distribute.

I propose to create a live-updating PMC datasource.

It will also be broken down into one datasource per journal. This will be achieved by generating a separate meta feed for each journal, but sharing the articles feed for PMC between all the sources.

There are 2017 fully participating journals, 4299 selectively depositing journals, and 414 journals that deposit all NIH-funded works. Here's a list of journals: https://www.ncbi.nlm.nih.gov/pmc/journals/.

Methodology (main source)

To create and maintain the PMC main feed:

Methodology (sub-sources)

To allow creation of sub-feeds:

This setup should ideally have a web interface for finding sub-feeds, or creating them on the fly.
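
As a rough sketch of how the sub-feed sharing could look, each journal-level source would get its own metadata feed but point at the one shared articles feed (the structure and field names below are placeholders, not the actual sciencefair datasource schema):

```js
// Placeholder illustration of per-journal sub-sources sharing one articles feed.
// Keys and field names are hypothetical, not the real sciencefair schema.
const sharedArticlesFeed = '<dat key of the single PMC articles feed>'

const pmcMainSource = {
  name: 'PubMed Central (all OA)',
  metaFeed: '<dat key of the PMC-wide metadata feed>',
  articleFeed: sharedArticlesFeed
}

const elifeSubSource = {
  name: 'PubMed Central: eLife',
  metaFeed: '<dat key of an eLife-only metadata feed>',
  articleFeed: sharedArticlesFeed
}
```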

todrobbins commented 7 years ago

❤️❤️❤️

SirRujak commented 5 years ago

I have recently been looking into how this one might be feasible, and I think I have come up with an idea. Searching through the PubMed Central site I finally ran across this listing. According to that, the full JATS XML directory is currently sitting at 7.92 Tb (terabits), or just under 1 TB, which is pretty reasonable for a single individual to keep a full copy of for backup currently. The problem is that extracting it in a similar way to how the eLife papers are extracted would considerably increase the storage size (possibly up to 3 TB or more), which would be considerably more difficult to deal with. I therefore propose that this database be kept in its compressed form in the dat, and that extra functionality be added to sciencefair for decompressing and extracting the data into memory on the fly.

I have been looking into this somewhat and it appears that the two main steps for this to happen are:

Decompressing the gz stream (https://nodejs.org/api/zlib.html#zlib_class_zlib_gunzip)

Extracting the components from the tar (https://www.npmjs.com/package/tar#class-tarparse)

The links above point to the methods I have found that might work for this purpose. I am also considering exactly how to go about creating the listing for this dataset. I know @blahah was trying to create a hyperdb keystore for the eLife database, but I am not entirely familiar with how one would go about that yet, so any help would be appreciated!
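
For reference, here is a minimal Node sketch of those two steps: stream a package, gunzip it, and read each tar entry into memory rather than writing it to disk (the filename is made up, and buffering whole entries in memory is just for illustration):

```js
// Sketch: stream a .tar.gz package, gunzip it, and read each tar entry
// into memory without writing the extracted files to disk.
// The filename is hypothetical.
const fs = require('fs')
const zlib = require('zlib')
const tar = require('tar')

const extracted = {} // entry path -> Buffer of file contents

fs.createReadStream('PMC1234567.tar.gz')
  .pipe(zlib.createGunzip())     // step 1: decompress the gz stream
  .pipe(new tar.Parse())         // step 2: walk the entries in the tar
  .on('entry', entry => {
    const chunks = []
    entry.on('data', chunk => chunks.push(chunk))
    entry.on('end', () => {
      extracted[entry.path] = Buffer.concat(chunks)
    })
  })
  .on('end', () => {
    console.log('extracted entries:', Object.keys(extracted))
  })
```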

At the moment I am planning on just trying to get the entire database downloaded along with its metadata and a simple CSV file that lists all of the article PMIDs along with the paths to their tar.gz files and the metadata files provided by PubMed Central. I believe we will need to reformat their metadata JSON files to conform to the sciencefair ones, but I am thinking it may be better to keep dats of both the original and the sciencefair versions in case the sciencefair format ever changes.
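
Roughly, the kind of listing file I have in mind looks like this (a sketch only; the column names, paths, and the example row are made up for illustration, not PMC's actual file-list format):

```js
// Sketch of the simple listing file: one row per article, mapping a PMID to
// the path of its .tar.gz package and its metadata file. Column names, paths
// and the example row are hypothetical.
const fs = require('fs')

const rows = [
  { pmid: '12345678', archive: 'oa_package/PMC1234567.tar.gz', meta: 'metadata/PMC1234567.json' }
]

const csv = ['pmid,archive_path,metadata_path']
  .concat(rows.map(r => [r.pmid, r.archive, r.meta].join(',')))
  .join('\n')

fs.writeFileSync('pmc-listing.csv', csv + '\n')
```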

As I mentioned in this issue thread, I am using golang and a slightly different method for retrieving the article listing, but I am getting pretty close to the point where I could actually try to download and begin sharing the dataset. I hope to have the code in a state to post in a day or two, but I would be happy to share what I currently have with anyone who wants it, to make sure it doesn't disappear!

SirRujak commented 5 years ago

I have made some progress! You can find what I have done here. It is largely untested, and there are a couple of known issues: it does not cope with DNS lookup failures, and it does not currently keep track of its location in the download. Also, it is only downloading one set right now so that I don't spam their system while testing. If anyone is interested in taking a look at it I would appreciate it, and hopefully in the next few days I can get most of those issues ironed out and actually start downloading the dataset.

As a side note, I did have to make some changes to the metadata format, so I will eventually have to make pull requests here to get those working. Roughly, I just had to change things so that sciencefair can use the compressed articles and a different path naming convention, but these should be very minor changes that only take effect when a datasource specifies that those extra capabilities are needed.

bencevans commented 5 years ago

Great work @SirRujak :tada:

SirRujak commented 5 years ago

Thanks! I have made a decent amount of progress on the dataset. I am now able to start importing the metadata for my test set and get to a 22% sync rate before hitting an error. I haven't been able to trace it all the way back to the source yet, but I am guessing it may be that some of the articles don't have a DOI in their database.

message: "key cannot be an empty String"
name: "WriteError"
stack:
    WriteError: key cannot be an empty String
        at /snap/sirrujak-sciencefair/3/resources/app.asar/node_modules/levelup/lib/levelup.js:233:34
        at LevelDOWN.AbstractLevelDOWN.put (/snap/sirrujak-sciencefair/3/resources/app.asar/node_modules/level/node_modules/abstract-leveldown/abstract-leveldown.js:108:12)
        at LevelUP.put (/snap/sirrujak-sciencefair/3/resources/app.asar/node_modules/levelup/lib/levelup.js:231:11)
        at drain (/snap/sirrujak-sciencefair/3/resources/app.asar/node_modules/level-write-stream/index.js:28:20)
        at _combinedTickCallback (internal/process/next_tick.js:131:7)
        at process._tickCallback (internal/process/next_tick.js:180:9)
type: "WriteError"

That is the specific error given; it appears to occur when the system attempts to put a new key into the leveldb and that key happens to be an empty string. I'm sure I can find it before too long, but if anyone knows for sure whether it is an issue with missing DOI info or something else, that could definitely save me some time!

Edit: It does appear that it was articles lacking a DOI that were causing the error. I will eventually have to find a way to collect at least that bit of information for articles that are missing it.
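
The kind of guard I have in mind for the import is roughly this (a sketch only; the field names and the fallback choice are assumptions, not the actual sciencefair/PMC metadata schema):

```js
// Sketch: avoid the "key cannot be an empty String" WriteError by falling
// back to the PMC ID when an article has no DOI, and skipping the record
// entirely if neither identifier exists. Field names are assumptions.
function putArticle (db, meta, cb) {
  const key = meta.doi || meta.pmcid
  if (!key) return cb(null, { skipped: true })
  db.put(key, JSON.stringify(meta), err => cb(err, { skipped: false }))
}
```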

I have now run into a new issue, though. I'm not sure whether it is just due to all the test runs I have had to do, but I'll see if I can get it to work. On the positive side, the last test run I did at least made it to 34%, so we do appear to be making progress!

Uncaught TypeError: self.author.map is not a function
    at Paper.self.metadata (/snap/sirrujak-sciencefair/3/resources/app.asar/client/lib/paper.js:133:30)

SirRujak commented 5 years ago

Given the issues with getting functional builds going again, I haven't been able to work on this much. Hopefully my most recent work on getting sciencefair working will let me get back to this particular project. As an update, I did find the cause of the issue from my last comment: it was unknown author names being stored as a string rather than the object that is expected later in the script. The change to fix it may cause some other areas not to work quite as expected, but there should be no breaking issues.
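
The normalisation involved is roughly along these lines (a sketch only; the property names are assumptions rather than the exact shape sciencefair's Paper.metadata expects):

```js
// Sketch: coerce an author field that may arrive as a string (or be missing)
// into the array-of-objects shape expected before .map is called on it.
// The property names here are assumptions, not the exact sciencefair schema.
function normaliseAuthors (author) {
  if (Array.isArray(author)) return author
  if (typeof author === 'string' && author.length > 0) {
    return [{ surname: author, 'given-names': '' }]
  }
  return [{ surname: 'Unknown', 'given-names': '' }]
}
```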

blahah commented 5 years ago

👋 hey @SirRujak and @bencevans!

This is all so cool! I have seen and really appreciate your contributions and interest. Sadly I have been unable to give ScienceFair the love it deserves this year - I hope to change that in the new year. It would be really great to co-ordinate with both of you - would you be able to drop by the ScienceFair gitter room (https://gitter.im/sciencefair-app/Lobby) for a chat?

SirRujak commented 5 years ago

@blahah I have joined the gitter! I have finally been able to fully import the metadata from my test PMC database. Their database isn't exactly what I would call "clean", and I definitely need to make some alterations to how I find the date for papers. Other than that I am back on track, so hopefully I will have something for you guys to test before long, if you like!

I'm mostly just developing on my own branch at the moment because I break things all the time, but I am happy to make pull requests whenever anyone wants. Also, as I go I am putting the latest test versions on the edge and beta releases of the snap, so anyone who wants to can replicate my setup without too much trouble.