josephsdavid / cord-19-tools

Tools for helping out with COVID-19 research
MIT License

2020/04/17 CORD Update Breaks Download #4

Closed jacobdanovitch closed 4 years ago

jacobdanovitch commented 4 years ago

Tried to run download("data") this morning and only got 10 JSON files from last night's biorxiv update. It seems as though about 950 JSON files got uploaded without a folder, which breaks the current download function.

I rewrote it to handle this as well as to allow people to specify which files they'd like to download (either matching a regex or containing a substring). Should I open a PR?
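For illustration only, here is a minimal sketch of that kind of file filter. It is not the code from the PR: the function name `select_files`, its parameters, and the file names in `listing` are all hypothetical, and it assumes the download step already has a flat list of remote file names to choose from.

```python
import re


def select_files(names, pattern=None, substring=None):
    """Return the names that pass every filter that was supplied.

    Hypothetical helper: `pattern` is a regular expression searched for
    anywhere in the name, `substring` is a plain containment check.
    With neither filter given, every name is kept.
    """
    selected = []
    for name in names:
        if pattern is not None and re.search(pattern, name) is None:
            continue
        if substring is not None and substring not in name:
            continue
        selected.append(name)
    return selected


# Made-up listing that mixes archives with loose files.
listing = [
    "comm_use_subset.tar.gz",
    "biorxiv_medrxiv.tar.gz",
    "metadata.csv",
    "0a1b2c.json",
]

archives = select_files(listing, pattern=r"\.tar\.gz$")
loose_json = select_files(listing, substring="json")
```

Filtering on names rather than on position in the listing means files uploaded outside a folder no longer change which ones get picked.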

josephsdavid commented 4 years ago

Absolutely! I’ll merge ASAP!

josephsdavid commented 4 years ago

Thank you for checking!!

jacobdanovitch commented 4 years ago

No problem. I haven't used this before so I'm not sure what it previously downloaded; was it only looking for the .tar.gz files? If so, I'll make that the default file filter.

josephsdavid commented 4 years ago

So it was rather poorly coded up till now: it simply grabbed the last 10 files from the data, because I didn’t expect that to change :) The bulk of the data is stored in gz files, so I think that’s a reasonable filter.
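To make the failure mode concrete, here is a small sketch contrasting a positional "last 10 files" selection with an extension-based one. The listings are invented for illustration and the helper names `last_n` and `archives_only` are hypothetical; only the counts (10 files picked, about 950 loose JSON files) echo the numbers mentioned in this thread.

```python
def last_n(names, n=10):
    """Positional selection: take whatever sits at the end of the listing."""
    return names[-n:]


def archives_only(names, suffix=".tar.gz"):
    """Extension-based selection: keep only names ending in `suffix`."""
    return [name for name in names if name.endswith(suffix)]


# Invented listing before the update: only a few entries, archives included.
before = ["a.tar.gz", "b.tar.gz", "metadata.csv"]

# Invented listing after the update: 950 loose JSON files appended.
after = before + [f"paper_{i}.json" for i in range(950)]

picked_by_position = last_n(after)
picked_by_suffix = archives_only(after)
```

With the invented listing above, the positional pick contains only JSON files, while the suffix filter still returns the two archives regardless of what was appended.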

jacobdanovitch commented 4 years ago

PR opened! I've been testing it in Colab so it could probably use a quick test just in case, but it should work for the listed cases. Only additional library used is re.

josephsdavid commented 4 years ago

Merged! I’ll fiddle around with it locally for a bit before it goes on PyPI.

josephsdavid commented 4 years ago

Yeah, looks good to me. I am going to go ahead and package it! Thank you!!!