jannisborn / paperscraper

Tools to scrape publication metadata from pubmed, arxiv, medrxiv and chemrxiv.
MIT License

Scrape X-rxiv via API #33

Open jannisborn opened 1 year ago

jannisborn commented 1 year ago

Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.

Ideally, these dumps should be stored on a server and updated regularly (cron job). Users would just send requests to the server API. That would become the new default behaviour, but local download should still be supported too.
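A minimal sketch of how that dual usage could look from the client side; the `API_URL` endpoint, its query parameters, and the JSONL field names are all hypothetical:

```python
import json

import requests

# Hypothetical endpoint; no such server exists yet.
API_URL = "https://paperscraper.example.org/search"


def search(keywords, server="biorxiv", local_dump=None):
    """Query the remote API if available, else scan a locally downloaded JSONL dump."""
    if local_dump is None:
        # New default: let the server do the work on its cron-updated dumps.
        resp = requests.get(
            API_URL, params={"server": server, "q": " ".join(keywords)}, timeout=30
        )
        resp.raise_for_status()
        return resp.json()

    # Old behaviour: keyword match against the local dump.
    hits = []
    with open(local_dump) as fh:
        for line in fh:
            entry = json.loads(line)
            text = (entry.get("title", "") + " " + entry.get("abstract", "")).lower()
            if all(k.lower() in text for k in keywords):
                hits.append(entry)
    return hits
```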

AstroWaffleRobot commented 10 months ago

Hi there. Thanks for your work on this project. As a temporary solution, I've saved the DBs in a requester-pays S3 bucket. To download the JSONL files, use these commands:

aws s3 cp s3://astrowafflerp/biorxiv.jsonl biorxiv.jsonl --request-payer requester
aws s3 cp s3://astrowafflerp/chemrxiv.jsonl chemrxiv.jsonl --request-payer requester
aws s3 cp s3://astrowafflerp/medrxiv.jsonl medrxiv.jsonl --request-payer requester

https://docs.aws.amazon.com/AmazonS3/latest/userguide/ObjectsinRequesterPaysBuckets.html
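For anyone doing this from Python instead of the AWS CLI, roughly the same download can be done with boto3 (same bucket and keys as the commands above; requester-pays is passed via `ExtraArgs`):

```python
import boto3

# Same bucket/keys as the CLI commands above; the requester pays for the transfer.
s3 = boto3.client("s3")
for name in ("biorxiv", "chemrxiv", "medrxiv"):
    s3.download_file(
        Bucket="astrowafflerp",
        Key=f"{name}.jsonl",
        Filename=f"{name}.jsonl",
        ExtraArgs={"RequestPayer": "requester"},
    )
```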

I've got a cron job that runs daily, so they should be current, but let me know if you have any trouble.

Here's the maintainer script: https://github.com/AstroWaffleRobot/getlit

jannisborn commented 10 months ago

Hi @AstroWaffleRobot, thanks, this is a nice initiative and it's great that the script is also available. I'd still like an internal solution inside paperscraper; the easy way would be to adapt your code into an update_dumps() function that updates all local dumps since the tool was last used. It would also be easy to trigger it automatically whenever a search is performed, to make sure the data is up to date.
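A rough sketch of what such an update_dumps()-style helper could look like for bioRxiv/medRxiv, using their public details endpoint; the dump location, field names, and the append-in-place approach are assumptions, chemRxiv would need its own client, and deduplication by DOI is left out for brevity:

```python
import json
from datetime import date
from pathlib import Path

import requests

# Hypothetical dump location; paperscraper keeps its dumps elsewhere.
DUMP_DIR = Path("server_dumps")


def _latest_date(dump_path: Path) -> str:
    """Return the most recent 'date' field found in a JSONL dump."""
    latest = "2000-01-01"
    with dump_path.open() as fh:
        for line in fh:
            entry = json.loads(line)
            latest = max(latest, entry.get("date", latest))
    return latest


def update_dump(server: str = "biorxiv") -> None:
    """Append records published since the newest local entry.

    Uses the public details endpoint
    https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor},
    which returns results in batches under the 'collection' key.
    """
    dump_path = DUMP_DIR / f"{server}.jsonl"
    start, end = _latest_date(dump_path), date.today().isoformat()
    cursor = 0
    with dump_path.open("a") as fh:
        while True:
            url = f"https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor}"
            batch = requests.get(url, timeout=30).json().get("collection", [])
            if not batch:
                break
            for entry in batch:
                fh.write(json.dumps(entry) + "\n")
            cursor += len(batch)
```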

jannisborn commented 10 months ago

Long-term, I want to create a lightweight API that I can deploy on my own VM to serve the requests. On the VM, a daily cron job would update the data, and the API would run the package itself in its current mode, where the data is assumed to be locally available. That way there's dual usage: users could either use the package out of the box, without the slow download of the dumps, or do it the old (current) way by downloading the dumps first.
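For the server side, a minimal sketch of what that lightweight API could be, here with Flask and a naive keyword scan over the local dumps rather than paperscraper's own query code (paths, route, and response format are made up):

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Paths to the cron-updated dumps on the VM (made-up locations).
DUMPS = {s: f"/data/{s}.jsonl" for s in ("biorxiv", "medrxiv", "chemrxiv")}


@app.get("/search/<server>")
def search(server: str):
    """Return dump entries whose title or abstract contains every query keyword."""
    keywords = [k.lower() for k in request.args.get("q", "").split()]
    hits = []
    with open(DUMPS[server]) as fh:
        for line in fh:
            entry = json.loads(line)
            text = (entry.get("title", "") + " " + entry.get("abstract", "")).lower()
            if all(k in text for k in keywords):
                hits.append(entry)
    return jsonify(hits)
```

Run it e.g. with `flask --app server run` and query `/search/biorxiv?q=covid vaccine`; in practice the endpoint would call paperscraper's own dump-query code instead of this naive scan.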

yarikoptic commented 3 weeks ago

I guess no more of that bucket?

dandi@drogon:~$ aws s3 ls s3://astrowafflerp/

An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist

FWIW, I wanted to check the sizes; I could probably have picked up serving those from https://datasets.datalad.org/ or some other S3 bucket.