mattbierbaum / arxiv-public-datasets

A set of scripts to grab public datasets from resources related to arXiv
https://arxiv.org/abs/1905.00075
MIT License
400 stars, 63 forks

Possible to publish internal-citations.json.gz? #25

Closed: turian closed this issue 1 year ago

turian commented 1 year ago

This is a very useful document on its own, and should be relatively small.

Rather than running the entire pipeline to compute this file, could it be shared? Perhaps on Hugging Face?

colinclement commented 1 year ago

This is available in the releases; see internal-references-v0.2.0-2019-03-01.json.gz.
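For anyone picking up the release file, here is a minimal sketch of reading it in Python. It assumes the file is a gzipped JSON object that maps each arXiv ID to a list of the arXiv IDs it cites; that layout is an assumption, so check it against the actual download. The function name `load_citation_graph` is made up for this example and is not part of the repo.

```python
import gzip
import json


def load_citation_graph(path):
    """Read a gzipped JSON file and return the parsed object.

    Assumption: the file holds a dict of {arxiv_id: [cited_arxiv_id, ...]}.
    """
    # "rt" opens the gzip stream in text mode so json.load can read it directly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_edges(graph):
    """Total number of citation edges in a {paper: [cited, ...]} mapping."""
    return sum(len(cited) for cited in graph.values())
```

Usage would look like `graph = load_citation_graph("internal-references-v0.2.0-2019-03-01.json.gz")`, followed by `count_edges(graph)` for a quick sanity check.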

turian commented 1 year ago

Thank you. Are you aware of any more recent crawls?

colinclement commented 1 year ago

Unfortunately not. You can re-run it yourself with the PDF dump on Kaggle for free, though the process is quite slow. I think we originally ran it on a 96-core machine and it took half a day or so. Extracting citations from the LaTeX source would be much faster, but you'll have to pony up ~$100 to AWS for egress.

qrdlgit commented 11 months ago

Is there a script for doing that with LaTeX? I was thinking of just grabbing the co-citations for 23/22. Using Virginia EC2 on AWS should cut down on egress costs by 1/9th. Happy to publish the result on Kaggle.

I'm a bit surprised arXiv doesn't invest more in this, as it would strongly encourage publishing and co-citing on arXiv.

IllDepence commented 10 months ago

(... ended up here by mere chance while looking up some references wrt arXiv’s history)

@qrdlgit not sure if it is exactly what you’re looking for regarding the focus on co-citations, but I have a project here that converts the LaTeX sources of arXiv to structured document representations plus a citation network. For a constrained time frame (e.g. just a year or a month) the document conversion should be straightforward, but generating the citation network relies on a local dump of part of OpenAlex, which requires some disk space and a few steps to set up.

colinclement commented 9 months ago

> Is there a script for doing that with LaTeX? I was thinking of just grabbing the co-citations for 23/22. Using Virginia EC2 on AWS should cut down on egress costs by 1/9th. Happy to publish the result on Kaggle.
>
> I'm a bit surprised arXiv doesn't invest more in this, as it would strongly encourage publishing and co-citing on arXiv.

All of the scripts we used to generate the data presented are available in this repo. The download script can fetch the LaTeX source if you switch one parameter, but we did not create tooling for parsing the bib files, though this should be as easy as looking for arXiv IDs. If you would like to contribute an updated version, that would be awesome.
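As a rough illustration of "looking for arXiv IDs", here is a minimal sketch that scans the text of a .bib, .bbl, or .tex file with a regular expression. It is not the repo's tooling, and the name `extract_arxiv_ids` is hypothetical. To limit false positives it only accepts IDs that follow an `arXiv:` prefix or an `arxiv.org/abs/` or `arxiv.org/pdf/` URL, so bare old-style IDs such as `hep-th/9901001` with no prefix would be missed.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
NEW_STYLE = r"\d{4}\.\d{4,5}(?:v\d+)?"
# Old-style IDs: archive[.SUBJECT]/YYMMNNN, e.g. hep-th/9901001 or math.GT/0309136.
OLD_STYLE = r"[a-z][a-z\-]+(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?"

# Require an "arXiv:" prefix or an arxiv.org URL in front of the ID.
ARXIV_ID = re.compile(
    r"(?:arxiv\s*:\s*|arxiv\.org/(?:abs|pdf)/)(" + NEW_STYLE + "|" + OLD_STYLE + ")",
    re.IGNORECASE,
)


def extract_arxiv_ids(text):
    """Return the sorted, de-duplicated arXiv IDs found in text, versions stripped."""
    ids = set()
    for match in ARXIV_ID.finditer(text):
        # Drop a trailing version such as "v2" so all versions map to one paper.
        ids.add(re.sub(r"v\d+$", "", match.group(1), flags=re.IGNORECASE))
    return sorted(ids)
```

Running this over each paper's bibliography and keeping the IDs per paper would give the same kind of paper-to-cited-papers mapping as the PDF-based pipeline, under the assumptions above.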