py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

DEV: Mirror freely licensed arXiv documents locally #2904

Open stefan6419846 opened 2 days ago

stefan6419846 commented 2 days ago

We are currently experiencing regular issues with arXiv documents not being available for the Windows CI due to rate limiting. At the same time, most of these documents are available under permissive licenses, which would allow us to keep our own repository of them that we could clone for CI while reducing the load for arXiv and GitHub downloads. I am open to creating this on my personal account for the time being.

List of licenses: https://info.arxiv.org/help/license/index.html

For our repository, only https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html would be problematic, as it does not grant us any rights at all.
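As a rough sketch of how such a license check could look when deciding which files to mirror (the function name and the normalization are my own assumptions, not existing pypdf code, and treating every other license as acceptable is a simplification that would still need a manual review):

```python
# Hypothetical helper: decide whether a document may go into a public mirror.
# Assumption: only the arXiv non-exclusive distribution license is excluded,
# as stated above; each remaining license would still need a manual look.
NON_MIRRORABLE_LICENSES = {
    "https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html",
}


def may_mirror(license_url: str) -> bool:
    """Return True if the license URL is not on the exclusion list."""
    normalized = license_url.strip().rstrip("/").lower()
    if normalized.startswith("http://"):
        # arXiv metadata sometimes uses plain http for license links.
        normalized = "https://" + normalized[len("http://"):]
    if not normalized:
        # No license information at all: stay on the safe side.
        return False
    return normalized not in NON_MIRRORABLE_LICENSES
```

A document without any license information is rejected here on purpose, since we cannot assume any rights in that case.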

stefan6419846 commented 2 days ago

Just did some quick verification:

For the arXiv-only files, we might need to have a look at their usages. Maybe it is possible to replace them with more liberally licensed ones without too many side effects.

pubpub-zz commented 1 day ago

My opinion: arxiv.org is available on web.archive.org: https://web.archive.org/web/20241009013003/https://arxiv.org/

arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv.

If we put a copy of these files on GitHub, isn't this considered as just making the documents available as another copy, as web.archive.org would have done?

We are just using these documents as support for tests: they are not embedded within pypdf, so we are not infringing licenses.

stefan6419846 commented 1 day ago

The Wayback Machine tends to be a victim of DDoS attacks regularly as well, and their traffic is more important for other use cases in my opinion. This does not help with the current number of CI pipelines regularly failing due to arXiv running into rate limits (even from my local device, without having done any downloads from them in the hours before). My goal is to stabilize the tests again, and mirroring freely licensed documents from arXiv seems like a good idea, as they are responsible for most of the failures at the moment anyway.
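To illustrate the idea, a minimal sketch of how a test helper could prefer a locally cloned mirror and only fall back to the original URL with a few retries. The function name, the parameters and the mirror directory layout are assumptions for illustration, not the existing pypdf test helpers:

```python
import time
from pathlib import Path
from urllib.request import urlopen


def fetch_test_file(url: str, name: str, mirror_dir: str,
                    retries: int = 3, delay: float = 1.0) -> bytes:
    """Return the file from the local mirror if present, else download it."""
    mirrored = Path(mirror_dir) / name
    if mirrored.is_file():
        # The mirror repository was cloned for CI: no network access needed.
        return mirrored.read_bytes()

    last_error = None
    for attempt in range(retries):
        try:
            with urlopen(url, timeout=30) as response:
                return response.read()
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            last_error = error
            # Back off exponentially before the next attempt.
            time.sleep(delay * 2 ** attempt)
    raise RuntimeError(f"Could not download {url}") from last_error
```

With this, only files missing from the mirror (for example the ones we are not allowed to redistribute) would still hit arXiv directly.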

If we put a copy of these files on GitHub, isn't this considered as just making the documents available as another copy, as web.archive.org would have done?

Nearly everything is subject to copyright, which has to be considered. As long as we are just downloading the data on the fly and running pypdf on it without persisting protected parts in a publicly accessible way, I do not see any issues (although IANAL). Storing our own public copies instead requires us to respect the original copyright and thus is more restrictive - we are not the Internet Archive.

We are just using these documents as support for tests: they are not embedded within pypdf, so we are not infringing licenses.

This issue is specifically about creating our own hosted copies of them. In these cases I would like to avoid any licensing issues which could have a negative impact on the maintainers.