titipata / arxivpy

Python wrapper for arXiv API
MIT License

Arxivpy


Python wrapper for the arXiv API. Related libraries and repositories: arxiv.py, python_arXiv_parsing_example.py and arxiv-sanity-preserver. arXiv is an open-access e-print archive with 1M+ e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics.

Example

Here is an example of how to use arxivpy.

import arxivpy
articles = arxivpy.query(search_query=['cs.CV', 'cs.LG', 'cs.CL', 'cs.NE', 'stat.ML'],
                         start_index=0, max_index=200, results_per_iteration=100,
                         wait_time=5.0, sort_by='lastUpdatedDate') # grab 200 articles

The search_query input can be a list of categories or a string in arXiv query format. The output is a list of dictionaries parsed from the arXiv XML response. This example fetches the 200 most recently updated papers (indices 0 to 200), 100 at a time, waiting about 5 seconds between requests (see the note below if scraping many papers).
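The batching described above can be pictured with a small sketch. It is not arxivpy's actual implementation: batch_windows is a hypothetical helper that only shows how a range from start_index to max_index could be split into requests of results_per_iteration papers, assuming the last batch is shortened to fit.

```python
def batch_windows(start_index, max_index, results_per_iteration):
    """Yield (start, count) pairs covering [start_index, max_index).

    Hypothetical helper for illustration; arxivpy may batch differently.
    """
    for start in range(start_index, max_index, results_per_iteration):
        # Shorten the last window so it never runs past max_index.
        count = min(results_per_iteration, max_index - start)
        yield start, count


windows = list(batch_windows(start_index=0, max_index=200, results_per_iteration=100))
print(windows)  # [(0, 100), (100, 100)]
```

With the settings from the example this gives two requests, and a client would pause for wait_time seconds between them.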

Queries

You can use other search queries, for example:

search_query=['cs.DB', 'cs.IR']
search_query='cs.DB' # select only Databases papers
search_query='au:kording' # author name includes Kording
search_query='ti:deep+AND+ti:learning' # title with `deep` and `learning`
search_query='abs:%22deep+learning%22' # deep learning as a phrase
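In the last query, %22 is a URL-encoded double quote and + stands for a space. As a sketch using only the Python standard library, a phrase query of that shape could be built as follows. The helper name phrase_query is made up for illustration and is not part of arxivpy.

```python
from urllib.parse import quote_plus


def phrase_query(prefix, phrase):
    """Return a query string matching `phrase` exactly in the `prefix` field.

    Hypothetical helper for illustration; not part of arxivpy.
    """
    # quote_plus encodes '"' as %22 and each space as '+'.
    return prefix + ":" + quote_plus('"' + phrase + '"')


print(phrase_query("abs", "deep learning"))  # abs:%22deep+learning%22
```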

Or you can build a simple search query using arxivpy.generate_query:

search_query = arxivpy.generate_query(terms=['cs.CV', 'cs.LG', 'cs.CL', 'cs.NE', 'stat.ML'],
                                      prefix='category', boolean='OR')
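To give a feel for what a query built from several terms looks like, here is a rough sketch that joins terms into the string format used in the manual examples above. It is an assumption about the output shape, not arxivpy's own code: join_terms is a hypothetical helper, and the example call assumes the cat: field prefix that the arXiv API uses for subject categories.

```python
def join_terms(terms, prefix, boolean="OR"):
    """Join terms as 'prefix:term+BOOLEAN+prefix:term...'.

    Hypothetical helper; arxivpy.generate_query may format its output differently.
    """
    # The arXiv API accepts these three boolean operators.
    if boolean not in ("AND", "OR", "ANDNOT"):
        raise ValueError(f"unsupported boolean operator: {boolean}")
    return f"+{boolean}+".join(f"{prefix}:{term}" for term in terms)


print(join_terms(["cs.DB", "cs.IR"], prefix="cat"))  # cat:cs.DB+OR+cat:cs.IR
```

The same helper reproduces the title query shown earlier: join_terms(['deep', 'learning'], prefix='ti', boolean='AND') gives 'ti:deep+AND+ti:learning'.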

Or convert plain text to an arXiv query using arxivpy.generate_query_from_text:

query = arxivpy.generate_query_from_text("author k kording & author achakulvisut & title science & abstract recommendation") # awesome paper
articles = arxivpy.query(search_query=query)

More search query prefixes, booleans and categories are listed on the wiki page. More example queries can be found in the arXiv user manual.

Download PDF

You can also use arxivpy.download to download the articles to a given directory. Here is a snippet to do that.

arxivpy.download(articles, path='arxiv_pdf')

Note from API

Installation

The easiest way is to use pip.

pip install git+https://github.com/titipata/arxivpy

You can also install it manually by cloning the repository and running setup.py.

git clone https://github.com/titipata/arxivpy
cd arxivpy
python setup.py install

Dependencies