lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

Usage with pandas dataframe #54

Closed makkammerer closed 3 years ago

makkammerer commented 3 years ago

Could you give an example of using the wrapper with a pandas DataFrame? The previous version could do it!

lukasschwab commented 3 years ago

This version of the client can do (almost) everything the previous version could do, just with slightly different syntax. It'd be easier to answer your question with an example of what used to work, since I'm not familiar with Pandas, but here's my best shot.

The big difference: results are now instances of the arxiv.Result class rather than dicts. These need to be converted into a DataFrame-readable structure.

This StackOverflow answer extracts member variables from an object using the built-in vars(); we can use it on our results:

import arxiv
from pandas import DataFrame

# Search for 10 results.
results = arxiv.Search(query="quantum", max_results=10).get()
# Convert `arxiv.Result` fields into dictionaries.
results_as_dicts = [vars(r) for r in results]

# Construct DataFrame.
df = DataFrame(data=results_as_dicts)

Which yields a reasonable DataFrame (abridged here):

>>> df
                                  entry_id  ...                                              links
0  http://arxiv.org/abs/quant-ph/0201082v1  ...  [<arxiv.arxiv.Result.Link object at 0x1105f9f1...
1  http://arxiv.org/abs/quant-ph/0407102v1  ...  [<arxiv.arxiv.Result.Link object at 0x1105f93d...
2         http://arxiv.org/abs/0804.3401v1  ...  [<arxiv.arxiv.Result.Link object at 0x1105f921...
3         http://arxiv.org/abs/1311.4939v1  ...  [<arxiv.arxiv.Result.Link object at 0x110569b5...
4        http://arxiv.org/abs/1611.03472v1  ...  [<arxiv.arxiv.Result.Link object at 0x110573f9...
5     http://arxiv.org/abs/q-alg/9610034v1  ...  [<arxiv.arxiv.Result.Link object at 0x11057315...
6  http://arxiv.org/abs/quant-ph/0302169v1  ...  [<arxiv.arxiv.Result.Link object at 0x11057349...
7  http://arxiv.org/abs/quant-ph/0309066v1  ...  [<arxiv.arxiv.Result.Link object at 0x11060365...
8  http://arxiv.org/abs/quant-ph/0504224v1  ...  [<arxiv.arxiv.Result.Link object at 0x11060379...
9        http://arxiv.org/abs/2006.03757v1  ...  [<arxiv.arxiv.Result.Link object at 0x110603a1...

[10 rows x 10 columns]
>>> df.columns
Index(['entry_id', 'updated', 'published', 'title', 'authors', 'summary',
       'comment', 'primary_category', 'categories', 'links'],
      dtype='object')
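As the `links` column above shows, some cells hold objects rather than plain values, so they print as reprs. One option is to flatten those fields to strings while converting. The following is a minimal sketch under stated assumptions: `FakeAuthor`, `FakeLink`, and `FakeResult` are hypothetical stand-ins so the code runs without network access, and the attribute names `name` and `href` are assumptions about the library's nested objects that should be checked against the installed version.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for the library's result objects, so this sketch
# runs offline. The attribute names (`name`, `href`) are assumptions.
@dataclass
class FakeAuthor:
    name: str


@dataclass
class FakeLink:
    href: str


@dataclass
class FakeResult:
    entry_id: str
    title: str
    authors: List[FakeAuthor] = field(default_factory=list)
    links: List[FakeLink] = field(default_factory=list)


def flatten(result):
    """Return a copy of the result's fields with nested objects replaced by strings."""
    row = dict(vars(result))  # copy, so the result object itself is untouched
    row["authors"] = ", ".join(a.name for a in result.authors)
    row["links"] = [link.href for link in result.links]
    return row


example = FakeResult(
    entry_id="http://arxiv.org/abs/0000.00000v1",  # placeholder ID, not a real paper
    title="Example title",
    authors=[FakeAuthor("A. Author"), FakeAuthor("B. Author")],
    links=[FakeLink("http://arxiv.org/abs/0000.00000v1")],
)
rows = [flatten(example)]
```

A list like `rows` can be passed to `DataFrame(data=rows)` in the same way as `results_as_dicts`.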

If you want more than those 10 fields in the DataFrame, customize how each arxiv.Result is converted into a dict. For example, to add short IDs:

def convert(result):
    # Copy the fields so adding a key doesn't modify the result object itself.
    row = dict(vars(result))
    row['short_id'] = result.get_short_id()
    return row

# Search for 10 results.
results = arxiv.Search(query="quantum", max_results=10).get()

# Construct DataFrame using the custom `convert` transform.
df_extra = DataFrame(data=[convert(r) for r in results])

This new DataFrame includes the short IDs:

>>> df_extra.columns
Index(['entry_id', 'updated', 'published', 'title', 'authors', 'summary',
       'comment', 'primary_category', 'categories', 'links', 'short_id'],
      dtype='object')
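One detail worth knowing when adapting a transform like `convert`: the built-in `vars()` returns the object's own attribute dictionary, not a copy, so adding a key to that dictionary also adds an attribute to the object. The sketch below illustrates this with `StandInResult`, a hypothetical class that is not part of the library.

```python
class StandInResult:
    """Hypothetical stand-in for a result object; not the library's class."""

    def __init__(self, entry_id):
        self.entry_id = entry_id


# Writing into vars(obj) directly also sets an attribute on the object.
aliased = StandInResult("example-id")
row = vars(aliased)  # the object's own __dict__, not a copy
row["short_id"] = "example"
leaked = hasattr(aliased, "short_id")

# Copying first keeps the object unchanged.
copied = StandInResult("example-id")
row = dict(vars(copied))  # shallow copy of the attribute dictionary
row["short_id"] = "example"
kept_clean = not hasattr(copied, "short_id")
```

Copying with `dict(...)` is shallow, which is enough when the transform only adds or replaces top-level keys.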