lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

UnexpectedEmptyPageError at abrupt intervals #83

Closed sayakpaul closed 3 years ago

sayakpaul commented 3 years ago

Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. Basically, the abstracts will be features for a machine learning model, which will be tasked with predicting the associated terms, making it a multi-label classification problem.

I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.

Here's what I am doing:

  1. Define a list of query strings to include in the dataset:
import arxiv
from tqdm import tqdm

query_keywords = [
    "image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",  # comma added: without it, Python concatenates this string with the next one
    "image segmentation",
    "few-shot learning"
]
  2. Define a utility function:
def query_with_keywords(query):
    search = arxiv.Search(query=query, 
                        max_results=3000,
                        sort_by=arxiv.SortCriterion.LastUpdatedDate)
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(search.results()):
        if res.primary_category=="cs.CV" or \
            res.primary_category=="stat.ML" or \
                res.primary_category=="cs.LG":

            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts
  3. Loop the above function over the list defined in step 1:
import time

wait_time = 3

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

    time.sleep(wait_time)

Now, while executing this I am abruptly running into:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in __try_parse_feed(self, url, first_page, retries_left, last_err)
    687             # Feed was never returned in self.num_retries tries. Raise the last
    688             # exception encountered.
--> 689             raise err
    690         return feed
    691 

UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=representation+learning&id_list=&sortBy=lastUpdatedDate&sortOrder=descending&start=800&max_results=100)

It's not that the underlying search keyword has run out of pages. I have verified this: in a new run, the exception happens for a different keyword.

Was wondering if there's a way to circumvent this. Thanks so much in advance.

lukasschwab commented 3 years ago

I think this can be solved using a Client with a greater number of retries; the API load here isn't that extreme (360 requests with generous sleep times between requests).
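As a rough illustration of what a higher retry count buys, here is a minimal sketch of retrying a flaky call. It is not the library's actual implementation: `call_with_retries`, its parameters, and the `flaky_fetch` stand-in are hypothetical names used only for this example.

```python
import time


def call_with_retries(fn, num_retries=3, delay_seconds=0.0):
    """Call fn(), retrying up to num_retries times after a failure.

    Hypothetical helper: illustrates the general retry idea only.
    """
    last_err = None
    # One initial attempt plus num_retries retries.
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except Exception as err:  # a real client would catch narrower errors
            last_err = err
            if attempt < num_retries:
                time.sleep(delay_seconds)
    # Every attempt failed: re-raise the last error seen.
    raise last_err


# Stand-in for a page fetch that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("empty page")
    return ["entry"]


page = call_with_retries(flaky_fetch, num_retries=3)
```

With more retries allowed, a page that intermittently comes back empty has more chances to succeed before the error is surfaced to the caller.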

This might also benefit from a larger page size than the Client default (100). I expect larger page sizes to cause more individual requests to fail, but decreasing the total number of pages fetched might be a net improvement.
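The request count above can be reproduced with a little arithmetic. This is a back-of-the-envelope sketch that assumes every query actually returns the full 3000 results and ignores retries; `total_requests` is a hypothetical helper, and the page size of 500 is just an example value.

```python
import math


def total_requests(num_queries, max_results, page_size):
    """Upper bound on page requests if every query returns max_results entries."""
    pages_per_query = math.ceil(max_results / page_size)
    return num_queries * pages_per_query


# 12 keywords, 3000 results each, at the default page size and a larger one.
requests_at_100 = total_requests(num_queries=12, max_results=3000, page_size=100)
requests_at_500 = total_requests(num_queries=12, max_results=3000, page_size=500)
print(requests_at_100, requests_at_500)  # 360 72
```

Fewer pages means fewer opportunities for any single page to come back empty, at the cost of each request being larger.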

Will test a modified client and update here.

lukasschwab commented 3 years ago

@sayakpaul this client configuration seems to work for me (and, incidentally, significantly decreases the overall runtime). Can you confirm whether it solves the issue?

import arxiv
from tqdm import tqdm

query_keywords = [
    "image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]

# Reuse a client with an increased number of retries (3 -> 10) and an
# increased page size (100 -> 500).
client = arxiv.Client(num_retries=10, page_size=500)

def query_with_keywords(query):
    search = arxiv.Search(
        query=query,
        max_results=3000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(client.results(search), desc=query):
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG"]:
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

sayakpaul commented 3 years ago

Thanks very much, @lukasschwab. I am currently testing your solution. Will update here after I am done.

sayakpaul commented 3 years ago

@lukasschwab it works absolutely fine. I was wondering if there's a way to retrieve results by primary terms, i.e. to get the paper titles and abstracts by their primary tags, without using any keyword.

Is this doable with arxiv?

lukasschwab commented 3 years ago

@sayakpaul I'm not sure what you mean by "primary terms." As far as I know, arXiv's metadata has no concept of tags or labels (let me know if I'm missing some documentation from arXiv themselves), so there's no API interface for searching by tags or labels.

If you want to search by category, query strings (the argument to arxiv.Search in the snippet above) do let you query by category. For example, you could let the API filter for your three target categories instead of filtering the results yourself.

The query string would look like this: image segmentation AND (cat:cs.CV OR cat:stat.ML OR cat:cs.LG). You could build this in query_with_keywords:

categories = ["cs.CV", "stat.ML", "cs.LG"]
category_condition = " OR ".join(["cat:" + c for c in categories]) # "cat:cs.CV OR cat:stat.ML OR cat:cs.LG"

def query_with_keywords(query):
    query_with_categories = "{} AND ({})".format(query, category_condition)
    search = arxiv.Search(
        query=query_with_categories,
        max_results=3000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(client.results(search), desc=query):
        terms.append(res.categories)
        titles.append(res.title)
        abstracts.append(res.summary)
    return terms, titles, abstracts

If your issue is that you're getting undesirable partial matches (e.g. queries for query="image segmentation" matching papers that mention "image" but not "segmentation"), you should get better results by adding double quotes to the query phrase: query='"image segmentation"' or query="\"image segmentation\"".
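Putting the quoting and the category filter together, a query string could be assembled roughly as follows. This is only a sketch: `build_query` is a hypothetical helper, not part of the arxiv package, and it assumes the phrase itself contains no double quotes. Leaving out the phrase yields a category-only query, which may be closer to the "without using any keyword" case.

```python
def build_query(phrase=None, categories=()):
    """Build an arXiv API query string (hypothetical helper).

    The phrase is wrapped in double quotes so it is matched as a whole;
    categories are OR-ed together using the cat: prefix.
    """
    cats = " OR ".join("cat:" + c for c in categories)
    if phrase and cats:
        # Parentheses keep the OR group together when combined with AND.
        return '"{}" AND ({})'.format(phrase, cats)
    if phrase:
        return '"{}"'.format(phrase)
    return cats


categories = ["cs.CV", "stat.ML", "cs.LG"]
keyword_query = build_query("image segmentation", categories)
category_only_query = build_query(categories=categories)
print(keyword_query)
print(category_only_query)
```

Either string can then be passed as the query argument to arxiv.Search, as in the snippet above.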

Some reference resources:

Hope that's helpful!

sayakpaul commented 3 years ago

Thanks very much for the pointers. This is really helpful :)