Closed sayakpaul closed 3 years ago
I think this can be solved using a Client with a greater number of retries; the API load here isn't that extreme (360 requests with generous sleep times between requests).
This might also benefit from a larger page size than the Client default (100). I expect larger page sizes to cause more individual requests to fail, but decreasing the total number of pages fetched might be a net improvement.
Will test a modified client and update here.
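As a rough sanity check on the request counts, here is a small back-of-the-envelope sketch. It assumes the script issues 12 keyword queries with max_results=3000 each (as in the snippet in this thread) and that every query actually returns that many results; total_page_requests is a hypothetical helper, not part of the arxiv package.

```python
import math


def total_page_requests(num_queries: int, max_results: int, page_size: int) -> int:
    """Upper bound on page requests, assuming every query yields max_results entries."""
    pages_per_query = math.ceil(max_results / page_size)
    return num_queries * pages_per_query


# Assumed workload: 12 keywords, up to 3000 results per keyword.
default_requests = total_page_requests(12, 3000, 100)      # Client default page size
larger_page_requests = total_page_requests(12, 3000, 500)  # proposed page size

print(default_requests, larger_page_requests)  # 360 72
```

Under these assumptions a page size of 500 cuts the number of requests by a factor of five, which is why fewer, larger pages may still come out ahead even if each individual request is somewhat more likely to fail.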
@sayakpaul this client configuration seems to work for me (and, incidentally, significantly decreases the overall runtime). Can you confirm whether it solves the issue?
import arxiv
from tqdm import tqdm

query_keywords = [
    "image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]

# Reuse a client with increased number of retries (3 -> 10) and increased page
# size (100 -> 500).
client = arxiv.Client(num_retries=10, page_size=500)

def query_with_keywords(query):
    search = arxiv.Search(
        query=query,
        max_results=3000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(client.results(search), desc=query):
        # Keep only papers whose primary category is one of the three targets.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG"]:
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts

all_titles = []
all_summaries = []
all_terms = []
for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)
Thanks very much, @lukasschwab. I am currently testing your solution. Will update here after I am done.
@lukasschwab it works absolutely fine. I was wondering if there's a way to retrieve results by primary terms, i.e. to get paper titles and abstracts based on their primary tags, without using any keyword.
Is this doable with arxiv?
@sayakpaul I'm not sure what you mean by "primary terms." As far as I know, arXiv's metadata has no concept of tags/labels (let me know if I'm missing some documentation from arXiv themselves), so there's no API interface for searching by tags/labels.
If you want to search by category, query strings (the argument to arxiv.Search in the snippet above) do support that. For example, you could let the API restrict the search to your three target categories.
The query string would look like this: image segmentation AND (cat:cs.CV OR cat:stat.ML OR cat:cs.LG). You could build this in query_with_keywords:
categories = ["cs.CV", "stat.ML", "cs.LG"]
category_condition = " OR ".join(["cat:" + c for c in categories])  # "cat:cs.CV OR cat:stat.ML OR cat:cs.LG"

def query_with_keywords(query):
    query_with_categories = "{} AND ({})".format(query, category_condition)
    search = arxiv.Search(
        query=query_with_categories,
        max_results=3000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(client.results(search), desc=query):
        terms.append(res.categories)
        titles.append(res.title)
        abstracts.append(res.summary)
    return terms, titles, abstracts
If your issue is that you're getting undesirable partial matches (e.g. queries for query="image segmentation" match papers that mention "image" but not "segmentation"), you should have better results by adding double quotes to the query phrase: query='"image segmentation"' or query="\"image segmentation\"".
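Putting the two suggestions together, a query-string builder might look roughly like the sketch below. build_query is a hypothetical helper (not part of the arxiv package); it only assembles the string, which you would then pass as the query argument to arxiv.Search. Passing no phrase yields a category-only query, which may be the closest thing to searching "without using any keyword."

```python
def build_query(categories, phrase=None, exact_phrase=True):
    """Assemble an arXiv query string from an optional phrase and a list of categories."""
    # "cat:cs.CV OR cat:stat.ML OR cat:cs.LG"
    category_condition = " OR ".join("cat:" + c for c in categories)
    if phrase is None:
        # Category-only search: no keyword at all.
        return category_condition
    # Wrap the phrase in double quotes to ask for the whole phrase rather than partial matches.
    keyword = '"{}"'.format(phrase) if exact_phrase else phrase
    return "{} AND ({})".format(keyword, category_condition)


categories = ["cs.CV", "stat.ML", "cs.LG"]
print(build_query(categories, "image segmentation"))
print(build_query(categories))
```

The first call produces the quoted phrase combined with the three categories; the second produces only the category condition.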
Some reference resources:
Hope that's helpful!
Thanks very much for the pointers. This is really helpful :)
Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. Basically, the abstracts will be the features for a machine learning model, which will be tasked with predicting the associated terms, making it a multi-label classification problem.
I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.
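To make the multi-label setup concrete, here is a minimal sketch of how per-paper term lists could be turned into multi-hot target vectors. It is illustrative only and is not the code from this report: build_label_vocabulary and to_multi_hot are hypothetical helpers, and the example terms are made up.

```python
def build_label_vocabulary(all_terms):
    """Collect every distinct term across papers, in a stable sorted order."""
    return sorted({term for terms in all_terms for term in terms})


def to_multi_hot(terms, vocabulary):
    """Encode one paper's terms as a 0/1 vector aligned with the vocabulary."""
    present = set(terms)
    return [1 if label in present else 0 for label in vocabulary]


# Made-up example: three papers, each with one or more category terms.
example_terms = [["cs.CV", "cs.LG"], ["stat.ML"], ["cs.LG", "stat.ML"]]
vocabulary = build_label_vocabulary(example_terms)
targets = [to_multi_hot(terms, vocabulary) for terms in example_terms]

print(vocabulary)  # ['cs.CV', 'cs.LG', 'stat.ML']
print(targets)     # [[1, 1, 0], [0, 0, 1], [0, 1, 1]]
```

Each abstract would then be paired with its target vector, so a single paper can carry several positive labels at once.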
Here's what I am doing:
Now, while executing this I am abruptly running into:
It's not that the underlying search keyword has run out of pages. I have verified this: in a new run, the exception happens for a different keyword.
Was wondering if there's a way to circumvent this. Thanks so much in advance.