lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

Request for help #155

Closed lukasschwab closed 6 months ago

lukasschwab commented 6 months ago

@lukasschwab Hi, I successfully implemented code to fetch results using the wrapper. When I ran it again two weeks later, it returned an empty response instead of data. Could you suggest what the issue might be? My code is below:

import arxiv  # needed for arxiv.Search and arxiv.SortCriterion below
import datetime
import pandas as pd
import time
from concurrent.futures import ThreadPoolExecutor

start_date = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
end_date = datetime.datetime(2024, 1, 4, 23, 59, 59, tzinfo=datetime.timezone.utc)

search = arxiv.Search(
    query="cat:cs.AI OR cat:stat.ML",
    sort_by=arxiv.SortCriterion.SubmittedDate,
    max_results=1000  # Set a default value for max_results
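)

# `client` is used in fetch_results but was never defined in the post;
# a default arxiv.Client() is assumed here.
client = arxiv.Client()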

titles = []
authors_list = []
affiliations_list = []  
categories_list = []
published_dates = []
pdf_links = []

batch_size = 50  
retry_attempts = 3  

def fetch_results(offset):
    try:
        search.start = offset
        results = client.results(search)

        for r in results:
            published_date = r.published

            if start_date <= published_date.replace(tzinfo=datetime.timezone.utc) <= end_date:
                titles.append(r.title)

                authors = ", ".join([author.name for author in r.authors])
                authors_list.append(authors)

                affiliations = ", ".join([author.affiliation if hasattr(author, 'affiliation') else '' for author in r.authors])
                affiliations_list.append(affiliations)

                categories = ", ".join(r.categories)
                categories_list.append(categories)

                published_dates.append(published_date)

                pdf_links.append(r.pdf_url)

    except Exception as e:
        print(f"An error occurred: {e}")

total_batches = (search.max_results or 1000) // batch_size + 1

with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit tasks to fetch results for each batch
    for offset in range(0, total_batches * batch_size, batch_size):
        executor.submit(fetch_results, offset)

data = {
    'Title': titles,
    'Authors': authors_list,
    'Affiliations': affiliations_list,  
    'Categories' : categories_list,
    'Published Date': published_dates,
    'PDF Link': pdf_links
}

df = pd.DataFrame(data)

Originally posted by @vaish30 in https://github.com/lukasschwab/arxiv.py/issues/43#issuecomment-2000279746

lukasschwab commented 6 months ago

@vaish30 — I can't reproduce any issue with the API or this package. Taking a minimal example:

>>> import arxiv
>>> 
>>> search = arxiv.Search(
...   query="cat:cs.AI OR cat:stat.ML",
...   sort_by=arxiv.SortCriterion.SubmittedDate,
...   max_results=1000 # Set a default value for max_results
... )
>>> r = search.results()
>>> next(r) # non-empty

lukasschwab commented 6 months ago

I'm pretty sure the issue is just that the date range in your filter condition (January 1 to 4, 2024) is now so far in the past that none of the 1000 most recent results for that search fall within it.
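One possible way around this is to put the date window into the query itself, so the API returns only matching papers and nothing has to be filtered out of the newest 1000 results. The sketch below only builds the query string. It assumes the `submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]` range syntax (times in GMT) from the arXiv API user manual, which is worth checking against the current manual. The helper name `build_date_range_query` is hypothetical and not part of this package.

```python
import datetime


def build_date_range_query(categories, start, end):
    """Build an arXiv query string limited to a submission-date window.

    Hypothetical helper, not part of arxiv.py. Assumes the
    `submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]` range syntax with
    times in GMT, as described in the arXiv API user manual.
    `start` and `end` should be timezone-aware datetimes.
    """
    # Convert both bounds to UTC so the formatted timestamps are in GMT.
    start_utc = start.astimezone(datetime.timezone.utc)
    end_utc = end.astimezone(datetime.timezone.utc)
    if start_utc > end_utc:
        raise ValueError("start must not be after end")

    # Group the category terms so that AND applies to all of them.
    category_clause = " OR ".join(f"cat:{c}" for c in categories)
    date_clause = (
        f"submittedDate:[{start_utc:%Y%m%d%H%M} TO {end_utc:%Y%m%d%H%M}]"
    )
    return f"({category_clause}) AND {date_clause}"


# Same window as in the code posted above.
start_date = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
end_date = datetime.datetime(2024, 1, 4, 23, 59, 59, tzinfo=datetime.timezone.utc)

query = build_date_range_query(["cs.AI", "stat.ML"], start_date, end_date)
print(query)
```

The resulting string could then be passed as `query=` to `arxiv.Search` in place of `"cat:cs.AI OR cat:stat.ML"`. If the range syntax behaves as assumed, the client-side `start_date <= published_date <= end_date` check becomes a safety net and no longer the only filter.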

Please do try to investigate issues yourself before opening tickets. If you believe you've identified a true issue (i.e. a divergence from documentation) for a package, it's best to provide a minimal code snippet reproducing the issue — one without confounding application logic or other dependencies.