lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License
1.07k stars 120 forks source link

Proposal for Adding Asynchronous Support #153

Closed BalconyJH closed 5 months ago

BalconyJH commented 8 months ago

Motivation

I am currently using an asynchronous framework to provide users with real-time arXiv literature queries in a conversational approach. However, this repository is written synchronously and uses the requests library for network requests. I am concerned that it will block my main process.

Solution

Add new functions to the existing codebase, which currently uses the requests library for network requests, to incorporate similar asynchronous request libraries like aiohttp or niquests.

Considered alternatives

If possible, while retaining the existing structure of the repository, add asynchronous implementations to functions that perform network requests. Of course, unit tests also need to be added to test these implementations. I believe that the time required to parse Atom feeds is very short and does not significantly impact user experience, so asynchronous parsing might not be necessary. In most cases, the parsing time is relatively short compared to the time taken for network requests and is not a bottleneck, hence maintaining the current synchronous parsing might be a more practical choice.

Additional context

If you're agreeable, I can create a draft pull request to gradually implement the feature mentioned in this issue. If you have any suggestions or questions, I look forward to your response. By the way, I am currently a university student, so the progress of the PR might not be very fast, but I will make sure to provide timely updates in the pull request.

lukasschwab commented 7 months ago

Hi @BalconyJH — two questions!

  1. Can you propose a backwards-compatible way to introduce async support, or is this necessarily a major breaking change?
  2. Is it possible to enclose the existing synchronous behavior in some asynchronous wrapper for your use case, e.g. asyncio.to_thread as described here?
BalconyJH commented 7 months ago

Replace requests Dependency with niquests and Introduce Asynchronous Functionality

Description:

Completely replace the requests library with niquests in the project, removing the dependency on requests. niquests is well compatible with requests.

Changes:

source code in https://github.com/lukasschwab/arxiv.py/blob/8ef0759f870b563a2ef9fff3cf5b13ef437dd737/arxiv/__init__.py#L638C5-L676C1

  1. Replace requests with niquests: According to the niquests blog, this should be a straightforward replacement. Tests in my local environment show that niquests behaves consistently with requests, with all tests passing after necessary adjustments. image

  2. Update Class Initialization: We need to modify class initialization to use niquests. This involves updating import statements and any requests-specific initialization code.

  3. Add New Asynchronous Function:

A new asynchronous function will be added to leverage niquests's asynchronous capabilities. This function will provide additional flexibility in handling HTTP requests.

Images:

Class Initialization Changes

Using niquests

New Async Function

Utilizing asyncio.to_thread or ThreadPoolExecutor provides a feasible method to execute synchronous requests in an asynchronous context.Nevertheless, this approach introduces a level of complexity by blending asynchronous and synchronous paradigms, coupled with the intricacies of thread management. Such complexity could potentially lead to challenges in maintaining and debugging the code. Should this strategy prove to be cumbersome in its implementation, I would consider it as a fallback option.

I eagerly await your thoughts on this approach, particularly any insights into potential risks or nuances that might have been overlooked.

BalconyJH commented 7 months ago

@lukasschwab

lukasschwab commented 7 months ago

Looks like you've gone to the trouble of writing code already. Feel free to open a PR and we can discuss there, with the help of tests/linter. No need to worry about e.g. documentation at this point.

In broad strokes, what you're suggesting here makes sense. Here are some notes.

Update Class Initialization: We need to modify class initialization to use niquests. This involves updating import statements and any requests-specific initialization code.

This is my concern about breaking changes: if arxiv.Client.results goes from being a method you can call synchronously to one that must be awaited, that'll require a major-version change to this API. That's not a non-starter, but I'd want to see more vocal support for the change from the rest of the community.

An alternative interpretation: we may be able to expose a separate async client (from arxiv.async import Client?) with async methods. In this case, we should be careful not to needlessly duplicate logic; factoring common helpers out of the synchronous Client is important.

Completely replace the requests library with niquests in the project, removing the dependency on requests. niquests is well compatible with requests.

Curious — why niquests? aiohttp and httpx appear to be more widely-adopted by orders of magnitude (just taking the first two examples I found).

Is there some specific reason to prefer niquests I'm missing?

New Asynchronous Function:

One note: we'd need to convert since_last_request/_last_request_at to something threadsafe to prevent races between separate but concurrent calls to _async_try_parse_feed.

BalconyJH commented 7 months ago

I apologize for my hastiness. I have only modified a portion of the functions because I believe presenting the code might be a more convenient way to communicate.

An alternative interpretation: we may be able to expose a separate async client (?) with methods. In this case, we should be careful not to needlessly duplicate logic; factoring common helpers out of the synchronous is important.from arxiv.async import Client``async``Client

I believe that creating a separate AsyncClient class would be a better choice. Although it will bring the foreseeable task of maintaining a new class and a similar set of parsing logic(Perhaps it would be straightforward to take feedparser.parse out of the Client class and use it as a separate helper function.), this approach might be more suitable. Additionally, considering that the original download depends on urlretrieve, this would be a good opportunity to incorporate parallel downloading capabilities as well.

Curious — why niquests? aiohttp and httpx appear to be more widely-adopted by orders of magnitude (just taking the first two examples I found).Is there some specific reason to prefer niquests I'm missing?

In fact, using httpx or niquests here doesn't make much difference in terms of functionality; they both support synchronous and asynchronous calls. If there's anything to be noted, it's perhaps as niquests itself claims: they are the "fastest" in synchronous operations and support HTTP/3. I don't have a preference for choosing between the two.

BalconyJH commented 5 months ago

Could we consider discontinuing support for Python 3.7? The latest version of httpx now requires Python 3.8 or higher.