hotosm / osm-fieldwork

Processing field data from ODK to OpenStreetMap format, and other field data collection utils.
GNU Affero General Public License v3.0
16 stars 77 forks source link

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

Open spwoodcock opened 1 month ago

spwoodcock commented 1 month ago

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

joemaren commented 1 month ago

Will you require aiohttp to run multiple threads like pySmartDL and ThreadPoolExecutor or just asynchronously?

spwoodcock commented 1 month ago

Just asynchronously using the thread pool I think 👍

aiohttp has a limit for the connection pool size for downloads, with the default being 100

RonaldRonnie commented 2 weeks ago

Does it mean that the focus is not to relying on multi-threading??? also, I think the slicing logic that splits xforms into chunks based on the number of CPU cores can remain , having looked at the code. Right? Also i would like to work on this given the procedures to test results after. I request to be assigned to this @spwoodcock .

rsavoye commented 2 weeks ago

Yes, the number of threads is based on the number of cores. I think you'd want to keep the thread pool even if you use aiohttp. For FMTM you're downloading small amounts of tiles so don't see the performance problems. I'm commonly downloading imagery for a huge area. And often the server with the imagery is bandwidth limited, so the more concurrent threads the better. I used SmartDL cause it was easy, but I don't really see a problem with it. When downloading many, many tiles, but running from the command line, I do like to monitor the progress, which may run for hours... sometimes a few days, and it's the only way I can make sure it's still working. I don't have any issues if you want to replace it with aiohttp, most of what it is doing is network access. But please keep the thread pool unless aiohttp has something similar. A quick check, it appears to be single threaded, so I think more discussion is needed before starting on anything.

spwoodcock commented 2 weeks ago

When it comes to IO-bound (network) tasks like downloading files, asyncio is much more efficient than threading.

We can't download all the files at once, as this chokes the network, so the OS manages connections for us.

The download process is the same between the two but threading creates a blocking socket (that will sit idle until the OS returns a response), while asyncio creates a non-blocking socket.

For asyncio, when a coroutine initiates some IO, the given socket is added to the selector. Then when that selector notices the socket has some actionable change, the coroutine waiting for it is continued.

Asyncio is there to basically minimise this unnecessary idle time and only action tasks when the OS sees they are ready to be completed. It does all of this on a single thread and is more resource efficient.

It's a minor benefit when running standalone, but when running on an ASGI webserver like FastAPI, as we can pass this to a single background worker thread to run until complete (instead of using up threads that could be used by other async tasks).

spwoodcock commented 2 weeks ago

The logic to slice up the tasks and get the CPU cores can be removed entirely. This can be handled by the aoihttp connection pool automatically

joemaren commented 2 weeks ago

I feel the same way. Keeping thread pool defeats the purpose of using aiohttp. It creates more complexity. aiohttp also manages multiple connections with a pool size which defaults to 100.

RonaldRonnie commented 2 weeks ago

I believe transitioning to aiohttp without the thread pool is the better approach due to its simplicity and efficiency. It aligns well with modern async practices, especially in environments where async tasks need to be managed effectively. However, I recognize that long-term monitoring and handling bandwidth limitations might require additional considerations. These could include more extensive testing and fine-tuning aiohttp’s connection pool settings to meet those specific needs.

That said, I also acknowledge @rsavoye’s valid points regarding large-scale downloads. In cases where monitoring progress over extended periods is essential, or where maximizing bandwidth on limited servers is a concern, using multiple threads may still offer advantages. However, these edge cases can often be addressed by tuning aiohttp settings (e.g., adjusting connection pool size) and implementing custom progress logging.

In summary, with proper configuration and some minor custom additions, aiohttp can be optimized to handle even more demanding scenarios without needing to revert to threading.

joemaren commented 2 weeks ago

I think @spwoodcock mentioned that the tiles to be downloaded are in kilobytes, so I don't see the concern for large scale downloads. Failed downloads can be retried. Yes, custom progress logging can still be implemented

rsavoye commented 2 weeks ago

If aiohttp has support for it's own pool, I don't have a problem with the change. I just want to have multiple network connections to the remote server so I can download multiple tiles in parallel since I'm often downloading many, many thousands of tiles and don't want it to take forever... I agree that for python, threading is more for computational tasks (like conflation), and asyncio is more for I/O bound tasks. Also I still like the verbosity when running in a terminal, so that should be optional.

rsavoye commented 2 weeks ago

Just as a note, I wrote basemapperr originally to make large basemaps of imagery and topo maps for OsmAnd that cover an entire state, like all of Colorado. I use this for navigating in rural areas where often the OSM data isn't very good.

spwoodcock commented 2 weeks ago

There is no limit to the size of the area we attempt to download with either approach.

Updating to aoihttp simply makes the downloads more resource efficient.

It may even be marginally faster.

rsavoye commented 2 weeks ago

I can test it with a large AOI when there is a PR to review.

RonaldRonnie commented 2 weeks ago

Can i go ahead and work on this then we do testing after the PR

spwoodcock commented 2 weeks ago

For sure! Go ahead 😁🙏

RonaldRonnie commented 2 weeks ago

Am I allowed to use the Caching and Resume Support.

spwoodcock commented 2 weeks ago

I'm not sure Resume functionality will provide much benefit as the files are so small.

Caching isn't actually necessary either, as the files on disk are essentially our cache.

The flow should be like this:

RonaldRonnie commented 2 weeks ago

Hi @spwoodcock , is it okay to implement asyncio.gather to run multiple asycnronous download tasks concurrrent??

joemaren commented 2 weeks ago

You don't need to use asyncio.gather() Use asyncio.create_task() to create a task object and append to it.

spwoodcock commented 2 weeks ago
async def aiohttp_get(session: aiohttp.ClientSession, url: str):
    async with session.request("GET", url) as res:
        data = await res.read()
        return data

async with aiohttp.ClientSession() as session:
    await asyncio.gather(*(aiohttp_get(session, url) for url in urls))

I haven't tested the code above, but hopefully it gives an idea how to solve.

Using asyncio.gather is good 👍

Be sure to create the ClientSession once, outside of the async function that is pooled using gather (so the same connection pool is reused and not recreated each download)