pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars blocks when running async functions in Dash #10190

Open david-waterworth opened 1 year ago

david-waterworth commented 1 year ago

Checks

Reproducible example

At a high level, I have a worker task that I'm calling from a Dash app.

```python
async def task():
    ids = await query_ids()

    for id in ids:
        data = await query_data(id=id, start=start_datetime, end=end_datetime)

        df = pl.DataFrame(
            data,
            schema={'id': pl.Int64(), 'ts': pl.Utf8(), 'data': pl.Float64()},
        )

        df = df.with_columns(pl.col('ts').str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.3f%Z"))
        df = df.with_columns(pl.col('ts').dt.replace_time_zone("UTC"))
```

It's triggered within a Dash background callback via `asyncio.run(task())`.

Issue description

It was working, and it still works fine in a notebook, but a slight code refactor has resulted in some form of deadlock on the first `df.with_columns` call. I've verified the contents of the original (JSON) response from my API, and the df is constructed correctly.

I'm really not sure how to diagnose this further. I've set `POLARS_MAX_THREADS=1`, but that didn't help. I'm not sure what else I can provide, but I'm happy to perform any suggested diagnosis.

Expected behavior

Doesn't hang

Installed versions

```
--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Python:              3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          0.3.1
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.7.2
numpy:               1.24.4
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            1.10.11
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
```
ritchie46 commented 1 year ago

What does your event loop do? Is that multi-threaded? Multi-process?

david-waterworth commented 1 year ago

I don't know, I didn't create one myself - I guess there's one created by Dash (i.e. https://dash.plotly.com/background-callbacks)

I'm using the local callback manager, not Redis or Celery, i.e.

    import diskcache
    cache = diskcache.Cache("./cache")
    background_callback_manager = DiskcacheManager(cache)

And inside the callback I use asyncio.run() to execute the task that contains the polars code.
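For context, the calling pattern looks roughly like this (a minimal sketch; `background_callback` and the task body are illustrative stand-ins, not the real code):

```python
import asyncio

async def task():
    # stand-in for the real async worker that builds the Polars DataFrame
    await asyncio.sleep(0)
    return "done"

def background_callback(n_clicks):
    # Dash background callbacks are synchronous, so the async task is
    # driven to completion on its own event loop via asyncio.run
    return asyncio.run(task())

print(background_callback(1))  # done
```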

Tomorrow I'll replace the background-callbacks with a standard callback and see if it works.

david-waterworth commented 1 year ago

I think it's multi-process based on an exception trace when I aborted a hang, and htop.

So far I've not found an example of using Dash background callbacks where the callback function is async, so what I've done for now is rewrite the task so it's not async. That solved the issue. I'll post on the Dash forum to find out the proper way of calling async background workers (I suspect you may have to start them using the `background_callback_manager` rather than `asyncio.run`), and I'll report back any feedback.

ritchie46 commented 1 year ago

If they use multiprocessing, it could be that they use fork, which is incorrect if the parent process has any concurrency/mutexes/threads running. I don't understand how that could have been a reasonable default.

https://pola-rs.github.io/polars-book/user-guide/misc/multiprocessing/
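The linked guide recommends the "spawn" start method, which starts a fresh interpreter instead of fork()ing a parent whose thread pool and locks get copied in an inconsistent state. A minimal sketch of that pattern (the worker body is a plain stand-in here, not Polars code):

```python
import multiprocessing

def worker(q):
    # In the real app this is where the Polars work would run; any library
    # that spins up threads or holds mutexes in the parent is unsafe to fork into.
    q.put(sum(range(10)))

def run_in_spawned_child():
    # "spawn" starts a clean child process; on Linux the default is "fork",
    # which is what the Polars multiprocessing guide warns against.
    ctx = multiprocessing.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    result = q.get()
    p.join()
    return result

if __name__ == "__main__":
    print(run_in_spawned_child())  # 45
```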

david-waterworth commented 1 year ago

Yeah, once I got it working I moved on to trying to use Celery as the back-end, and it broke again with Celery's default prefork pool, but works fine with solo (https://distributedpython.com/posts/celery-execution-pools-what-is-it-all-about/, https://distributedpython.com/posts/celery-pool-types-part-1-the-worker-and-the-pool/).

Tomorrow I'll try and see if I can recreate using a Celery only job (i.e. without Dash) and synthetic data.

david-waterworth commented 1 year ago

It also works with the Celery threads pool. And when running the exact same task directly as a Celery job (i.e. without Dash, which uses Flask), there's no problem even with the prefork pool.
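For reference, the pool is selected on the worker command line (`--pool` is a standard Celery worker option; `app` is a placeholder for the actual application module):

```shell
# prefork is the default on Linux; it fork()s workers, which is
# the configuration that deadlocked in this thread
celery -A app worker --pool=prefork

# these two avoid fork() and worked here
celery -A app worker --pool=solo
celery -A app worker --pool=threads
```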

deanm0000 commented 1 year ago

This is a bit of a tangent, but I would recommend dumping Dash entirely and just picking up JS and React for making front ends, with a FastAPI backend. Dash (in contrast to something like Shiny for R or Streamlit for Python) is already a light wrapper of JSX on React, so instead of `html.Div(children)` you have `<div>{children}</div>`. Obviously that is an oversimplification, but your UI definitions will look pretty similar. Additionally, every Dash extension is just a React library that somebody wrote a wrapper for, so there are way more things you can do in React vs Dash, and nothing that only works in Dash.

The Dash forums will tell you that they're in a never-ending compatibility battle with Flask, where most Flask updates break something in Dash. Flask, as you know, isn't natively async or parallel, whereas FastAPI is and was designed that way from the start. Its syntax is almost the same as Flask's, where you just decorate endpoints with routes.

Unfortunately, the Node.js version of Polars (seemingly) only works in a Node server, not a browser, but DuckDB has a WebAssembly version that does run in the browser, so there's a lot of work that can be offloaded to the browser if you so choose.

Rant over.

Not really Dash-related, but just my intuition: shouldn't you do `asyncio.create_task(query_data(...))` and then await all those tasks? With the await in the body of the for loop, nothing runs in parallel or concurrently. Unless your issue is that other unrelated requests are hanging because of this, but that kind of goes back to Dash's design of being inherently synchronous.
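The suggestion above can be sketched with `asyncio.gather`, which schedules all the coroutines concurrently and awaits them together (`query_data` here is a hypothetical stand-in for the real API call):

```python
import asyncio

async def query_data(id):
    # hypothetical stand-in for the real awaitable API call
    await asyncio.sleep(0)
    return id * 2

async def task():
    ids = [1, 2, 3]
    # launch all queries at once and await them together,
    # instead of awaiting one per loop iteration
    return await asyncio.gather(*(query_data(id) for id in ids))

print(asyncio.run(task()))  # [2, 4, 6]
```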

david-waterworth commented 1 year ago

Thanks @deanm0000. If I was building production UIs, I would probably take the time to learn React, but I'm just creating prototypes for machine-learning-based interactive workflows. I have already found Dash to be "unstable" though; the stuff I was using was refactored significantly, making it really hard to work out why my dashboard suddenly broke!

The main advantage of using the built-in Dash background callback is that there's support for progress reporting, as well as being non-blocking. I've found when trying to create even the most basic prototype that invokes any sort of significant (long-running) ML computation, you need a queue to ensure you don't run out of server resources, and some sort of progress indicator to stop users from reloading the page thinking it's failed.

Basically, building a simple prototype wasn't overly simple, particularly when the analytics involves a large dataset being fed through a pipeline consisting of Polars steps in between different ML models.