Open TheWCKD opened 1 year ago
Hi @TheWCKD, thanks for reporting!
Could you please provide the error message you're encountering? It would help with diagnosing the problem.
Also, could you share more detail about what is running on http://127.0.0.1:9000? I cannot reproduce the issue locally without that info.
Hi @vnlitvinov! 127.0.0.1:9000 is my database; I'm running a query against it that returns a downloadable CSV file with the query results. These are the errors I'm getting:
@TheWCKD, can you run ray.init() after declaring the var with url? I suspect workers may not see the url.
Same issue, even when I did what you mentioned.
import modin.pandas as pd
import ray
url = 'http://127.0.0.1:9000/exp?query=SELECT+pair%2C+price%2C+ts+FROM+%27klines_1s%27+WHERE+ts+%3E%3D+%272022-01-01T00%3A00%3A00.000000Z%27+AND+ts+%3C+%272022-01-02T00%3A00%3A00.000000Z%27'
ray.init(
    num_cpus=6,
    ignore_reinit_error=True,
    runtime_env={"pip": ["modin"], "env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}},
)
seconds = pd.read_csv(url)
Seems like the problem starts from the aiohttp package? Here's the full stack trace:
But it's strange, because making an HTTP GET request with the aiohttp package alone works perfectly:
import aiohttp
import asyncio

url = "http://127.0.0.1:9000/exp?query=SELECT+pair%2C+price%2C+ts+FROM+%27klines_1s%27+WHERE+ts+%3E%3D+%272022-01-01T00%3A00%3A00.000000Z%27+AND+ts+%3C+%272022-01-02T00%3A00%3A00.000000Z%27"

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            print(resp.status)

asyncio.run(main())  # prints 200
I had the same issue with the Dask engine too. However, they mention that read_csv from Dask supports only a few kinds of URL, and I can't see one starting with plain http://
However, pandas read_csv states it accepts any http:
You might imagine downloading the csv on the hard drive every time I run a query on my database just to have it read by modin as a .csv file is really inefficient.
@TheWCKD, can you try to run your example with Dask engine (MODIN_ENGINE=dask) to see if the issue still persists?
Yup, same error. I ran the following:
import modin.pandas as pd
import modin.config as cfg
cfg.Engine.put('dask')
url = 'http://127.0.0.1:9000/exp?query=SELECT+pair%2C+price%2C+ts+FROM+%27klines_1s%27+WHERE+ts+%3E%3D+%272022-01-01T00%3A00%3A00.000000Z%27+AND+ts+%3C+%272022-01-02T00%3A00%3A00.000000Z%27'
seconds = pd.read_csv(url)
and I'm still getting the same error:
Hmm, strange. @vnlitvinov, did you reproduce the error?
Meanwhile, @TheWCKD, I wonder if you are setting any vars regarding proxy (http_proxy, https_proxy)? Could you try to unset those?
unset http_proxy
unset https_proxy
Also, I wonder if one of the workers listens to the same port (i.e., 9000)? Can you check it for Dask engine, for example?
import modin.pandas as pd
import modin.config as cfg
from distributed.client import Client
client = Client(n_workers=<your_num_cpus>, threads_per_worker=1)
client.scheduler_info()["workers"]
cfg.Engine.put('dask')
url = 'http://127.0.0.1:9000/exp?query=SELECT+pair%2C+price%2C+ts+FROM+%27klines_1s%27+WHERE+ts+%3E%3D+%272022-01-01T00%3A00%3A00.000000Z%27+AND+ts+%3C+%272022-01-02T00%3A00%3A00.000000Z%27'
seconds = pd.read_csv(url)
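To turn the output of `client.scheduler_info()["workers"]` into a direct answer about port 9000, the worker addresses can be parsed for their ports. This is a minimal sketch that assumes the keys of that mapping are address strings such as `tcp://127.0.0.1:41234`. The helper name `worker_ports` is hypothetical, and it only sees the main worker port, not any extra ports a worker may open.

```python
from urllib.parse import urlsplit


def worker_ports(worker_addresses):
    """Return the set of TCP ports found in a collection of worker addresses."""
    ports = set()
    for address in worker_addresses:
        # urlsplit parses "scheme://host:port" and exposes the port as an int.
        port = urlsplit(address).port
        if port is not None:
            ports.add(port)
    return ports
```

With a running client, `9000 in worker_ports(client.scheduler_info()["workers"])` would then show whether any worker address uses the database port.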
Hello again, @YarShev, thanks for holding onto the issue! I don't have any proxy settings, and as I said, the read_csv function from pandas works perfectly fine with the localhost url.
I have even tested it with a REST GET request and this is what I got, maybe it helps:
I have run your code above and no workers seem to listen to port 9000 as you can see below, and I still got the same error, really, really strange...
Yes, it is really confusing. I wonder if the issue persists with every distributed engine Modin has? Can you try to run Modin with unidist? Here is some info regarding the run of Modin with unidist https://modin.readthedocs.io/en/stable/development/using_pandas_on_unidist.html.
@YarShev I cannot repro this, as I don't have that specific local server running the SQL-over-http thing. My intuition says that it reacts with a 404 on some repeated request (e.g. when a worker wants to download a piece of the file for parsing) or something similar.
@TheWCKD could you run this code piece and report back?
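from modin.core.io.file_dispatcher import OpenFile

# your_url is the same URL string passed to pd.read_csv above
with OpenFile(your_url) as f:
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    blob = f.read(4096)
    # bytes has a .hex() method; the built-in hex() only accepts integers
    print(pos, line, blob.hex())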
You might imagine downloading the csv on the hard drive every time I run a query on my database just to have it read by modin as a .csv file is really inefficient.
Also, note that this is only a single piece of the puzzle - pandas and Modin work differently with such URLs. Modin would try to open your URL multiple times from different workers and would try to read chunks from it. Note that, if your server does not cache responses, this would most likely make it execute the query multiple times. So, to make things efficient, one would have to cache the response on a server side to make sure workers are getting exact same output each (maybe by introducing some middleware on the http server).
Having said that, I see that the server seems to be storing some SQL, is there any particular reason you cannot tell Modin to read from SQL directly without an intermediate http server?
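One way to check whether a server is likely to cope with several workers each fetching a slice of the response is to look at its response headers. The sketch below is a heuristic under stated assumptions: it treats a numeric `Content-Length` plus `Accept-Ranges: bytes` as the sign that byte-range reads are possible. The function names are hypothetical, the server may not answer HEAD requests at all, and the exact requirements of Modin or fsspec are not confirmed here.

```python
import urllib.request


def can_read_in_chunks(headers):
    """Return True if the headers suggest byte-range reads are possible."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    has_size = lowered.get("content-length", "").strip().isdigit()
    accepts_ranges = lowered.get("accept-ranges", "").strip().lower() == "bytes"
    return has_size and accepts_ranges


def probe(url, timeout=10):
    """Send a HEAD request and return the response headers as a plain dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

Calling `can_read_in_chunks(probe(url))` against the query URL gives a quick hint; a `False` result would be consistent with a server that streams the result once and cannot serve it in pieces.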
I ran this exact piece of code and I'm still receiving the same errors:
BadHttpMessage: 400, message='Expected HTTP/'
FileNotFoundError: http://127.0.0.1:9000/exp?query=SELECT+pair%2C+price%2C+ts+FROM+%27klines_1s%27+WHERE+ts+%3E%3D+%272022-01-01T00%3A00%3A00.000000Z%27+AND+ts+%3C+%272022-01-02T00%3A00%3A00.000000Z%27
Thank you for your extensive review. I am using a fast time-series database service called questdb and the fastest way to query the enormous amount of data I have is through their HTTP REST API (https://questdb.io/docs/develop/query-data/#http-rest-api) by returning a csv. I have previously tried the read_sql methods from pandas, however the database doesn't support postgres server side cursors and I'm unable to read any data this way.
I guess I'll have to stick to pandas read_csv then, perhaps you're right and the database supports only one request for the entire query. Thank you for all your help!
I don't know what questdb is, but their docs claim to support Postgres cursors: https://questdb.io/docs/develop/query-data/#postgresql-wire-protocol, maybe using .read_sql_query() could work?
Also, one could compose a rather simple intermediate caching server in Python to put in between your questdb and Modin code... or even just download that URL to a temporary file in your script, read that and remove the file.
With all that, my gut feeling tells me that the bug you're seeing is probably either a bug in fsspec or a problem in questdb (i.e. Modin itself is not directly to blame here).
Reference: https://github.com/fsspec/filesystem_spec/pull/701 - if you look at that PR, you'll see that fsspec errors out if it cannot query file size over HTTP, and I'm not sure questdb reports the size of the query result... it might not even support chunked download of the result, which would mean Modin won't be able to parse it in a parallel way.
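The temporary-file route could look roughly like the sketch below: the URL is fetched exactly once, so the query runs a single time, and the reader only ever sees a local path. The helper `read_via_tempfile` and its `reader` parameter are hypothetical names, and only the standard library is used for the download. Whether every engine finishes parsing before `read_csv` returns is not confirmed here, so the reader should fully load the data before the file is removed.

```python
import os
import shutil
import tempfile
import urllib.request


def read_via_tempfile(url, reader, suffix=".csv"):
    """Download `url` once, pass the local path to `reader`, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Stream the response to disk so a large result is never held in memory.
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, out)
        return reader(path)
    finally:
        # Always clean up, even if the download or the reader fails.
        os.remove(path)
```

With Modin this would be called as `read_via_tempfile(url, pd.read_csv)` after `import modin.pandas as pd`.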
Yup, it supports cursors, but not server side cursors (https://www.psycopg.org/psycopg3/docs/advanced/cursors.html).
You're probably right, questdb may not respond with the size of the query result. I'll make sure to ask them these specific questions on Slack. For now I'll stick with pandas and if I manage to find out the culprit I'll make sure to leave a comment on this issue. Thanks for your time and help! I'll close this for now.
Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[x] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Expected Behavior
Pandas read_csv function works perfectly for this example, however Modin doesn't.