pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Testing with scan_parquet doesn't work anymore from within `io/cloud/test_aws.py` #11528

Open svaningelgem opened 1 year ago

svaningelgem commented 1 year ago


Reproducible example

Just re-add `(pl.scan_parquet, "parquet"),` to the parameters of `test_scan_s3`.

(removed by @ritchie46 in PR #11210 )

Log output

```
exceptions.ComputeError: Generic S3 error: response error "request error", after 0 retries: builder error for url (http://127.0.0.1:5000/bucket/foods1.parquet): URL scheme is not allowed
```

Issue description

The call fails, I believe because the object_store crate refuses plain `http` URLs by default. So, following the object_store docs, I added:

```python
    # monkeypatch_module.setenv("AWS_ENDPOINT", f"http://{host}:{port}")
    monkeypatch_module.setenv("AWS_ALLOW_HTTP", "true")
```

to the `s3_base` fixture in the same file. (I tried with the endpoint line both enabled and disabled.)
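For reference, a minimal sketch of the environment the fixture would need to set (the helper name and the dummy credentials are illustrative, not the exact contents of `test_aws.py`):

```python
def s3_test_env(host: str, port: int) -> dict[str, str]:
    """Environment variables for pointing object_store at a local moto endpoint.

    AWS_ALLOW_HTTP is the key piece: without it object_store rejects
    plain-http URLs with "URL scheme is not allowed".
    """
    return {
        "AWS_ENDPOINT": f"http://{host}:{port}",
        "AWS_ALLOW_HTTP": "true",  # permit plain HTTP for the local server
        # dummy credentials so the client does not go looking for real ones
        "AWS_ACCESS_KEY_ID": "testing",
        "AWS_SECRET_ACCESS_KEY": "testing",
    }


# Hypothetical application inside the existing s3_base fixture:
#
#     for key, value in s3_test_env("127.0.0.1", 5000).items():
#         monkeypatch_module.setenv(key, value)
```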

But this just locked up (deadlocked?) the test, i.e.:

```
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.csv HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.ipc HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.parquet HTTP/1.1" 200 -
Terminated
```

The `Terminated` is there because I killed the process myself after a minute or so.

This is fairly similar to #11372, but I opened this new issue because here I focus purely on the testing side.

Expected behavior

I would expect `scan_parquet` to return a LazyFrame.
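Roughly, the expected call looks like this. The option key names below are assumptions based on object_store's S3 config keys (polars forwards `storage_options` to object_store), and the endpoint/bucket values are the ones from the test log:

```python
def local_s3_options(host: str, port: int) -> dict[str, str]:
    """object_store-style storage options for a local HTTP moto endpoint.

    "aws_allow_http" is what should avoid the "URL scheme is not
    allowed" error for a plain-http endpoint (assumed key name).
    """
    return {
        "aws_endpoint_url": f"http://{host}:{port}",
        "aws_allow_http": "true",
        "aws_access_key_id": "testing",
        "aws_secret_access_key": "testing",
    }


def scan_local_s3(path: str, host: str = "127.0.0.1", port: int = 5000):
    """Sketch: build the LazyFrame; no data is read until .collect()."""
    import polars as pl  # imported lazily so the helper above stays stdlib-only

    return pl.scan_parquet(path, storage_options=local_s3_options(host, port))
```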

Installed versions

```
(main branch)
--------Version info---------
Polars:              0.19.7
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_sqlite:  0.7.0
cloudpickle:         2.2.1
connectorx:          0.3.2
deltalake:           0.10.1
fsspec:              2023.9.2
gevent:              23.9.1
matplotlib:          3.8.0
numpy:               1.26.0
openpyxl:            3.1.2
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            2.4.2
pyiceberg:           0.5.0
pyxlsb:              1.0.10
sqlalchemy:          2.0.21
xlsx2csv:            0.8.1
xlsxwriter:          3.1.6
```
ritchie46 commented 1 year ago

It is because object_store tries to connect to AWS. This has more to do with making this work with moto testing than with an actual bug in the AWS connection code.

svaningelgem commented 1 year ago

Indeed, but if it's not tested, how can we (read: I) improve on it? 😁

I'm trying to make `sink_parquet` work with the object_store code (ticket #11056), but if I can't test it, I can't fix it. And I don't know Rust that well (better now that I'm digging into it, but still)... So if it's not too much of an issue:

Thanks

TylerGrantSmith commented 11 months ago

@svaningelgem I observed the same issue while trying to use a `ThreadedMotoServer`. You can get this to work if you launch `moto_server` as a subprocess instead. I am currently using that as a workaround for polars + S3 testing in Python.
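A sketch of that workaround, assuming `moto_server` is on `PATH` (the port, timeout, and readiness check are illustrative choices, not the exact setup used above):

```python
import socket
import subprocess
import time


def moto_server_cmd(port: int) -> list[str]:
    # moto's server mode exposes a `moto_server` entry point with a
    # -p/--port flag; older versions also took a service argument.
    return ["moto_server", "-p", str(port)]


def start_moto_server(port: int = 5000) -> subprocess.Popen:
    """Launch moto_server as a real subprocess (not ThreadedMotoServer)
    and wait until it accepts TCP connections."""
    proc = subprocess.Popen(moto_server_cmd(port))
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=0.2):
                return proc
        except OSError:
            time.sleep(0.1)
    proc.terminate()
    raise RuntimeError("moto_server did not start in time")
```

In a pytest fixture this would be started once per session, yielded, then shut down with `proc.terminate()` and `proc.wait()` at teardown.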