pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add support for HDFS to `scan_parquet` #16064

Open barak1412 opened 4 months ago

barak1412 commented 4 months ago

Description

As described in the title, HDFS support for the `scan_parquet` function would be welcome.

The alternative, `scan_pyarrow_dataset`, is not enough since it doesn't support streaming.
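For reference, the current workaround looks roughly like the sketch below. It assumes PyArrow's `HadoopFileSystem` is usable in the environment (libhdfs installed and configured); the helper name `scan_hdfs_parquet` and the host and path in the usage note are hypothetical placeholders, not part of Polars.

```python
def scan_hdfs_parquet(path, host, port=8020):
    """Lazily scan Parquet data on HDFS through a PyArrow dataset.

    Hypothetical helper: imports are deferred so the function can be
    defined even where polars/pyarrow are not installed.
    """
    import polars as pl
    import pyarrow.dataset as ds
    from pyarrow import fs

    # Connect to the namenode; this needs libhdfs and a reachable cluster.
    hdfs = fs.HadoopFileSystem(host, port)

    # Build a PyArrow dataset over the Parquet files under `path`.
    dataset = ds.dataset(path, filesystem=hdfs, format="parquet")

    # Wrap the dataset in a Polars LazyFrame.
    return pl.scan_pyarrow_dataset(dataset)
```

Usage would be something like `scan_hdfs_parquet("/data/events", host="namenode.example.com").collect()`. This route goes through PyArrow rather than the native Parquet reader, which is why it does not cover the streaming case.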

Would an fsspec fallback be an option?

Thanks in advance.

bruriah1999 commented 4 months ago

+1

ion-elgreco commented 3 months ago

Might be possible with: https://github.com/Kimahriman/hdfs-native

santosh-d3vpl3x commented 2 months ago

@ion-elgreco indeed!

Is this something the Polars maintainers would see as a valuable addition?

barak1412 commented 2 months ago

@santosh-d3vpl3x I'm sure they would. @ion-elgreco How much effort would it take?

santosh-d3vpl3x commented 2 months ago

Apparently there is an `accepted` tag that indicates whether a feature has been accepted.

I am not sure what it takes for this feature to get that tag, and I would rather find out before we sink a lot of effort into it only for it to be rejected. Perhaps we should first discuss the feasibility and the possible approaches, so that we can be confident.

From what I have heard on the Polars Discord, @ritchie46 usually performs the triage.

barak1412 commented 2 months ago

I understand. Are you familiar with Polars' object store code?