kaaveland / pyarrowfs-adlgen2

Use pyarrow with Azure Data Lake gen2
MIT License

How to stream large file without downloading? #15

Closed. pdutta777 closed this issue 2 years ago.

pdutta777 commented 2 years ago

I am trying to read large parquet files from ADLS using pyarrow. My goal is to read each file in small chunks (pyarrow record batches). What I observe is that the entire parquet file is downloaded and then "streamed" locally. Is there an example I can use as a reference for streaming directly from ADLS?
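
For reference, a minimal sketch of this kind of chunked read, assuming the account name, container and file path below are placeholders and that pyarrowfs-adlgen2 is wired up as in its README:

```python
import pyarrow.fs
import pyarrow.parquet as pq
import pyarrowfs_adlgen2
from azure.identity import DefaultAzureCredential

# Wrap the ADLS gen2 account in a pyarrow filesystem (placeholder account name).
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'myaccount', DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)

# Open the remote file and iterate over it in small record batches
# instead of materializing the whole table at once.
with fs.open_input_file('mycontainer/path/to/large.parquet') as f:
    parquet_file = pq.ParquetFile(f)
    for batch in parquet_file.iter_batches(batch_size=10_000):
        process(batch)  # placeholder for whatever consumes each batch
```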

kaaveland commented 2 years ago

Sorry for the late response. Thinking out loud/reading some code here:

I've been using this library in my job, and I know that things work fine when selecting, say, 4 out of 30 columns: less data than the whole file is transferred. I would definitely have noticed otherwise, since I've been using this with some very large datasets. So pyarrow/pandas must be passing the correct offset and length for each read, which lets it read the parquet footer, seek to the requested columns and read only those.
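
For example (an untested sketch with placeholder paths, where `fs` is the PyFileSystem from pyarrowfs-adlgen2), a projected read like this should only fetch the footer plus the byte ranges of the selected column chunks:

```python
import pyarrow.parquet as pq

# Only the parquet footer and the column chunks for col_a and col_b
# should be downloaded, not the whole file.
table = pq.read_table(
    'mycontainer/path/to/large.parquet',
    columns=['col_a', 'col_b'],
    filesystem=fs,
)
```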

Which pyarrow API are you using, and what do your parquet files look like? My library is quite "dumb": it does no smart planning of how to stream/read files. I just expect pyarrow to seek() to the correct place and read() only as many bytes as it needs at a time, which is then forwarded to download_file.

I think the behaviour you're seeing could be related to the shape of your data and how you're trying to batch it -- arrow prefers to read 1 column at a time, and with parquet, 1 row group at a time. Frequently, even quite big files only have 1 row group. Could that be what's happening in your case?
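
One way to check this (a small sketch; `fs` and the path are placeholders) is to inspect the parquet metadata for the file:

```python
import pyarrow.parquet as pq

# Inspect how the file is laid out; a single huge row group means the
# smallest unit arrow can fetch per column is that whole row group.
with fs.open_input_file('mycontainer/path/to/large.parquet') as f:
    meta = pq.ParquetFile(f).metadata
    print('row groups:', meta.num_row_groups)
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        print(f'row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes')
```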

This is just guesswork from my side -- maybe you can provide a parquet file that triggers the problem?

pdutta777 commented 2 years ago

I started doing some experimentation on the advice of the arrow team, and basically I need to read one row group per thread (or process). So I have tested reading 5 row groups at a time, and the performance is fine reading from ADLS.

My parquet files are on average 1GB in size each and have a lot of nested columns (structs, arrays, maps, etc.); sometimes the nesting is as deep as 5 levels. I am trying to push this data into ClickHouse by converting it to JSON first, so I am opening one thread per row group per file and using iter_batches to convert the data.
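
As a reference point, the per-row-group read described here could look roughly like this (an untested sketch; `fs`, the path and the JSON handling are placeholders):

```python
import json
import pyarrow.parquet as pq

def convert_row_group(fs, path, row_group_index):
    """Read a single row group in batches and emit one JSON line per row."""
    with fs.open_input_file(path) as f:
        parquet_file = pq.ParquetFile(f)
        for batch in parquet_file.iter_batches(
                batch_size=10_000, row_groups=[row_group_index]):
            # to_pylist() yields one dict per row, including nested
            # structs/lists, which json.dumps can serialize.
            for row in batch.to_pylist():
                yield json.dumps(row, default=str)
```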

kaaveland commented 2 years ago

Okay -- it's not clear to me whether that means there's still an issue in pyarrowfs-adlgen2 or not.

I'm not sure threading will work for you. If the threading happens on the arrow side (which is C++), it probably won't -- arrow most likely has to take the GIL before it calls into any Python code.

Edit: To make this clear, pyarrowfs-adlgen2 is implemented in Python. It's a goal of the arrow project to implement support for azure blob storage / data lake gen2 in C++ eventually, but until that's done, I needed something...
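
If the GIL does turn out to be the bottleneck, a process pool sidesteps it, at the cost of each worker building its own filesystem handle; a rough sketch (account name and path are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

import pyarrow.fs
import pyarrow.parquet as pq
import pyarrowfs_adlgen2
from azure.identity import DefaultAzureCredential

PATH = 'mycontainer/path/to/large.parquet'  # placeholder

def make_fs():
    # Credentials and filesystem handles are not picklable, so each
    # worker process builds its own.
    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
        'myaccount', DefaultAzureCredential())
    return pyarrow.fs.PyFileSystem(handler)

def process_row_group(row_group_index):
    # Each call reads exactly one row group; returns the row count
    # as a stand-in for real per-row-group work.
    fs = make_fs()
    with fs.open_input_file(PATH) as f:
        parquet_file = pq.ParquetFile(f)
        return sum(
            batch.num_rows
            for batch in parquet_file.iter_batches(row_groups=[row_group_index]))

if __name__ == '__main__':
    with make_fs().open_input_file(PATH) as f:
        num_row_groups = pq.ParquetFile(f).metadata.num_row_groups
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(sum(pool.map(process_row_group, range(num_row_groups))))
```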

kaaveland commented 2 years ago

Closing this due to no updates. It's unclear to me if there's anything here I can fix.