Open mdrach opened 2 years ago
The APIs are out of sync.
Note that the reason for column_iter is that it allows for nested Parquet types. An alternative is to offer a page stream per Parquet column and have users assemble the columns themselves into the corresponding Arrow type, but I think that would require us to expose a larger (currently private) API and write more documentation.
Would you like to tackle this one, or do you think I should prioritize it?
If you could prioritize it, that would be great. I may be able to get to this, but likely not in the short term.
I have started working on this. The first change is in parquet2, since that is where we declare these APIs.
In v0.7.0 I could stream in pages of a Parquet column chunk in an async context, then move the data into a dedicated thread pool to perform the CPU-intensive work.
However, as of v0.8.0, page_iter_to_array has been replaced by column_iter_to_array, while the async API does not expose a corresponding get_column_stream (only get_page_stream). Is there a better way to load and parse a Parquet file from S3? Or are the APIs just out of sync?
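
For readers unfamiliar with the v0.7.0 pattern described above (fetch pages on the I/O side, then hand them to a dedicated pool for the CPU-heavy decoding), here is a minimal std-only sketch of that hand-off. It does not use arrow2 or parquet2: `Page`, `decode`, and `run_pipeline` are hypothetical names, plain threads and channels stand in for the async runtime and the thread pool, and the "decoding" is a placeholder byte sum.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a compressed page of one column chunk,
/// as it might arrive from an async page stream.
struct Page {
    column: usize,
    bytes: Vec<u8>,
}

/// Hypothetical stand-in for the CPU-intensive step
/// (decompression and decoding into an array).
fn decode(page: &Page) -> u64 {
    page.bytes.iter().map(|&b| b as u64).sum()
}

/// Feeds `pages` through a channel to `workers` decoding threads and
/// returns `(column, decoded value)` pairs sorted by column, then value.
fn run_pipeline(pages: Vec<Page>, workers: usize) -> Vec<(usize, u64)> {
    let (page_tx, page_rx) = mpsc::channel::<Page>();
    let (out_tx, out_rx) = mpsc::channel::<(usize, u64)>();
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let page_rx = Arc::new(Mutex::new(page_rx));

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let rx = Arc::clone(&page_rx);
            let tx = out_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so decoding below runs without holding the lock.
                let page = match rx.lock().unwrap().recv() {
                    Ok(page) => page,
                    Err(_) => break, // sender dropped: no more pages
                };
                tx.send((page.column, decode(&page))).unwrap();
            })
        })
        .collect();
    // Drop the original sender so `out_rx` closes once all workers finish.
    drop(out_tx);

    // I/O side: in the real setting this loop would await pages from the stream.
    for page in pages {
        page_tx.send(page).unwrap();
    }
    drop(page_tx);

    for handle in handles {
        handle.join().unwrap();
    }
    let mut results: Vec<(usize, u64)> = out_rx.iter().collect();
    results.sort();
    results
}
```

Calling `run_pipeline` with two pages holding the bytes `[1, 2, 3]` (column 0) and `[10, 20]` (column 1) yields `[(0, 6), (1, 30)]`. The point of the split is that the I/O side never blocks on decoding; in an async application the same idea is usually expressed by sending the fetched buffers to a blocking-task pool rather than decoding them on the runtime's own threads.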