rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Incorrect read_parquet on Spark-distributed Parquet files #16968

Open · matt7salomon opened this issue 1 month ago

matt7salomon commented 1 month ago

When I read a Parquet dataset saved with Spark to a Databricks catalog, I get lots of <NA> values. I tried

import glob

import cudf

cudf_dfs = [cudf.read_parquet(file) for file in glob.glob("/Volumes/path/*.parquet")]
cudf_df = cudf.concat(cudf_dfs, ignore_index=True)

and

import dask_cudf
dask_df = dask_cudf.read_parquet("/Volumes/path/*.parquet", chunksize='50MB')
cudf_df = dask_df.compute()

I also tried using the pyarrow engine for the load, which overflows memory.
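Roughly what that attempt looked like (a sketch; the glob is the same as above, and engine="pyarrow" is the documented alternative to the default cudf engine):

import glob

import cudf

# Same file list as before, but decoded through pyarrow rather than the cudf engine.
cudf_dfs = [
    cudf.read_parquet(f, engine="pyarrow")
    for f in glob.glob("/Volumes/path/*.parquet")
]
cudf_df = cudf.concat(cudf_dfs, ignore_index=True)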

Let me know if you need code to reproduce, but it should be easy; any Spark-written Parquet dataset would do.
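For instance, a minimal reproducer sketch (the output path, row count, and isna check are illustrative, not from my actual data):

import glob

import cudf
from pyspark.sql import SparkSession

# Write a small Spark-generated Parquet dataset.
spark = SparkSession.builder.getOrCreate()
spark.range(1_000_000).write.parquet("/tmp/spark_parquet")

# Read the part files back with cuDF and count unexpected missing values per column.
cudf_df = cudf.read_parquet(glob.glob("/tmp/spark_parquet/*.parquet"))
print(cudf_df.isna().sum())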

vyasr commented 1 month ago

Hi Matt, thanks for pointing this out. I'll touch base with someone who has Spark-dumped Parquet files and see what we can find.

rjzamora commented 1 month ago

Are you getting an error from using chunksize? That is not a supported read_parquet argument.

How big are your files? You can use blocksize='50MB' (instead of chunksize), but calling compute will always overflow memory if the dataset is larger than the memory of your first visible device.

import dask_cudf
dask_df = dask_cudf.read_parquet("/Volumes/path/*.parquet", blocksize='50MB')
dask_df.head()  # Does this work?
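If head() works, the read itself is probably fine and the overflow comes from compute() materializing the whole frame on one GPU. A sketch of keeping the work lazy instead (describe() here just stands in for whatever aggregation you actually need):

# Reduce lazily so only small results are materialized, not the full dataset.
nrows = len(dask_df)                    # row count, evaluated partition by partition
summary = dask_df.describe().compute()  # small summary frame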

It would also be good to know which versions of cudf and dask-cudf you are using.
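You can print them with:

import cudf
import dask_cudf

# Both packages expose standard version strings.
print(cudf.__version__, dask_cudf.__version__)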