matt7salomon opened this issue 1 month ago
Hi Matt, thanks for pointing this out. I'll touch base with someone who has Spark-dumped Parquet files and see what we can find.
Are you getting an error for using `chunksize`? This is not a supported `read_parquet` argument.
How big are your files? You can use `blocksize='50MB'` (instead of `chunksize`), but calling `compute` will always overflow memory if your first visible device has less memory than the dataset.
```python
import dask_cudf

dask_df = dask_cudf.read_parquet("/Volumes/path/*.parquet", blocksize='50MB')
dask_df.head()  # Does this work?
```
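If the full result doesn't fit on the first visible GPU, one way around calling `compute` on the whole frame is to reduce or write out per partition instead. A minimal sketch, using a placeholder column name and output path that aren't from the original report:

```python
import dask_cudf

# Small partitions keep any single chunk within GPU memory.
ddf = dask_cudf.read_parquet("/Volumes/path/*.parquet", blocksize='50MB')

# Aggregations reduce partition by partition, so only the small result
# is materialized on the GPU rather than the whole dataset.
print(ddf["some_column"].mean().compute())  # "some_column" is a placeholder

# Or write results back out without ever collecting everything at once.
ddf.to_parquet("/Volumes/output_path/")  # illustrative output path
```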
It would also be good to know what version of cudf/dask-cudf this is.
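For reference, something like this prints both versions (a trivial sketch):

```python
import cudf
import dask_cudf

# Report the installed versions so they can be matched against known issues.
print("cudf:", cudf.__version__)
print("dask-cudf:", dask_cudf.__version__)
```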
When I read in a Parquet dataset saved with Spark on a Databricks catalog, I get lots of […].
I tried […] and […], and I also tried using the pyarrow engine on the load, which overflows memory.
Let me know if you need some code to replicate, but it should be easy: any Spark-based Parquet would do.
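For a rough reproduction without Spark itself, a multi-file Parquet directory (Spark writes one file per task) can be generated with dask and read back the same way. This is only a sketch with illustrative paths, and it won't carry any Spark-specific metadata:

```python
import pandas as pd
import dask.dataframe as dd
import dask_cudf

# Write a small multi-file Parquet dataset, mimicking Spark's one-file-per-task layout.
pdf = pd.DataFrame({"id": range(1_000_000), "value": range(1_000_000)})
dd.from_pandas(pdf, npartitions=8).to_parquet("/tmp/spark_like_parquet")

# Read it back the way the issue describes.
ddf = dask_cudf.read_parquet("/tmp/spark_like_parquet/*.parquet", blocksize='50MB')
print(ddf.head())
```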