Open AugustDev opened 2 months ago
@AugustDev what's your compute machine configuration? The error message states the cause: at one point streaming created a spanner of 27084389376 integers at 8 bytes each, which comes to about 217 GB (202 GiB). If you have less CPU memory (RAM) than that, an OOM is pretty much expected.
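For anyone who wants to check the arithmetic, here is a minimal sketch of the size calculation. The helper name `array_footprint` is made up for illustration and is not part of the streaming library; the only inputs taken from the error message are the element count and the 8-byte `int64` item size.

```python
def array_footprint(num_elements: int, itemsize: int = 8) -> tuple:
    """Return (bytes, decimal GB, binary GiB) for a flat array.

    itemsize defaults to 8 because the error reports dtype int64.
    """
    total_bytes = num_elements * itemsize
    return total_bytes, total_bytes / 1e9, total_bytes / 2**30


# Element count copied from the MemoryError message.
total_bytes, gb, gib = array_footprint(27084389376)
print(f"{total_bytes} bytes = {gb:.1f} GB = {gib:.1f} GiB")
```

The GiB figure is the one NumPy reports in the error; the decimal GB figure is the one to compare against RAM sizes quoted in GB.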
Loading large dataset gives an error: "MemoryError: Unable to allocate 202. GiB for an array with shape (27084389376,) and data type int64"
Environment
To reproduce
I have a very large dataset that I converted following the "Spark to MDS" tutorial on the MosaicML website. The dataset is on a disk mounted to my machine. I am able to load the eval dataset (which is much smaller), but loading the train dataset gives an error.
When loading the dataset I get an error
Expected behavior
Dataset loads.
Additional context
My index.json is 243 MB for the train dataset. I have used the following Spark settings to convert to MDS: