Open polinaeterna opened 8 months ago
Indeed, it sounds like a bug. For example, https://huggingface.co/datasets/vivym/midjourney-messages/tree/refs%2Fconvert%2Fparquet/default/partial-train shows only 10 Parquet files of ~160 MB each. cc @lhoestq
Or maybe it's 5GB of decompressed data?
It's 5GB of uncompressed data indeed
@lhoestq is this what we want?
I think so, otherwise it could be possible to have a 1TB dataset stored in a 1MB Parquet file and most of our jobs couldn't handle it ^^
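For reference, the compressed/uncompressed distinction is already recorded in the Parquet file metadata itself. A minimal sketch (not necessarily how datasets-server computes it) of reading both numbers with pyarrow, for a hypothetical local file:

import pyarrow.parquet as pq

def parquet_sizes(path: str) -> tuple[int, int]:
    """Return (compressed, uncompressed) byte counts from row-group metadata."""
    meta = pq.ParquetFile(path).metadata
    compressed = 0
    uncompressed = 0
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        uncompressed += row_group.total_byte_size  # decoded (uncompressed) size
        for col in range(row_group.num_columns):
            compressed += row_group.column(col).total_compressed_size
    return compressed, uncompressed

# c, u = parquet_sizes("0000.parquet")  # hypothetical file name
# print(f"compressed: {c/1e9:.2f} GB, uncompressed: {u/1e9:.2f} GB")

This is why a limit defined on the compressed file size alone would not say much about the amount of data a job actually has to process.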
Good point. So: it's an issue for moonlanding, more than for datasets-server. But... I'm not sure we want to increase the message length AGAIN
https://github.com/huggingface/moon-landing/pull/8593#issuecomment-1883553457
Currently, on https://huggingface.co/datasets/timm/imagenet-12k-wds:
Size of the auto-converted Parquet files (First 5GB per split):
9.99 GB
where "First 5GB" must be understood as "first 5GB of uncompressed data".
What should we show?
Maybe
Size of the auto-converted Parquet files (First 5GB per split) (?):
9.99 GB
or
Parquet export size (First 5GB per split) (?):
9.99 GB
with extra info on hover that shows "The Parquet export only contains the first 5GB of each split (uncompressed)"
otherwise it could be possible to have a 1TB dataset stored in a 1MB Parquet file and most of our jobs couldn't handle it
@lhoestq how is this possible?
Parquet export size (First 5GB per split) (?): 9.99 GB
with extra info on hover that shows "The Parquet export only contains the first 5GB of each split (uncompressed)"
i like this
otherwise it could be possible to have a 1TB dataset stored in a 1MB Parquet file and most of our jobs couldn't handle it
@lhoestq how is this possible?
I made up the numbers but if a dataset is made of 1TB of the same data, then a compression algorithm can theoretically compress the data to just two items (value, length) which could fit in <1MB.
Anyway, this was just a way to illustrate that the compressed size of the data can't tell you much about the real data size in general.
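A toy illustration of that effect (made-up scale, not the actual datasets-server logic): a highly repetitive column takes ~800 MB in memory but only a few kilobytes on disk once Parquet's dictionary/run-length encoding kicks in.

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# 100 million identical int64 values: ~800 MB uncompressed in memory
table = pa.table({"x": np.zeros(100_000_000, dtype=np.int64)})
pq.write_table(table, "constant.parquet")

print(f"in-memory (uncompressed): {table.nbytes / 1e6:.0f} MB")
print(f"on disk (compressed):     {os.path.getsize('constant.parquet') / 1e3:.0f} KB")

Scale that up and you get the "1TB dataset in a tiny Parquet file" scenario.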
Some datasets*, for example vivym/midjourney-messages and Open-Orca/OpenOrca, are not fully converted (or copied?) to the Parquet export. These two are originally both in Parquet format.
The original midjourney-messages Parquet files are about 8-9 GB in total, so the part of the data kept for datasets-server should be about 5 GB, but the viewer shows 1.56 GB:
The original OpenOrca is two Parquet files of about 4 GB in total, but the viewer shows 2.85 GB:
Am I misunderstanding something or is it a bug?
*idk, I assume there might be more of them, but I have these two examples in mind.
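One way to double-check the numbers shown in the viewer is to sum the file sizes reported by the dataset viewer's /parquet endpoint; note these are the compressed, on-disk sizes of the exported files, not the uncompressed 5GB-per-split cut-off. A sketch, with the response shape assumed from the public datasets-server API:

import requests

def parquet_export_size(dataset: str) -> int:
    """Sum the on-disk sizes of the files in the dataset's Parquet export."""
    resp = requests.get(
        "https://datasets-server.huggingface.co/parquet",
        params={"dataset": dataset},
        timeout=30,
    )
    resp.raise_for_status()
    return sum(f["size"] for f in resp.json()["parquet_files"])

for name in ["vivym/midjourney-messages", "Open-Orca/OpenOrca"]:
    print(name, f"{parquet_export_size(name) / 1e9:.2f} GB")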