huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

First 5Gb are not 5Gb 🤔 #2300

Open polinaeterna opened 8 months ago

polinaeterna commented 8 months ago

Some datasets*, for example vivym/midjourney-messages and Open-Orca/OpenOrca, are not fully converted (or copied?) to the Parquet export. Both of these are originally in Parquet format.

Am I misunderstanding something or is it a bug?

*I don't know, I assume there might be more of them, but these are the two examples I have in mind.

severo commented 8 months ago

Indeed, it sounds like a bug. For example, https://huggingface.co/datasets/vivym/midjourney-messages/tree/refs%2Fconvert%2Fparquet/default/partial-train shows only 10 Parquet files of ~160 MB each. cc @lhoestq
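
A quick way to double-check this, as a sketch (assuming a recent `huggingface_hub` that provides `list_repo_tree`; the repo and revision are taken from the URL above):

```python
# List the Parquet files in the refs/convert/parquet branch and sum their
# on-disk (compressed) sizes.
from huggingface_hub import HfApi

api = HfApi()
entries = api.list_repo_tree(
    "vivym/midjourney-messages",
    path_in_repo="default/partial-train",
    repo_type="dataset",
    revision="refs/convert/parquet",
)

total = 0
for entry in entries:
    # Only file entries carry a size; folders do not.
    size = getattr(entry, "size", None) or 0
    print(f"{entry.path}: {size / 1e6:.0f} MB")
    total += size
print(f"Total compressed size: {total / 1e9:.2f} GB")
```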

severo commented 8 months ago

Or maybe it's 5GB of decompressed data?

lhoestq commented 8 months ago

It's 5GB of uncompressed data indeed
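
In other words, the limit is measured on the uncompressed (in-memory Arrow) size of the rows, not on the compressed Parquet size on disk. A minimal sketch of the difference using pyarrow on a hypothetical local shard (not the dataset-viewer's actual accounting code):

```python
import os
import pyarrow.parquet as pq

path = "0000.parquet"                      # hypothetical shard downloaded from the export
on_disk = os.path.getsize(path)            # compressed Parquet file size
in_memory = pq.read_table(path).nbytes     # uncompressed Arrow size of the same rows

print(f"On disk:   {on_disk / 1e6:.1f} MB (compressed)")
print(f"In memory: {in_memory / 1e6:.1f} MB (uncompressed)")
```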

polinaeterna commented 8 months ago

@lhoestq is this what we want?

lhoestq commented 8 months ago

I think so, otherwise it could be possible to have a 1TB dataset stored in a 1MB Parquet file and most of our jobs couldn't handle it ^^

severo commented 8 months ago

Good point. So it's an issue for moon-landing more than for datasets-server. But... I'm not sure we want to increase the message length AGAIN

https://github.com/huggingface/moon-landing/pull/8593#issuecomment-1883553457

Currently, on https://huggingface.co/datasets/timm/imagenet-12k-wds:

Size of the auto-converted Parquet files (First 5GB per split):
9.99 GB

where "First 5GB" must be understood as the first 5GB of uncompressed data.

What should we show?
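
For context, the number shown on the page presumably comes from the dataset-viewer's public /size endpoint; a minimal sketch to fetch it (the response is printed as-is rather than assuming exact field names):

```python
import requests

# /size is a documented dataset-viewer endpoint returning per-dataset and
# per-split byte counts for the Parquet export.
resp = requests.get(
    "https://datasets-server.huggingface.co/size",
    params={"dataset": "timm/imagenet-12k-wds"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the byte counts behind the number displayed on the page
```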

lhoestq commented 8 months ago

Maybe

Size of the auto-converted Parquet files (First 5GB per split) (?):
9.99 GB

or

Parquet export size (First 5GB per split) (?):
9.99 GB

with extra info on hover that shows "The Parquet export only contains the first 5GB of each split (uncompressed)"

polinaeterna commented 8 months ago

otherwise it could be possible to have a 1TB dataset stored in a 1MB Parquet file and most of our jobs couldn't handle it

@lhoestq how is this possible?

polinaeterna commented 8 months ago
Parquet export size (First 5GB per split) (?):
9.99 GB

with extra info on hover that shows "The Parquet export only contains the first 5GB of each split (uncompressed)"

i like this

lhoestq commented 8 months ago

otherwise it could be possible to have a 1TB dataset stored in a 1MB Parquet file and most of our jobs couldn't handle it

@lhoestq how is this possible?

I made up the numbers, but if a dataset is made of 1TB of the same data, then a compression algorithm can theoretically compress the data to just two items (value, length), which could fit in <1MB.

Anyway, this was just a way to illustrate that the compressed size of the data can't give much information about the real data size in general.
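
A toy illustration of that point, as a sketch with made-up (much smaller) numbers, assuming pyarrow: a column of identical values compresses to a tiny Parquet file even though it is large in memory.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# ~10M copies of the same short string: well over 100 MB in memory.
table = pa.table({"text": ["same value"] * 10_000_000})
pq.write_table(table, "repeated.parquet")

print(f"In memory (Arrow): {table.nbytes / 1e6:.1f} MB")
print(f"Parquet on disk:   {os.path.getsize('repeated.parquet') / 1e6:.2f} MB")
# Dictionary + run-length encoding collapse the repeats, so the file on disk
# is tiny compared to the uncompressed size.
```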