mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Too much disk usage after transforming to MDS format #720

Closed LingxiaoShawn closed 3 months ago

LingxiaoShawn commented 3 months ago

Hi there,

Thank you for providing this open-source library for preprocessing large-scale datasets. I have a question about the storage used by the MDS format. Specifically, I have a dataset in JSONL format that takes about 70 GB; after converting it to MDS format with a shard size of 64 MB, the resulting MDS dataset takes about 1.5 TB, roughly 20 times larger. I would like to know whether this is a bug or expected behavior. Also, do you think it is possible to make the MDS dataset comparable in size to the raw data?

Thanks!

XiaohanZhangCMU commented 3 months ago

@LingxiaoShawn thanks for posting your question. Can you provide more details about how you did the conversion? Maybe post the script you use and I can help take a look.

Basically, if you are seeing MDS come out 20x larger than the original JSONL dataset, something has gone wrong. MDS simply serializes the data into binary and applies compression if you specify a compression mode (zstd, for example). MDS does create some extra metadata, an index.json file, which enables random access and elastic resumption, but that metadata only runs from a few KB to several MB.
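
To illustrate, here is a minimal sketch of writing compressed shards with MDSWriter; the column spec, output path, and sample below are placeholders, not taken from your script:

    from streaming import MDSWriter

    # Illustrative column spec and output path (not from the issue).
    columns = {'text': 'str'}

    # compression and size_limit are optional MDSWriter arguments; with zstd the
    # shards are typically comparable to or smaller than the raw text.
    with MDSWriter(out='./mds_out', columns=columns,
                   compression='zstd', size_limit=64 * 1024 * 1024) as out:
        out.write({'text': 'hello world'})

Without compression, the MDS dataset should still be roughly the serialized size of the samples plus the small index.json.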

snarayan21 commented 3 months ago

Hey @LingxiaoShawn, can you clarify which datatypes you are using to encode the samples? As @XiaohanZhangCMU says, MDS performs simple serialization.
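
As a rough illustration of why the dtype matters (a sketch, not your script): token IDs for a ~50k vocabulary fit in 16 bits, so, assuming your streaming version supports the ndarray:<dtype> encodings, a narrower dtype halves the bytes per token compared to int32:

    import numpy as np

    # 4 bytes per token vs. 2 bytes per token (valid while the vocab size < 65536).
    columns_int32 = {'tokens': 'ndarray:int32'}
    columns_uint16 = {'tokens': 'ndarray:uint16'}

    # A 2048-token sample is ~8 KB as int32 but ~4 KB as uint16.
    sample = {'tokens': np.arange(2048, dtype=np.uint16)}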

LingxiaoShawn commented 3 months ago

@XiaohanZhangCMU @snarayan21 Thank you for your timely response. Sure here is what I did:

  1. I used the build_hf_dataset function from here and the generate_samples function from here.
  2. I then ran the following code to convert the format:
    
    from torch.utils.data import DataLoader
    from tqdm import tqdm
    from transformers import AutoTokenizer
    from streaming import MDSWriter
    # build_hf_dataset, ConcatMode, and generate_samples come from llm-foundry (linked above)

    concat_tokens = 2048
    tokenizer_type = 'EleutherAI/gpt-neox-20b'
    eos_text = '<|endoftext|>'
    num_threads = 32
    columns = {'tokens': 'ndarray:int32'}

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_type)
    tokenizer.model_max_length = int(1e30)
    dataset = build_hf_dataset(files, mode=ConcatMode.CONCAT_TOKENS, max_length=concat_tokens,
                               eos_text=eos_text, tokenizer=tokenizer, num_threads=num_threads)
    dataloader = DataLoader(dataset=dataset, sampler=None, batch_size=1024, num_workers=num_threads)
    iterator = iter(generate_samples(dataloader))

    with MDSWriter(columns=columns, out=out_path) as out:
        for sample in tqdm(iterator):
            out.write(sample)


Here, files is a list of paths. I have tried the RedPajama-v1 dataset, so each path points to a JSONL file of that dataset.

I did find that the size of the transformed dataset is significantly larger. Do you have any clue about what is causing this?

LingxiaoShawn commented 3 months ago

I think I have figured out the reason: there is a bug in the code around ConcatTokensDataset, which does not handle multiple DataLoader workers. For an IterableDataset, each worker iterates over the entire dataset unless the samples are explicitly split per worker; the PyTorch docs and forums mention this issue: https://discuss.pytorch.org/t/iterable-pytorch-dataset-with-multiple-workers/135475
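
For anyone hitting this later, a minimal sketch of the underlying behavior (not llm-foundry code): with an IterableDataset, every DataLoader worker runs the full iterator unless __iter__ explicitly shards by worker, so the written output is duplicated roughly num_workers times.

    from torch.utils.data import DataLoader, IterableDataset, get_worker_info

    class ShardedIterable(IterableDataset):
        """Toy IterableDataset that splits its items across DataLoader workers."""

        def __init__(self, items):
            self.items = items

        def __iter__(self):
            info = get_worker_info()
            if info is None:          # single-process data loading
                start, step = 0, 1
            else:                     # each worker takes every num_workers-th item
                start, step = info.id, info.num_workers
            for i in range(start, len(self.items), step):
                yield self.items[i]

    if __name__ == '__main__':
        # With the get_worker_info() split, 4 workers together yield each item once;
        # without it, every item would be yielded 4 times.
        loader = DataLoader(ShardedIterable(list(range(8))), num_workers=4, batch_size=None)
        print(sorted(int(x) for x in loader))  # [0, 1, 2, 3, 4, 5, 6, 7]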

LingxiaoShawn commented 3 months ago

Not a bug here; it is mainly a problem with the llm-foundry code.