Closed — LingxiaoShawn closed this issue 3 months ago
@LingxiaoShawn thanks for posting your question. Can you provide more details about how you did the conversion? Maybe post the script you use and I can help take a look.
Basically, if you see that the MDS dataset is 20x larger than the original jsonl dataset, something has gone wrong. MDS simply serializes the data into binary and applies compression (if you specify a compression mode, zstd for example). MDS does create some extra metadata, called index.json, which enables random access and elastic resumption, but that metadata only runs from a few KB to several MB.
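To put numbers on that: a back-of-the-envelope sketch (plain Python; the ~4-bytes-of-text-per-token average is an assumed figure for illustration, not a measured value) of what a correct conversion should cost for 2048-token int32 samples:

```python
# Rough size estimate for MDS-serialized token samples, as in this thread.
# int32 token ids serialize to 4 bytes each; the average of ~4 bytes of
# source text per token is an ASSUMPTION for illustration only.

tokens_per_sample = 2048
bytes_per_int32 = 4

sample_bytes = tokens_per_sample * bytes_per_int32
print(sample_bytes)  # 8192 bytes per sample, before any zstd compression

jsonl_bytes = 70 * 1024**3            # ~70 GB of raw jsonl text
avg_text_bytes_per_token = 4          # assumed average
est_tokens = jsonl_bytes / avg_text_bytes_per_token
est_mds_gb = est_tokens * bytes_per_int32 / 1024**3
print(est_mds_gb)  # ~70.0 GB: roughly 1x the input size, nowhere near 20x
```

So with int32 token ids, the serialized size should be in the same ballpark as the raw text; a 20x blow-up points at duplicated samples rather than format overhead.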
Hey @LingxiaoShawn, can you clarify which datatypes you are using for encoding the sample? As @XiaohanZhangCMU says, MDS is performing simple serialization.
@XiaohanZhangCMU @snarayan21 Thank you for your timely response. Sure, here is what I did. I used the `build_hf_dataset` function from here and the `generate_samples` function from here:

```python
concat_tokens = 2048
tokenizer_type = 'EleutherAI/gpt-neox-20b'
eos_text = '<|endoftext|>'
num_threads = 32
columns = {'tokens': 'ndarray:int32'}

tokenizer = AutoTokenizer.from_pretrained(tokenizer_type)
tokenizer.model_max_length = int(1e30)

dataset = build_hf_dataset(
    files,
    mode=ConcatMode.CONCAT_TOKENS,
    max_length=concat_tokens,
    eos_text=eos_text,
    tokenizer=tokenizer,
    num_threads=num_threads,
)
dataloader = DataLoader(dataset=dataset, sampler=None, batch_size=1024, num_workers=num_threads)
iterator = iter(generate_samples(dataloader))

with MDSWriter(columns=columns, out=out_path) as out:
    for sample in tqdm(iterator):
        out.write(sample)
```
where `files` is a list of paths. I tried the RedPajama-v1 dataset, so each path points to a jsonl file of that dataset.
I did find that the size of the transformed dataset is significantly larger. Do you have any clue about this?
I think I have figured out the reason: there is a bug behind ConcatTokensDataset, which does not support multiple workers — with an IterableDataset, every worker iterates the full dataset. The PyTorch forum discusses this issue: https://discuss.pytorch.org/t/iterable-pytorch-dataset-with-multiple-workers/135475
Not a bug here; it is mainly a problem in the llm-foundry code.
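To illustrate the failure mode from that PyTorch thread: a minimal simulation (plain Python, no torch; the function names are hypothetical) of how an unsharded IterableDataset multiplies its output under multiple DataLoader workers:

```python
# Simulation of the PyTorch IterableDataset multi-worker pitfall.
# With num_workers > 1, each worker process gets a full copy of an
# IterableDataset. If __iter__ does not shard by worker id, every
# worker yields the entire dataset, multiplying the written output.

def naive_iter(data, worker_id, num_workers):
    # No sharding: each worker yields everything.
    yield from data

def sharded_iter(data, worker_id, num_workers):
    # Proper sharding: worker i yields elements i, i+num_workers, ...
    yield from data[worker_id::num_workers]

data = list(range(10))
num_workers = 4

naive_total = [x for w in range(num_workers)
               for x in naive_iter(data, w, num_workers)]
sharded_total = [x for w in range(num_workers)
                 for x in sharded_iter(data, w, num_workers)]

print(len(naive_total))               # 40: dataset duplicated num_workers times
print(sorted(sharded_total) == data)  # True: each sample yielded exactly once
```

With 32 DataLoader workers, the naive pattern writes each sample up to 32 times, which is consistent with the ~20x blow-up observed here. In real code, the per-worker split is done inside `__iter__` using `torch.utils.data.get_worker_info()`.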
Hi there,
Thank you for providing this open-source library for preprocessing large-scale datasets. I have a question regarding the storage size of the MDS format. Specifically, I have a dataset in jsonl format that takes about 70 GB, and after converting it to MDS format with shard size = 64 MB, the total MDS dataset takes about 1.5 TB, roughly 20 times larger. Is this a bug, or is it expected behavior? Also, do you think it is possible to make the MDS size closer to that of the raw dataset?
Thanks!