sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

easy HF dataset doremi? #10

Open brando90 opened 1 year ago

brando90 commented 1 year ago

Is there an HF-compatible dataset I can use?

    dataset = load_dataset("c4", "en", streaming=True, split="train").with_format("torch")
    remove_columns = ["text", "timestamp", "url"]

but instead have

    dataset = load_dataset("doremi", "en", streaming=True, split="train").with_format("torch")
    remove_columns = ["text", "timestamp", "url"]

thus automatically using the doremi weights?

brando90 commented 1 year ago

https://huggingface.co/papers/2305.10429

sangmichaelxie commented 11 months ago

We don't currently have such a dataset on Hugging Face, but we will let you know if we decide to add one! One issue is that the weights are at the chunk level: they set the sampling probability for the tokenized examples (fixed-length chunks), not for the raw documents.
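
To illustrate the chunk-level point, here is a minimal sketch of what weighted sampling over tokenized chunks could look like. It is not the code from this repo: the helper names (`chunk_tokens`, `sample_chunks`), the toy token streams, and the weight values are all hypothetical placeholders, and it assumes each domain's documents have already been tokenized and concatenated into one token stream.

```python
import random
from collections import Counter


def chunk_tokens(token_ids, chunk_len):
    """Split a domain's concatenated token stream into fixed-length chunks.

    Any trailing tokens that do not fill a whole chunk are dropped.
    """
    n_full = len(token_ids) // chunk_len
    return [token_ids[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


def sample_chunks(chunks_by_domain, domain_weights, num_samples, seed=0):
    """Draw chunks by picking a domain with probability proportional to its
    weight, then picking one chunk uniformly at random within that domain."""
    rng = random.Random(seed)
    domains = list(chunks_by_domain)
    weights = [domain_weights[d] for d in domains]
    batch = []
    for _ in range(num_samples):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        batch.append((domain, rng.choice(chunks_by_domain[domain])))
    return batch


# Toy stand-ins for tokenized data (hypothetical, not real tokenizer output).
token_streams = {
    "web": list(range(0, 64)),        # 64 tokens -> 8 chunks of length 8
    "code": list(range(1000, 1032)),  # 32 tokens -> 4 chunks of length 8
}
chunks_by_domain = {d: chunk_tokens(t, chunk_len=8) for d, t in token_streams.items()}

# Placeholder weights for illustration only, not values from the paper.
domain_weights = {"web": 0.25, "code": 0.75}

batch = sample_chunks(chunks_by_domain, domain_weights, num_samples=1000)
counts = Counter(domain for domain, _ in batch)
print(counts)
```

Because the weights apply to chunks, a long raw document contributes many chunks and a short one contributes few, so applying the same weights to raw documents in a dataset such as `c4` would give a different token-level mixture. If the chunks were stored as one Hugging Face dataset per domain, `datasets.interleave_datasets` with its `probabilities` argument might be one way to get similar behavior, though that is an assumption about a possible setup and not something this repo provides.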