pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License
2.29k stars 170 forks

[Feature] Add fineweb dataset #309

Closed viai957 closed 5 months ago

viai957 commented 5 months ago

Currently this only supports the c4 mini and C4 datasets. I would love to see fineweb dataset support.

XinDongol commented 4 months ago

Just use load_from_disk and pass the path to your fineweb copy. It works fine for me.
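For readers following this suggestion, here is a minimal sketch of what it might look like, assuming load_from_disk refers to the function of that name in the Hugging Face datasets library. That function reads a directory previously written with save_to_disk, so the fineweb copy needs to have been saved that way. The helper names load_local_fineweb and iter_texts are hypothetical, and the "text" column name is an assumption about the fineweb schema. How the loaded dataset gets wired into torchtitan's own data loader is not covered here.

```python
from typing import Any, Iterable, Iterator


def load_local_fineweb(path: str) -> Any:
    """Load a dataset directory that was written with Dataset.save_to_disk.

    Hypothetical helper: `path` is wherever your fineweb copy lives.
    """
    # Imported lazily so iter_texts below works without `datasets` installed.
    from datasets import load_from_disk

    return load_from_disk(path)


def iter_texts(samples: Iterable[dict], column: str = "text") -> Iterator[str]:
    """Yield the raw text of each sample.

    Assumes each sample is a dict-like row and that the text lives in
    `column` ("text" is an assumed field name). If the saved object is a
    DatasetDict, pass a single split (e.g. dataset["train"]) instead.
    """
    for sample in samples:
        yield sample[column]
```

Calling load_local_fineweb on the saved directory and passing the result to iter_texts gives a stream of raw documents that can then be tokenized in the same way as the C4 samples.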