togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

cc_net processing local wet file #78

Closed · hicotton02 closed this 8 months ago

hicotton02 commented 8 months ago

I'm creating this issue because I'm not sure the one below is visible.

In the instructions for cc_net, after we download the wikipedia_warc file (the 300k to 38M files) and convert it to a WET file, there is a step to "run the cc_net pipeline on the wet file", but there isn't a clear path to do that. @mauriceweber, you stated that we need to modify line 38 in process_wet_file.py, and I have been attempting to figure out what to put there, with no success yet. I am still trying to work it out, but if you have time, could you post some quick instructions on how to pass a local WET file into cc_net?
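For anyone landing here, a minimal sketch of one way to parse a local WET file directly, assuming the vendored cc_net matches upstream facebookresearch/cc_net (where process_wet_file.py defines WET_URL_ROOT near line 38, jsonql.open_read transparently decompresses .gz files, and group_by_docs yields one dict per document; verify these names against your checkout):

```python
# A sketch, not a tested recipe: parse a local WET file with cc_net's
# own helpers, bypassing the WET_URL_ROOT download logic entirely.
# Assumes the upstream facebookresearch/cc_net layout; check that
# jsonql.open_read and process_wet_file.group_by_docs exist in this
# repo's vendored copy before relying on this.
from pathlib import Path

from cc_net import jsonql, process_wet_file

wet_path = Path("/nfs/slow/RedPajama-Data/data_prep/cc/wet/warc_wikipedia_file.warc.wet.gz")

for doc in process_wet_file.group_by_docs(jsonql.open_read(wet_path)):
    # Each doc should carry keys like "url", "title" and "raw_content".
    print(doc["url"])
```

This sidesteps the download logic entirely, which is usually what you want when the WET file is already on disk.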

Personally, I have all the data downloaded and deduped for all the datasets, ready to create a model, except for this one. I really appreciate you guys' efforts in making this available so that we can learn!

hicotton02 commented 8 months ago

Here are some things I tried:

```python
f = jsonql.open_remote_file(
    "https://1drv.ms/u/s!AnGdIsjlgdZcmuRf4ka7Gujq11c0wg?e=dcTFQZ",
    cache=Path("/nfs/slow/RedPajama-Data/data_prep/cc/wet/warc_wikipedia_file.warc.wet.gz"),
)

f = jsonql.open_remote_file(
    None,
    cache=Path("/nfs/slow/RedPajama-Data/data_prep/cc/wet/warc_wikipedia_file.warc.wet.gz"),
)
```

I was working under the assumption that if we pass in the cache path, we can trick cc_net into assuming the file has already been downloaded. I also uploaded the wet.gz file to OneDrive and created an anonymous link, as you can see above, to check whether it needed to download something, but that doesn't seem to work either.
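If you would rather keep the rest of the pipeline intact, another sketch along the same lines is to subclass CCSegmentsReader so that segments resolve to local paths instead of WET_URL_ROOT downloads. The class name, constructor signature, and iteration behavior below are taken from upstream cc_net and are assumptions about this repo's vendored copy:

```python
# A sketch under the same assumptions: override open_segment() so the
# reader loads segments from a local directory instead of fetching
# WET_URL_ROOT/<segment> over HTTP. Verify CCSegmentsReader's API in
# your checkout before using this.
from pathlib import Path

from cc_net import jsonql
from cc_net.process_wet_file import CCSegmentsReader

LOCAL_WET_DIR = Path("/nfs/slow/RedPajama-Data/data_prep/cc/wet")  # assumed layout

class LocalSegmentsReader(CCSegmentsReader):
    def open_segment(self, segment: str):
        # Resolve the segment name against the local directory rather
        # than downloading it.
        return jsonql.open_read(LOCAL_WET_DIR / Path(segment).name)

# Iterating the reader should yield the same per-document dicts the
# downstream pipeline consumes.
for doc in LocalSegmentsReader(["warc_wikipedia_file.warc.wet.gz"]):
    print(doc["url"])
```

With that in place, the cache trick becomes unnecessary, since the reader never tries to fetch anything over HTTP.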

BTW, for anyone who is looking, the OneDrive link above will work if you want to skip past this part.

oh.... you guys updated to V2....