togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

what's the specific meaning of dsir? #99

Open BBetteroff opened 5 months ago

BBetteroff commented 5 months ago

I am trying to reproduce this repo on macOS, and I don't have an AWS account. Could I get your help? I'd appreciate it.

Screenshot 2024-01-16 14 20 30

mauriceweber commented 5 months ago

Hi @BBetteroff, DSIR stands for "Data Selection with Importance Resampling" (see paper here). It is used to compute importance weights for each sample with respect to different target domains.

The screenshot you posted is from the prep_artifacts.py script. The flag --dsir_num_samples is the number of samples drawn from the target domain. The flag --dsir_feature_dim is the dimension of the feature vector used to fit the bag-of-ngram model from which the DSIR importance weights are computed. If you look at the default config, you can see that the defaults are 500k samples and a dimension of 10k: https://github.com/togethercomputer/RedPajama-Data/blob/bb594b01a92b7e6fcf70cf3b6659851ce17edcce/configs/rp_v2.0.conf#L31-L33
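
To make the role of the two flags more concrete, here is a rough sketch of how a bag-of-ngram importance weight can be computed. This is not the code from prep_artifacts.py: the function names are made up for illustration, and the whitespace tokenization, the crc32 hashing, the use of unigrams plus bigrams, and the add-one smoothing are all assumptions. Here `feature_dim` plays the role of --dsir_feature_dim, and the number of documents in `target_docs` plays the role of --dsir_num_samples.

```python
import math
import zlib
from collections import Counter


def hashed_ngram_counts(text, feature_dim):
    """Hash unigrams and bigrams of `text` into `feature_dim` buckets (assumed featurization)."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % feature_dim for g in ngrams)


def fit_bag_of_ngrams(docs, feature_dim, smoothing=1.0):
    """Return per-bucket log-probabilities estimated from `docs` with additive smoothing."""
    totals = [smoothing] * feature_dim
    for doc in docs:
        for bucket, count in hashed_ngram_counts(doc, feature_dim).items():
            totals[bucket] += count
    norm = sum(totals)
    return [math.log(t / norm) for t in totals]


def log_importance_weight(text, target_logp, raw_logp):
    """Log-likelihood ratio of `text` under the target model vs. the raw-data model."""
    counts = hashed_ngram_counts(text, len(target_logp))
    return sum(c * (target_logp[b] - raw_logp[b]) for b, c in counts.items())


# Toy data: hypothetical target-domain samples and raw crawl samples.
target_docs = ["the cat sat on the mat", "the cat ate the fish"]
raw_docs = ["buy cheap pills now", "click here to win now"]

feature_dim = 10_000  # same order of magnitude as the default mentioned above
target_logp = fit_bag_of_ngrams(target_docs, feature_dim)
raw_logp = fit_bag_of_ngrams(raw_docs, feature_dim)

w_similar = log_importance_weight("the cat sat", target_logp, raw_logp)
w_dissimilar = log_importance_weight("buy cheap pills", target_logp, raw_logp)
```

A document that resembles the target samples should receive a higher log weight than one that resembles the raw data. A larger feature dimension reduces hash collisions at the cost of memory, and more target samples give a less noisy estimate of the target distribution.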

BBetteroff commented 5 months ago

Thanks! I'll keep working on reproducing this repo and will follow up here.

BBetteroff commented 5 months ago

What's the content of a listing file? Can you show me an example? And what is it used for?

mauriceweber commented 5 months ago

The listing files contain the ids of the inputs which, when concatenated with the base uri, point to the location of the data. For example:

2023-06/0000/de_head.json.gz
2023-06/0000/de_middle.json.gz
2023-06/0000/de_tail.json.gz
2023-06/0000/en_head.json.gz
2023-06/0000/en_middle.json.gz
2023-06/0000/en_tail.json.gz
2023-06/0000/es_head.json.gz

If your data is stored locally under, e.g., /data/documents/2023-06/0000/de_middle.json.gz, you would use file:///data/documents/ as the base uri.
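
As a small illustration of that concatenation, the sketch below turns listing entries into full locations. The helper name `resolve_listing` is hypothetical and not part of the repo; it only shows the string join described above.

```python
def resolve_listing(base_uri, listing_lines):
    """Join each non-empty listing entry onto the base uri (illustrative helper)."""
    base = base_uri.rstrip("/") + "/"
    return [base + line.strip() for line in listing_lines if line.strip()]


# Entries as they would appear in a listing file, one id per line.
listing = [
    "2023-06/0000/de_head.json.gz\n",
    "2023-06/0000/de_middle.json.gz\n",
]

uris = resolve_listing("file:///data/documents/", listing)
for uri in uris:
    print(uri)
```

With the local base uri from the example, the second entry resolves to file:///data/documents/2023-06/0000/de_middle.json.gz.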