Open BBetteroff opened 10 months ago
Hi @BBetteroff, DSIR stands for "Data Selection with Importance Resampling" (see paper here) and is used to compute importance weights for each sample with respect to different target domains.
The screenshot you posted is from the prep_artifacts.py script. The flag --dsir_num_samples corresponds to the number of samples you use from the target domain. The flag --dsir_feature_dim corresponds to the dimension of the feature vector used to fit the bag-of-ngrams model from which the DSIR importance weights are computed. If you look into the default config, you can see that the default values are 500k samples and a feature dimension of 10k:
https://github.com/togethercomputer/RedPajama-Data/blob/bb594b01a92b7e6fcf70cf3b6659851ce17edcce/configs/rp_v2.0.conf#L31-L33
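To make the role of the feature dimension more concrete, here is a minimal sketch of how importance weights can be computed from hashed n-gram counts. This is not the code from prep_artifacts.py: the function names, the MD5-based hashing, the use of word unigrams and bigrams, and the add-one smoothing are all assumptions chosen for illustration. The feature dimension is the number of hash buckets, and the weight of a document is the log-likelihood ratio between a bag-of-ngrams model fit on target-domain samples and one fit on source samples.

```python
import hashlib
import math


def hashed_ngram_counts(text, feature_dim, n_max=2):
    """Count word n-grams (n = 1..n_max) hashed into `feature_dim` buckets."""
    words = text.lower().split()
    counts = [0] * feature_dim
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            ngram = " ".join(words[i:i + n])
            digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
            counts[int(digest, 16) % feature_dim] += 1
    return counts


def fit_bucket_log_probs(docs, feature_dim, smoothing=1.0):
    """Fit a smoothed bag-of-ngrams model: one log-probability per bucket."""
    totals = [smoothing] * feature_dim
    for doc in docs:
        for j, c in enumerate(hashed_ngram_counts(doc, feature_dim)):
            totals[j] += c
    z = sum(totals)
    return [math.log(t / z) for t in totals]


def log_importance_weight(text, target_log_probs, source_log_probs):
    """Log-likelihood ratio log p_target(x) - log p_source(x) for one document."""
    counts = hashed_ngram_counts(text, len(target_log_probs))
    return sum(
        c * (t - s)
        for c, t, s in zip(counts, target_log_probs, source_log_probs)
    )


if __name__ == "__main__":
    feature_dim = 10_000  # same order of magnitude as the default config
    target_lp = fit_bucket_log_probs(["the theorem proves the lemma"], feature_dim)
    source_lp = fit_bucket_log_probs(["buy cheap watches now"], feature_dim)
    for doc in ["the theorem proves the lemma", "buy cheap watches now"]:
        print(doc, log_importance_weight(doc, target_lp, source_lp))
```

In this sketch, a larger feature dimension means fewer hash collisions between unrelated n-grams, and more target samples give a better estimate of the target-domain bucket probabilities.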
Thanks! I'll keep working on reproducing this repo and will stay in touch.
What is the content of the listing file? Can you show me an example, and what is it used for?
The listing files contain the IDs of the inputs which, when concatenated with the base URI, point to the location of the data. For example:
2023-06/0000/de_head.json.gz
2023-06/0000/de_middle.json.gz
2023-06/0000/de_tail.json.gz
2023-06/0000/en_head.json.gz
2023-06/0000/en_middle.json.gz
2023-06/0000/en_tail.json.gz
2023-06/0000/es_head.json.gz
If your data is stored locally under, e.g., /data/documents/2023-06/0000/de_middle.json.gz, you would use file:///data/documents/ as the base URI.
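As a rough illustration of how a listing and a base URI fit together, the sketch below joins each listing entry onto the base URI by plain string concatenation, as described above. The helper name resolve_listing is hypothetical and not part of the repo, and the actual pipeline may resolve paths differently.

```python
from urllib.parse import urlparse


def resolve_listing(base_uri, listing_lines):
    """Prefix every non-empty listing entry with the base URI."""
    base = base_uri.rstrip("/") + "/"
    return [base + line.strip() for line in listing_lines if line.strip()]


if __name__ == "__main__":
    listing = [
        "2023-06/0000/de_head.json.gz",
        "2023-06/0000/de_middle.json.gz",
        "2023-06/0000/de_tail.json.gz",
    ]
    for uri in resolve_listing("file:///data/documents/", listing):
        # For file:// URIs, the path component is the local file path.
        print(uri, "->", urlparse(uri).path)
```

With a file:// base URI everything stays on the local disk, so a listing like this should not require any cloud credentials.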
I am trying to reproduce this repo on macOS, and I don't have an AWS account. Could I get your help? I'd appreciate it.