nomic-ai / contrastors

Train Models Contrastively in Pytorch
Apache License 2.0

Filtering Data For Contrastive Pretraining #43

Closed: daegonYu closed this issue 3 weeks ago

daegonYu commented 3 weeks ago

Hello. The command given in the "Filtering Data For Contrastive Pretraining" section of https://github.com/nomic-ai/contrastors/tree/main/scripts/text is:

torchrun --nproc-per-node=<num_gpus> --dataset=<path_to_dataset_files_or_directory> --output_dir=<path_where_to_save_filtered_dataset> --query_key=<query_key_of_jsonl_file> --document_key=<document_of_key_jsonl_file>

The command does not name a script. Which Python file is it supposed to run?

zanussbaum commented 3 weeks ago

Ah, thanks for the heads-up. It should be index_filtering.py. I've updated the README.
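
For reference, here is a minimal sketch of what the corrected command would presumably look like once the script name is filled in. torchrun expects the script path after its own options and before the script's arguments, so index_filtering.py goes right after --nproc-per-node. The flags are the ones quoted in the question. Every value below (GPU count, paths, key names) is a hypothetical placeholder, and the sketch assumes it is run from the directory that contains index_filtering.py. It only prints the command rather than launching it.

```shell
#!/bin/sh
# Dry run: build the corrected command as a string and print it.
# All values are hypothetical placeholders; substitute your own.
NUM_GPUS=8                        # assumed number of GPUs on this machine
DATASET="data/pairs"              # assumed path to the jsonl files or directory
OUTPUT_DIR="data/pairs_filtered"  # assumed output location
QUERY_KEY="query"                 # assumed name of the query field in the jsonl
DOCUMENT_KEY="document"           # assumed name of the document field in the jsonl

# torchrun options come first, then the script, then the script's own arguments.
CMD="torchrun --nproc-per-node=${NUM_GPUS} index_filtering.py \
--dataset=${DATASET} \
--output_dir=${OUTPUT_DIR} \
--query_key=${QUERY_KEY} \
--document_key=${DOCUMENT_KEY}"

# Print for inspection; replace this line with: eval "$CMD" to actually launch.
echo "$CMD"
```

Printing first makes it easy to check that the script name and every flag landed in the right place before committing GPU time to the run.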