togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Step 2) "Invalid option: ---input_base_uri" #107

Open timpal0l opened 8 months ago

timpal0l commented 8 months ago
bash scripts/apptainer_run_quality_signals.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_base_uri "file:///path/to/output/data/root" \
  --max_docs -1

Invalid option: ---input_base_uri
Usage: apptainer_run_quality_signals.sh [ -c | --config ] [ -d | --dump_id ]

mauriceweber commented 8 months ago

Good catch, thanks for reporting! The three flags --input_base_uri, --output_base_uri and --max_docs are not accepted on the command line; their values are set in the config file instead: https://github.com/togethercomputer/RedPajama-Data/blob/bb594b01a92b7e6fcf70cf3b6659851ce17edcce/configs/rp_v2.0.conf#L4-L6

You can simply omit them from the call to the apptainer script.
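
With those three flags left out, the call from the report would reduce to something like the sketch below. It assumes you run it from the root of a RedPajama-Data checkout and have already adjusted the input, output, and max-docs settings in configs/rp_v2.0.conf. The variable names, the file-existence guard, and the dry-run echo are illustrative additions, not part of the repository.

```shell
#!/bin/sh
# Only the two flags listed in the usage message are passed on the command line.
# Input/output URIs and the document limit are read from the config file.
script="scripts/apptainer_run_quality_signals.sh"
config="configs/rp_v2.0.conf"
dump_id="2022-49"

if [ -f "$script" ]; then
  bash "$script" --config "$config" --dump_id "$dump_id"
else
  # Illustrative dry run when not inside a RedPajama-Data checkout.
  echo "would run: bash $script --config $config --dump_id $dump_id"
fi
```

If the script still reports an invalid option, compare the flags you pass against the usage line it prints.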