togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Where can I get the file "domain_to_category_id.json"? #87

Closed suolyer closed 7 months ago

suolyer commented 7 months ago

Hello, where can I get the file "domain_to_category_id.json"? It is loaded by this function:

import json
from pathlib import Path
from typing import Dict

def load_bad_urls_index(bad_urls_dir: Path) -> Dict[str, int]:
    with open(bad_urls_dir / "domain_to_category_id.json", "r") as f:
        domain_to_category_id = json.load(f)
    return domain_to_category_id

https://github.com/togethercomputer/RedPajama-Data/blob/26c5417e2fc3391ff2e8b19ffcf5521e9ca8def8/app/src/core/quality_signals/utils/content.py#L8
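
For reference, here is a minimal sketch of what the loader seems to expect, going only by the `Dict[str, int]` annotation: a flat JSON object mapping domain strings to integer category IDs. The helper `write_placeholder_index`, the sample domains, and the IDs are all hypothetical and only serve to exercise the function locally. This is not the real file, and it does not show where the real file comes from.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_bad_urls_index(bad_urls_dir: Path) -> Dict[str, int]:
    # Same logic as the function quoted above.
    with open(bad_urls_dir / "domain_to_category_id.json", "r") as f:
        return json.load(f)


def write_placeholder_index(bad_urls_dir: Path) -> Path:
    # Hypothetical helper: writes a stand-in file in the assumed format
    # (domain -> integer category ID). Domains and IDs are made up.
    sample = {"example-bad-site.test": 0, "another-example.test": 1}
    bad_urls_dir.mkdir(parents=True, exist_ok=True)
    out_path = bad_urls_dir / "domain_to_category_id.json"
    out_path.write_text(json.dumps(sample, indent=2))
    return out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bad_urls_dir = Path(tmp) / "bad_urls"
        write_placeholder_index(bad_urls_dir)
        index = load_bad_urls_index(bad_urls_dir)
        print(index)
```

With a placeholder like this the function runs without a `FileNotFoundError`, but any signals computed from it would be meaningless until the real mapping is in place.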