togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Update README.md #85

Closed: mauriceweber closed this 7 months ago

mauriceweber commented 7 months ago

The title of the deduplication section in the README referred only to fuzzy deduplication, but it should cover both fuzzy and exact deduplication.