togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Regarding deduplication #79

Open kimcando opened 8 months ago

kimcando commented 8 months ago

Hey,

thank you in advance for your great work and for sharing the data :) I read the README and the Hugging Face details, and it was unclear to me whether fuzzy deduplication has actually been applied to this dataset. I understand that:

1) the Bloomfilter, which is EXACT MATCH, seems to be clearly applied (Hugging Face data creation part: "Finally, the documents were deduplicated based on the text, using a Bloomfilter.");

2) the metadata provides several threshold-based hash fingerprints, and the article says anyone can perform fuzzy deduplication. The unclear part is that you seem to have applied fuzzy deduplication when training your model, but this shared dataset is the version from before fuzzy deduplication was applied.

Therefore, my question is: is the provided dataset the one to which fuzzy deduplication has also been applied? If so, could you please share how many cores you used (and, if in a distributed environment, how many instances and of which type), and how long it took?

Cheeeers!!

ManuelFay commented 8 months ago

+1 - given the fuzzy deduplication hashes, is there a simple/suggested way to cluster and sample them?

Thanks for the great work!

mauriceweber commented 7 months ago

Hi @kimcando and @ManuelFay and thanks for your questions!

> the Bloomfilter, which is EXACT MATCH, seems to be clearly applied (Hugging Face data creation part: "Finally, the documents were deduplicated based on the text, using a Bloomfilter.")

Yes, we ran the entire dataset through a Bloomfilter for exact deduplication and published the duplicate ids as separate files (mirroring the dataset structure). It is important to note that the duplicates were deliberately kept in the dataset, so that everyone can experiment with and study duplication in the training data.
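
To illustrate what hash-based exact deduplication does in practice, here is a minimal sketch. A plain set of SHA-256 digests stands in for the actual Bloomfilter purely for clarity (a Bloom filter gives the same behaviour with far less memory, at the price of a small false-positive rate), and the toy documents and id handling are illustrative, not the published implementation:

import hashlib

def exact_dedup(docs):
    """Split (doc_id, text) pairs into unique documents and exact duplicates."""
    seen = set()
    kept, duplicate_ids = [], []
    for doc_id, text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            duplicate_ids.append(doc_id)  # analogous to the published duplicate-id files
        else:
            seen.add(digest)
            kept.append((doc_id, text))
    return kept, duplicate_ids

docs = [("a", "hello world"), ("b", "hello world"), ("c", "something else")]
kept, dups = exact_dedup(docs)
print(dups)  # -> ['b']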

> the metadata provides several threshold-based hash fingerprints, and the article says anyone can perform fuzzy deduplication. The unclear part is that you seem to have applied fuzzy deduplication when training your model, but this shared dataset is the version from before fuzzy deduplication was applied.

This is correct: we compute the MinHash signatures in the same pass as the other quality signals. Note that these are just the signatures; to do fuzzy deduplication, you need to run LSH on them (see below for how to run this).
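
To make the two stages concrete, here is a small illustrative sketch using the third-party datasketch library. Word-level tokens, the 0.7 threshold and the toy documents are arbitrary choices for illustration; the pipeline's own signature scheme differs in its details:

from datasketch import MinHash, MinHashLSH

def signature(text, num_perm=128):
    # word-level tokens are an illustrative choice; the actual pipeline
    # defines its own tokenization/shingling scheme
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc-1": "the quick brown fox jumps over the lazy dog",
    "doc-2": "the quick brown fox jumps over the lazy dog in the park",
    "doc-3": "an entirely different document about llamas and alpacas",
}

# stage 1: per-document signatures (conceptually what the published
# .minhash.parquet files hold, albeit in a different format and scheme)
sigs = {doc_id: signature(text) for doc_id, text in docs.items()}

# stage 2: LSH over the signatures groups candidate near-duplicates
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for doc_id, m in sigs.items():
    lsh.insert(doc_id, m)

print(lsh.query(sigs["doc-1"]))  # doc-1 and doc-2 should come back as candidates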

> is the provided dataset the one to which fuzzy deduplication has also been applied?

The dataset we provide comes with the MinHash signatures, but not with the deduplication clusters. These need to be computed using the script in app/src/run_lsh.py.

Here is a minimal example you can run from the root of RedPajama-Data:

1) Download listings

DATA_ROOT="${HOME}/path/to/data" # make sure this is an absolute path
mkdir -p "${DATA_ROOT}/listings"
listings_file="listings/en-2023-06-head_middle.txt"
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/${listings_file}" -O "${DATA_ROOT}/${listings_file}"

2) Download MinHash signatures

# read the first 5 lines here to run the example
head -n5 "${DATA_ROOT}/${listings_file}" | while read -r line;
do
    url="https://data.together.xyz/redpajama-data-v2/v1.0.0/minhash/${line}.minhash.parquet"
    dest="${DATA_ROOT}/minhash/${line}.minhash.parquet"
    mkdir -p "$(dirname "$dest")"
    wget "$url" -O "$dest"
    # record the downloaded signature file for the LSH step below
    echo "minhash/${line}.minhash.parquet" >> "${DATA_ROOT}/minhash_listings.txt"
done

3) Run LSH at similarity level 0.7

cd app/
python3 src/run_lsh.py \
    --input_base_uri "file://${DATA_ROOT}/" \
    --output_dir "${DATA_ROOT}/minhash_clusters/" \
    --similarity 0.7 \
    --num_perm 128 \
    --listings "${DATA_ROOT}/minhash_listings.txt"

This will result in one parquet file for each input file, containing the MinHash cluster id for every (fuzzy duplicate) document in the corresponding documents file.
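
To show how the resulting cluster files can be used, here is a rough post-processing sketch that keeps one document per MinHash cluster and collects the rest as fuzzy duplicates. The column names doc_id and cluster_id are assumptions for illustration, so check the actual schema of your output parquet files:

import glob
import os
import pandas as pd

# adjust to the DATA_ROOT used in the shell example above
data_root = os.path.expanduser("~/path/to/data")
files = glob.glob(os.path.join(data_root, "minhash_clusters", "**", "*.parquet"), recursive=True)

# NOTE: "doc_id" and "cluster_id" are assumed column names; inspect
# df.columns on your own files to get the real schema
clusters = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
clusters = clusters.sort_values("doc_id")
dup_mask = clusters.duplicated(subset="cluster_id", keep="first")
duplicate_ids = set(clusters.loc[dup_mask, "doc_id"])
print(f"{len(duplicate_ids)} fuzzy-duplicate documents to filter out")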

kimcando commented 7 months ago

Thanks for replying. However, the provided deduplication was tested on only 200M documents, which is a very small number given the roughly 100B documents in total, and 200M documents can easily be deduplicated with well-known libraries. (For instance, say you used 80 snapshots; each single index then holds approximately 1.25B docs, so 200M documents is less than 20% of a single index.) Tackling a large volume, however, is another problem.

Therefore, my question is: when RedPajama-V2 is used for training models, a considerable amount of data must be deduplicated. In that situation (e.g., handling 20 trillion tokens), could you give me some hints on how many cores you used?

mauriceweber commented 7 months ago

Absolutely, the current LSH implementation does not scale to the entire dataset. I think that to do full fuzzy deduplication, you will need to use multiple nodes (the MinHashLSH implementation provided by BigCode is probably a good starting point).

With that said, a way forward with the single-node LSH implementation in src/run_lsh.py would be to first reduce the number of documents using exact dedup and quality filtering to get a smaller dataset, and only then run LSH (a rough sketch of that route follows below).
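
As a rough illustration of that two-stage route, the sketch below filters a toy corpus with a set of exact-duplicate ids and a stand-in quality threshold before anything is handed to LSH. The field names and the word-count filter are placeholders for illustration, not the repo's API:

# all names and thresholds below are illustrative placeholders
def reduced_corpus(documents, exact_duplicate_ids, min_word_count=50):
    """Yield documents that survive exact dedup and a toy quality filter.

    `documents` is an iterable of dicts with 'doc_id' and 'text';
    `exact_duplicate_ids` would come from the published duplicate-id files.
    """
    for doc in documents:
        if doc["doc_id"] in exact_duplicate_ids:
            continue                       # exact duplicates removed first (cheap)
        if len(doc["text"].split()) < min_word_count:
            continue                       # stand-in for real quality-signal filtering
        yield doc                          # only the survivors go into LSH

docs = [
    {"doc_id": "a", "text": "word " * 100},
    {"doc_id": "b", "text": "too short"},
    {"doc_id": "c", "text": "word " * 100},
]
print([d["doc_id"] for d in reduced_corpus(docs, exact_duplicate_ids={"c"})])  # -> ['a']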

To run LSH on 200M documents we used a machine with 500GB of RAM and 64 cores, and it took ~40 minutes. The exact (.wet document hash based) dedup with the Bloomfilter ran on the same machine in ~3.5 days for the 25B English documents.

edwardzjl commented 1 month ago

Is it possible to replace minhash with simhash? IIRC, dedup on exact match of simhash signatures is sufficient to remove near-duplicate documents.

mauriceweber commented 2 weeks ago

Hi @edwardzjl, you can use simhash for near deduplication, but you need to explicitly compute new hashes for that.
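
For context, here is a minimal from-scratch SimHash sketch (uniform token weights and MD5 as the underlying 64-bit hash are illustrative choices, not a recommendation). Near-identical documents tend to end up at a small Hamming distance; exact equality of fingerprints catches only the very closest pairs, while a small Hamming-distance threshold is the more tolerant criterion:

import hashlib

def simhash(text, bits=64):
    """Compute a SimHash fingerprint over word tokens with uniform weights."""
    v = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

docs = {
    "doc-1": "the quick brown fox jumps over the lazy dog",
    "doc-2": "the quick brown fox jumps over the lazy dog today",
    "doc-3": "a completely unrelated piece of text about llamas",
}
fp1, fp2, fp3 = (simhash(docs[k]) for k in ("doc-1", "doc-2", "doc-3"))
print(bin(fp1 ^ fp2).count("1"))  # typically a small Hamming distance for near-duplicates
print(bin(fp1 ^ fp3).count("1"))  # typically a much larger distance for unrelated text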