huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Sentence deduplication output #261

Closed: solene-evain closed this issue 1 week ago

solene-evain commented 1 month ago

Hi,

I started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulties understanding the outputs of sentence_deduplication.py.

All I obtain are 'intermediate', 'sent_dups' and 'sent_sigs' folders.

1/ 'sent_sigs' is supposed to contain a signature for each document. I've got 15 docs but only 9 output folders in here, with 3 c4_sig files in each that I can't read.

2/ 'sent_dups' also contains 9 folders, with 2 c4_dup files in each. What do these files contain exactly?

3/ Where is the output of SentenceDedupFilter? The final stats seem to be okay: "Stats: {total: 15, doc_len: 259 [min=33, max=106, 64.75±35/doc], removed_sentences: 32 [min=2, max=5, 2.91±1/doc], original_sentences: 36 [min=2, max=5, 3.27±1/doc]}", but I can't figure out how, since I can't find any new version of the documents with the removed sentences.

Could you provide any help on that? Thanks

guipenedo commented 1 month ago

Hi, can you share the script you used? Normally you would add a Writer after the filter and choose where to save the final output yourself.

solene-evain commented 1 month ago

Yes sure!

I used this script: https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py

I modified it like this:

```python
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader, WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.pipeline.writers.disk_base import DiskWriter

"""
example on how to use sentence-deduplication. sentence-deduplication implements deduplication as in:
https://jmlr.org/papers/v21/20-074.html
'To deduplicate the data set, we discarded all but one of any three-sentence span occurring more than
once in the data set.'

to run deduplication we need to run three different pipelines:
pipeline 1: implements usual extraction + quality filtering, it ends with SentenceDedupSignature,
    prepended by a writer.
pipeline 2: implements only SentenceFindDedups
pipeline 3: implements SentenceDedupFilter, prepended by a reader of the same writer-kind used during
    stage 1, after the SentenceDedupFilter.
"""

# modify sentence dedup hyper params here
sent_dedup_config = SentDedupConfig(
    n_sentences=2,
    split_sentences=True,  # set to False to split on \n instead
    only_dedup_in_index=True,
    min_doc_words=1,
)

FINDER_WORKERS = 10  # this will speed up/parallelize step 2


def run_example():
    # 1. create a signature for each sentence in each doc
    pipeline_1 = [
        WarcReader(data_folder="warc/", limit=1000),
        JsonlReader(data_folder="./", paths_file="path_file.txt"),
        # Trafilatura(),
        # GopherQualityFilter(min_stop_words=0),
        # LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("sd_out/intermediate/"),
        SentenceDedupSignature(
            output_folder="sd_out/sent_sigs/",
            config=sent_dedup_config,
            language=Languages.french,
            finder_workers=FINDER_WORKERS,
        ),
    ]

    # 2. reads all the signatures and loads them to check for duplicates
    pipeline_2 = [
        SentenceFindDedups(
            data_folder="sd_out/sent_sigs/",
            output_folder="sd_out/sent_dups/",
            config=sent_dedup_config,
        )
    ]

    # 3. reads the document pipeline and removes duplicated sentences found before
    pipeline_3 = [
        JsonlReader(data_folder="sd_out/intermediate/"),
        SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == "__main__":
    run_example()
```
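
For reference, JsonlReader reads one JSON object per line and, by default, takes the document text from a "text" field (and the id from "id"), while paths_file lists the input files to read, one per line. Here is a minimal sketch of how input for the script above could be prepared (the file names and sample sentences are just illustrative):

```python
# Hypothetical input setup for the script above (file names and contents are
# made up for illustration; assumes JsonlReader's default "text"/"id" fields).
import json

docs = [
    {"id": "doc-1", "text": "Une phrase partagée. Une première phrase unique."},
    {"id": "doc-2", "text": "Une phrase partagée. Une seconde phrase unique."},
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# path_file.txt lists one input file per line, relative to data_folder="./"
with open("path_file.txt", "w", encoding="utf-8") as f:
    f.write("input.jsonl\n")
```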

guipenedo commented 1 month ago

You're indeed missing a writer after the filter:

```python
pipeline_3 = [
    JsonlReader(data_folder="sd_out/intermediate/"),
    SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    JsonlWriter(output_folder="sd_out/final_output/"),
]
```
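
The writer saves the deduplicated documents under sd_out/final_output/, one JSONL file per task, gzip-compressed by default (exact file names may vary by datatrove version). Something like this can be used to inspect them:

```python
# Sketch: inspect the deduplicated documents written by JsonlWriter.
# Assumes the default gzip-compressed .jsonl.gz output; adjust the glob otherwise.
import glob
import gzip
import json

for path in sorted(glob.glob("sd_out/final_output/*.jsonl.gz")):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            print(path, doc.get("id"), repr(doc["text"][:80]))
```
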
solene-evain commented 1 month ago

Thank you! Now I have what I need to understand the deduplication process 🙏

solene-evain commented 1 month ago

Suggestion: maybe this should be added to the original script https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py as it is missing there too!

guipenedo commented 1 week ago

Good catch, I've added it