Closed solene-evain closed 2 months ago
Hi, can you share the script you used? Normally, you would add a Writer after the filter and choose yourself where to save the final output.
Yes sure!
I used this script: https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py
That I modified like this:

```python
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader, WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.pipeline.writers.disk_base import DiskWriter

"""
Example of how to use sentence deduplication. Sentence deduplication implements deduplication as in:
https://jmlr.org/papers/v21/20-074.html
"To deduplicate the data set, we discarded all but one of any three-sentence span occurring more
than once in the data set."

To run deduplication we need to run three different pipelines:
pipeline 1: implements the usual extraction + quality filtering; it ends with SentenceDedupSignature,
    preceded by a writer.
pipeline 2: implements only SentenceFindDedups.
pipeline 3: implements SentenceDedupFilter, preceded by a reader of the same writer-kind used during
    stage 1.
"""

sent_dedup_config = SentDedupConfig(
    n_sentences=2,
    split_sentences=True,  # set to False to split on \n instead
    only_dedup_in_index=True,
    min_doc_words=1,
)

FINDER_WORKERS = 10  # this will speed up/parallelize step 2


def run_example():
    pipeline_1 = [
        JsonlReader(data_folder="./", paths_file="path_file.txt"),
        # Trafilatura(),
        # GopherQualityFilter(min_stop_words=0),
        # LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("sd_out/intermediate/"),
        SentenceDedupSignature(
            output_folder="sd_out/sent_sigs/",
            config=sent_dedup_config,
            language=Languages.french,
            finder_workers=FINDER_WORKERS,
        ),
    ]

    pipeline_2 = [
        SentenceFindDedups(
            data_folder="sd_out/sent_sigs/",
            output_folder="sd_out/sent_dups/",
            config=sent_dedup_config,
        )
    ]

    pipeline_3 = [
        JsonlReader(data_folder="sd_out/intermediate/"),
        SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == "__main__":
    run_example()
```
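For intuition, the core idea from the quoted paper (hash every span of n consecutive sentences, and treat a repeated span hash as a duplicate) can be sketched in a few lines. This is a toy illustration, not datatrove's actual implementation:

```python
import hashlib


def span_signatures(sentences, n=3):
    """Hash every window of n consecutive sentences (toy sketch)."""
    return [
        hashlib.md5(" ".join(sentences[i : i + n]).encode()).hexdigest()
        for i in range(len(sentences) - n + 1)
    ]


doc_a = ["The sky is blue.", "Grass is green.", "Water is wet.", "Unique ending A."]
doc_b = ["Another opener.", "The sky is blue.", "Grass is green.", "Water is wet."]

# A 3-sentence span duplicated across documents shows up as a shared signature,
# which the dedup filter would then remove from all but one document.
shared = set(span_signatures(doc_a)) & set(span_signatures(doc_b))
print(len(shared))  # → 1
```

Datatrove splits this across three stages (signatures, finding duplicates, filtering) so each stage can be parallelized and its intermediate results written to disk.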
You're indeed missing a writer after the filter:

```python
pipeline_3 = [
    JsonlReader(data_folder="sd_out/intermediate/"),
    SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    JsonlWriter(output_folder="sd_out/final_output/"),
]
```
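Once the writer is in place, the final documents land under the writer's output folder as gzip-compressed JSONL (JsonlWriter's default). A quick standard-library way to inspect them is sketched below; the file name, and the `"text"`/`"id"` fields, are assumptions based on typical datatrove output, so adjust the glob pattern to your actual `sd_out/final_output/*.jsonl.gz` shards:

```python
import glob
import gzip
import json

# Create a small stand-in file so the sketch is self-contained;
# in practice your pipeline's writer produces these shards.
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps({"text": "A deduplicated document.", "id": "doc-0"}) + "\n")

# Inspect every output shard; replace the pattern with "sd_out/final_output/*.jsonl.gz"
for path in glob.glob("sample.jsonl.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            print(doc["id"], len(doc["text"]))
```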
Thank you! Now I've got the keys to understand the deduplication process 🙏
Suggestion: maybe this should be added to the original script https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py as it is missing there too!
Good catch, I've added it
Hi,
I started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulties understanding the outputs of sentence_deduplication.py.
All I obtain are 'intermediate', 'sent_dups' and 'sent_sigs' folders.
1/ 'sent_sigs' is supposed to contain a signature for each document. I've got 15 docs, but only 9 output folders in here, each with 3 c4_sig files that I can't read.
2/ 'sent_dups' also contains 9 folders, each with 2 c4_dup files. What do these files contain exactly?
3/ Where is the output of SentenceDedupFilter? The final stats seem to be okay: "Stats: {total: 15, doc_len: 259 [min=33, max=106, 64.75±35/doc], removed_sentences: 32 [min=2, max=5, 2.91±1/doc], original_sentences: 36 [min=2, max=5, 3.27±1/doc]}" but I can't quite figure out how, since I can't find any new version of the documents with the removed_sentences.
Could you provide any help on that? Thanks