Open liuxinglan opened 3 years ago
Hi! For now this is probably the best option. We might add a feature like this in the future as well.
Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM? Otherwise we could do the deduplication in memory like pandas, but I feel like this is going to be limiting for some cases.
Yes, I'd like to work on this feature once I'm done with #2500, but first I have to do some research, and see if the implementation wouldn't be too complex.
In the meantime, maybe this lib can help. However, note that this lib operates directly on pyarrow tables and relies only on hashes to find duplicates (e.g. `-1` and `-2` have the same hash in Python 3, so this lib would treat them as duplicates), which doesn't make much sense.
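The collision mentioned above can be checked directly in a Python REPL; in CPython, `-1` is reserved as an error sentinel in the C-level hash API, so it hashes to the same value as `-2`:

```python
# In CPython, hash(-1) is remapped to -2, colliding with hash(-2).
# Any hash-only deduplication would therefore conflate these two values.
print(hash(-1) == hash(-2))  # True in CPython
```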
Great if this can be done. Thanks!!
Not sure if you are asking me. In any case, I don't know of any, unfortunately :( In practice, if the data is really large we normally do it with Spark (only for info; I understand this is not useful for developing this library).
Hello,
I'm also interested in this feature. Has there been progress on this issue?
Could we use a similar trick as above, but with a better hashing algorithm like SHA?
We could also use a Bloom filter; how much should we care about collisions in this case?
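A stable hash along the lines suggested here could be built on `hashlib` from the standard library (a sketch; the function name is my own, not an API from `datasets`):

```python
import hashlib

def stable_hash(text: str) -> str:
    # SHA-256 digest of the UTF-8 encoded text. Unlike the built-in
    # hash(), this is stable across processes and Python versions, and
    # collisions are cryptographically unlikely.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The digest is deterministic, so it can also serve as the key for a Bloom filter or an on-disk index.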
For reference, we can get a solution fairly easily if we assume that we can hold all unique values in memory.
```python
from typing import Any
from functools import partial
from itertools import cycle

from datasets import Dataset

memory = set()

def is_unique(elem: Any, column: str, memory: set) -> bool:
    if elem[column] in memory:
        return False
    memory.add(elem[column])
    return True

# Example dataset
ds = Dataset.from_dict({
    "col1": [sent for i, sent in zip(range(10), cycle(["apple", "orange", "pear"]))],
    "col2": [i % 5 for i in range(10)],
})

# Drop duplicates in `ds` on "col1"
ds2 = ds.filter(partial(is_unique, column="col1", memory=memory))
```
Of course, we can improve the API so that we can introduce `Dataset.drop_duplicates`.
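A minimal sketch of what such a hypothetical `drop_duplicates` helper could look like, written here over plain lists of dicts so it is self-contained; with `datasets`, the same predicate plugs straight into `ds.filter(partial(is_unique, column=column, memory=set()))`:

```python
def is_unique(elem: dict, column: str, memory: set) -> bool:
    # Keep only the first occurrence of each value in `column`.
    if elem[column] in memory:
        return False
    memory.add(elem[column])
    return True

def drop_duplicates(rows, column: str) -> list:
    # Hypothetical helper (not part of the datasets API): filters any
    # iterable of dict-like examples down to first occurrences.
    memory: set = set()
    return [row for row in rows if is_unique(row, column, memory)]
```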
For the parallel version, we can use a shared memory set.
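One way to sketch that shared set is with worker threads and a lock (note: this is my own illustration of the idea; `datasets`' `num_proc` parallelism uses processes, so a real implementation would need a process-safe structure such as one backed by `multiprocessing.Manager`):

```python
import threading

def make_thread_safe_is_unique(column: str):
    # A lock-guarded set shared by all callers of the returned
    # predicate, so concurrent worker threads agree on what has
    # already been seen.
    memory: set = set()
    lock = threading.Lock()

    def is_unique(elem: dict) -> bool:
        with lock:
            if elem[column] in memory:
                return False
            memory.add(elem[column])
            return True

    return is_unique
```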
An approach that works assuming you can hold all the unique document hashes in memory:
```python
from datasets import load_dataset

def get_hash(example):
    """Get hash of the content field."""
    return {"hash": hash(example["content"])}  # can use any hashing function here

def check_uniques(example, uniques):
    """Keep the example if its hash is still in the set of unique hashes, removing it so later duplicates are dropped."""
    if example["hash"] in uniques:
        uniques.remove(example["hash"])
        return True
    return False

ds = load_dataset("some_dataset")
ds = ds.map(get_hash)
uniques = set(ds.unique("hash"))
ds_filter = ds.filter(check_uniques, fn_kwargs={"uniques": uniques})
```
If the `uniques` set could be stored in Arrow then no additional memory would be used at all, but I don't know if this is possible.
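The same map → unique → filter flow can be walked through on plain dicts (a toy example using the built-in `hash`, with the collision caveats discussed earlier):

```python
# Toy rows standing in for a dataset with a "content" column.
rows = [{"content": "apple"}, {"content": "pear"}, {"content": "apple"}]

# Equivalent of ds.map(get_hash): attach a hash column.
rows = [{**r, "hash": hash(r["content"])} for r in rows]

# Equivalent of set(ds.unique("hash")): one entry per distinct hash.
uniques = {r["hash"] for r in rows}

def check_uniques(example, uniques):
    # Keep the first row carrying each hash; drop later repeats.
    if example["hash"] in uniques:
        uniques.remove(example["hash"])
        return True
    return False

# Equivalent of ds.filter(check_uniques, ...).
kept = [r for r in rows if check_uniques(r, uniques)]
```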
@lvwerra hey, could you tell me how reliable this deduplication method is? I am currently using the same deduplication strategy to deduplicate a large text corpus to pretrain LLMs (~11B to 20B). I just need to be sure this strategy would be fine on large datasets for LLM pretraining.
Hi @StephennFernandes, I'm also trying to pretrain an LLM and need to deduplicate my dataset. Which method did you apply, please?
Hey @Manel-Hik
The following is a simpler yet really effective deduplication script that I have used in the past.
Given that I had a limited training corpus for the languages I wanted to train on, I relied on this code: https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned/blob/main/deduplicate.py
For more robust and stronger deduplication, refer to this newly released Hugging Face repo: https://github.com/huggingface/datatrove
Thanks a lot! Sure, I will check it @StephennFernandes
Hi, is there any updates? Thanks!
Is your feature request related to a problem? Please describe. I find myself relying more and more on datasets just to do all the preprocessing. One thing, however: for removing duplicated rows, I couldn't find out how, and I am always converting datasets to pandas to do that.
Describe the solution you'd like Have a "remove duplicated rows" functionality.
Describe alternatives you've considered Convert the dataset to pandas, remove duplicates, and convert back...
Additional context No
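The pandas round-trip workaround described above can be sketched on a toy frame like this (with `datasets`, the conversions would be `ds.to_pandas()` and `Dataset.from_pandas(df)`; omitted here so the sketch stands alone):

```python
import pandas as pd

# Toy frame standing in for ds.to_pandas().
df = pd.DataFrame({"col1": ["apple", "orange", "apple"], "col2": [1, 2, 3]})

# Drop rows whose "col1" value was already seen, keeping the first
# occurrence, then reset the index so Dataset.from_pandas doesn't
# pick up a stale index column.
deduped = df.drop_duplicates(subset=["col1"], keep="first").reset_index(drop=True)
```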