huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Can datasets remove duplicated rows? #2514

Open liuxinglan opened 3 years ago

liuxinglan commented 3 years ago

Is your feature request related to a problem? Please describe. I find myself relying on datasets more and more to do all my preprocessing. One thing I couldn't figure out, however, is how to remove duplicated rows, so I always end up converting the dataset to pandas to do that.

Describe the solution you'd like Add a "remove duplicated rows" functionality.

Describe alternatives you've considered Convert the dataset to pandas, drop the duplicates, and convert back... (sketched below)

Additional context None.
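For reference, a minimal sketch of this pandas round trip (the "text" column name is just a placeholder):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["apple", "orange", "apple", "pear"]})

# Round-trip through pandas; drop_duplicates keeps the first occurrence of each value
df = ds.to_pandas().drop_duplicates(subset=["text"])
ds_deduped = Dataset.from_pandas(df, preserve_index=False)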

lhoestq commented 3 years ago

Hi ! For now this is probably the best option. We might add a feature like this in the future as well.

Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM? Otherwise we can do the deduplication in memory like pandas does, but I feel like this is going to be limiting for some cases.

mariosasko commented 3 years ago

Yes, I'd like to work on this feature once I'm done with #2500, but first I have to do some research, and see if the implementation wouldn't be too complex.

In the meantime, maybe this lib can help. However, note that this lib operates directly on pyarrow tables and relies only on the hash to find duplicates (e.g. -1 and -2 have the same hash in Python 3, so this lib would treat them as duplicates), which doesn't make much sense.
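For context, the collision mentioned above is easy to reproduce in CPython:

# In CPython, hash(-1) is reserved as an error code internally, so it collides with hash(-2)
assert hash(-1) == hash(-2) == -2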

liuxinglan commented 3 years ago

> Hi ! For now this is probably the best option. We might add a feature like this in the future as well.
>
> Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM? Otherwise we can do the deduplication in memory like pandas does, but I feel like this is going to be limiting for some cases.

Great if this can be done. Thanks!!

Not sure if you are asking me. In any case I don't know of any, unfortunately :( In practice, when the data is really large we normally deduplicate it with Spark (just for info; I understand this is not directly useful for developing this library).
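For reference, a minimal sketch of that kind of Spark-based deduplication (a local SparkSession and the "text" column name are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple",), ("orange",), ("apple",)], ["text"])

# dropDuplicates runs distributed across workers, so the data never has to fit in one machine's RAM
deduped = df.dropDuplicates(["text"])
deduped.show()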

Dref360 commented 2 years ago

Hello,

I'm also interested in this feature. Has there been progress on this issue?

Could we use a similar trick as above, but with a better hashing algorithm like SHA?

We could also use a Bloom filter; should we care a lot about collisions in this case?
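A minimal sketch of what swapping in SHA-256 could look like (the "content" column and the helper name are assumptions, not an agreed design):

import hashlib

def sha256_hash(example, column="content"):
    # SHA-256 of the UTF-8 encoded value; unlike Python's built-in hash(),
    # collisions are practically impossible
    return {"hash": hashlib.sha256(str(example[column]).encode("utf-8")).hexdigest()}

# usage: ds = ds.map(sha256_hash), then deduplicate on the "hash" column as in the snippets below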

Dref360 commented 2 years ago

For reference, we can get a solution fairly easily if we assume that all unique values fit in memory.

from functools import partial
from itertools import cycle
from typing import Any

from datasets import Dataset

memory = set()

def is_unique(elem: Any, column: str, memory: set) -> bool:
    # Keep the first occurrence of each value and drop every later one
    if elem[column] in memory:
        return False
    else:
        memory.add(elem[column])
        return True

# Example dataset
ds = Dataset.from_dict({"col1": [sent for i, sent in zip(range(10), cycle(["apple", "orange", "pear"]))],
                        "col2": [i % 5 for i in range(10)]})

# Drop duplicates in `ds` on "col1"
ds2 = ds.filter(partial(is_unique, column="col1", memory=memory))

Of course, we could improve the API and introduce Dataset.drop_duplicates. For the parallel version, we could use a shared-memory set.
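As an illustration, a hypothetical drop_duplicates helper wrapping the snippet above (not an existing datasets API, just a sketch):

def drop_duplicates(dataset: Dataset, column: str) -> Dataset:
    # Hypothetical helper: reuses is_unique and the imports from the snippet above
    seen = set()
    return dataset.filter(partial(is_unique, column=column, memory=seen))

# ds2 = drop_duplicates(ds, "col1")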

lvwerra commented 2 years ago

An approach that works, assuming you can hold all the unique document hashes in memory:

from datasets import load_dataset

def get_hash(example):
    """Get hash of content field."""
    return {"hash": hash(example["content"])}  # can use any hashing function here

def check_uniques(example, uniques):
    """Check if current hash is still in set of unique hashes and remove if true."""
    if example["hash"] in uniques:
        uniques.remove(example["hash"])  # first time this hash is seen: keep the row
        return True
    else:
        return False  # hash already removed, so this row is a duplicate

ds = load_dataset("some_dataset")
ds = ds.map(get_hash)             # add a "hash" column
uniques = set(ds.unique("hash"))  # all distinct hash values
ds_filter = ds.filter(check_uniques, fn_kwargs={"uniques": uniques})

If the uniques could be stored in Arrow then no additional memory would be used at all, but I don't know if this is possible.

StephennFernandes commented 2 years ago

@lvwerra hey, could you tell me how reliable this deduplication method is? I am currently using the same deduplication strategy to deduplicate a large text corpus to pretrain LLMs (~11B to 20B). I just need to make sure this strategy would be fine on large datasets for LLM pretraining.

Manel-Hik commented 7 months ago

Hi @StephennFernandes, I'm also trying to pretrain an LLM and need to deduplicate my dataset. Which method did you apply, please?

StephennFernandes commented 7 months ago

Hey @Manel-Hik

The following is a simpler yet really effective deduplication script that I have used in the past.

Given that I had a limited training corpus for the languages I wanted to train on, I relied on this code: https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned/blob/main/deduplicate.py

For more robust and stronger deduplication, refer to this newly released Hugging Face repo: https://github.com/huggingface/datatrove

Manel-Hik commented 7 months ago

Thanks a lot! Sure, I will check it, @StephennFernandes

fzyzcjy commented 5 months ago

Hi, are there any updates? Thanks!

Dref360 commented 1 month ago

Update July 2024

PyArrow now supports first/last aggregations, which would allow us to implement this functionality. Link

So if we want to move in this direction, we can :) Is that something we want to do? I would be happy to contribute.
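A minimal sketch of what that could look like with a recent PyArrow version (deduplicating on one key column and keeping the first value of the others; column names are placeholders):

import pyarrow as pa

table = pa.table({"col1": ["apple", "orange", "apple", "pear"],
                  "col2": [1, 2, 3, 4]})

# Group on the dedup key and keep the first value of every other column.
# The aggregated column comes back named "col2_first".
deduped = table.group_by("col1").aggregate([("col2", "first")])
print(deduped)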