huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

New Preprocessing Feature - Deduplication [Request] #4448

Open · yuvalkirstain opened this issue 2 years ago

yuvalkirstain commented 2 years ago

Is your feature request related to a problem? Please describe.
Many large datasets are full of duplicates, and it has been shown that deduplicating datasets can lead to better performance during training and more truthful evaluation at test time.

A feature that allows one to easily deduplicate a dataset would be cool!

Describe the solution you'd like
We could define a key function and keep only the first/last data point for each distinct value this function yields.

Describe alternatives you've considered
The obvious alternative is to repeat the same boilerplate every time someone wants to deduplicate a dataset.
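For reference, the boilerplate in question might look roughly like the sketch below. This is only a minimal, single-process sketch with a user-supplied key function (the deduplicate helper and its arguments are hypothetical, not an existing datasets API); the seen set grows with the number of unique keys, so it does not address the larger-than-RAM case.

from datasets import load_dataset

def deduplicate(dataset, key):
    # Hypothetical helper: keep only the first example for each distinct value of key(example).
    seen = set()

    def _is_first(example):
        k = key(example)
        if k in seen:
            return False
        seen.add(k)
        return True

    # A stateful closure like this only works with single-process filtering
    # (the default); with num_proc > 1 each worker would get its own `seen` set.
    return dataset.filter(_is_first)

dataset = load_dataset("imdb", split="train")
deduped = deduplicate(dataset, key=lambda example: example["text"])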

mariosasko commented 2 years ago

Hi! The datasets_sql package lets you easily find distinct rows in a dataset (an example with SELECT DISTINCT is in the readme). Deduplication is (still) not part of the official API because it's hard to implement for datasets bigger than RAM while only using the native PyArrow ops.
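For completeness, a SELECT DISTINCT query with datasets_sql looks roughly like this (adapted from the package's README; the imdb dataset and the text column are just placeholders). Note that it returns only the distinct values of the selected column, not whole rows:

from datasets import load_dataset
from datasets_sql import query

dataset = load_dataset("imdb", split="train")

# Distinct values of the "text" column; other columns are dropped.
unique_texts = query("SELECT DISTINCT text FROM dataset")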

(Btw, this is a duplicate of https://github.com/huggingface/datasets/issues/2514)

cceyda commented 1 year ago

Here is an example using the datasets_sql package mentioned above:

from datasets import load_dataset
from datasets_sql import query

dataset = load_dataset("imdb", split="train")

# If you don't have an id column, just add one by enumerating the rows
dataset = dataset.add_column("id", list(range(len(dataset))))

id_column = 'id'
unique_column = 'text'

# For each group of duplicate rows, keep the one with the smallest id (the first occurrence)
unique_dataset = query(f"SELECT dataset.* FROM dataset JOIN (SELECT MIN({id_column}) AS unique_id FROM dataset GROUP BY {unique_column}) ON unique_id = dataset.{id_column}")

Not ideal for large datasets but good enough for basic cases. Sure would be nice to have in the library 🤗
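One possible tweak for long text columns (a sketch, not something from this thread): hash the column first so that the GROUP BY column or the seen set holds fixed-size digests instead of full documents, assuming exact-match deduplication is enough.

import hashlib
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

# Add a fixed-size digest of the text column; exact duplicates share a digest.
dataset = dataset.map(
    lambda example: {"text_md5": hashlib.md5(example["text"].encode("utf-8")).hexdigest()}
)
# ...then deduplicate on "text_md5" with either approach above.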