Open yuvalkirstain opened 2 years ago
Hi! The datasets_sql package lets you easily find distinct rows in a dataset (an example with SELECT DISTINCT
is in the readme). Deduplication is (still) not part of the official API because it's hard to implement for datasets bigger than RAM while only using the native PyArrow ops.
(Btw, this is a duplicate of https://github.com/huggingface/datasets/issues/2514)
Here is an example using the datasets_sql package mentioned above:
from datasets import load_dataset
from datasets_sql import query

dataset = load_dataset("imdb", split="train")
# If you don't have an id column, add one by enumerating the rows
dataset = dataset.add_column("id", range(len(dataset)))

id_column = "id"
unique_column = "text"

# For each group of duplicates, this keeps the row with the smallest id
unique_dataset = query(
    f"SELECT dataset.* FROM dataset "
    f"JOIN (SELECT MIN({id_column}) AS unique_id FROM dataset GROUP BY {unique_column}) "
    f"ON unique_id = dataset.{id_column}"
)
Not ideal for large datasets but good enough for basic cases. Sure would be nice to have in the library 🤗
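For basic cases that fit in memory, the same keep-first deduplication can be sketched in plain Python without SQL. The names below (`rows`, `dedup_keep_first`) are illustrative, not part of the datasets API; `rows` stands in for the dataset's records as a list of dicts.

```python
def dedup_keep_first(rows, unique_column):
    """Keep only the first row seen for each value of `unique_column`."""
    seen = set()
    out = []
    for row in rows:
        key = row[unique_column]
        if key not in seen:  # first occurrence of this value wins
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"id": 0, "text": "a"},
    {"id": 1, "text": "b"},
    {"id": 2, "text": "a"},  # duplicate of id 0, dropped
]
print(dedup_keep_first(rows, "text"))
```

The same predicate (a stateful closure over `seen`) could in principle be passed to `dataset.filter`, though that still requires the set of seen keys to fit in RAM, which is exactly the limitation mentioned above.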
**Is your feature request related to a problem? Please describe.**
Many large datasets are full of duplicates, and it has been shown that deduplicating datasets can lead to better performance during training and more truthful evaluation at test time.
A feature that allows one to easily deduplicate a dataset would be great!
**Describe the solution you'd like**
We could define a key function and keep only the first/last data point that yields each value according to this function.
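A minimal sketch of what such a function could look like, using a plain Python dict (which preserves insertion order). This is an illustration of the proposed behavior, not an existing datasets API; `deduplicate` and its parameters are hypothetical names.

```python
def deduplicate(rows, key, keep="first"):
    """Keep one row per value of key(row).

    keep="first": the first row seen for each key wins.
    keep="last":  later rows overwrite earlier ones for the same key.
    Illustrative sketch only, not the datasets library API.
    """
    best = {}
    for row in rows:
        k = key(row)
        if keep == "last" or k not in best:
            best[k] = row
    return list(best.values())
```

For example, `deduplicate(rows, key=lambda r: r["text"], keep="first")` would keep only the first row for each distinct `text` value.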
**Describe alternatives you've considered**
The clear alternative is to repeat boilerplate code every time someone wants to deduplicate a dataset.