huggingface / datasets


`dataset = dataset.map()` causes faiss index to be lost #3769

Open Oaklight opened 2 years ago

Oaklight commented 2 years ago

Describe the bug

The dataset returned by `map()` does not carry the faiss index, so assigning the result back to the original dataset variable loses the index.

Steps to reproduce the bug

`my_dataset` is a regular loaded dataset. It is part of a custom dataset wrapper class and is referred to as `self.dataset` in the snippets below.

self.dataset.add_faiss_index('embeddings')
self.dataset.list_indexes()
# ['embeddings']

dataset2 = self.dataset.map(
    lambda x: self._get_nearest_examples_batch(x['text']), batched=True
)

# the unexpected result:
dataset2.list_indexes()
# []

self.dataset.list_indexes()
# ['embeddings']

In case something is wrong with my `_get_nearest_examples_batch()`, here is what it looks like:

def _get_nearest_examples_batch(self, examples, k=5):
    # embed() turns the batch of texts into query vectors for the faiss index
    queries = embed(examples)
    scores_batch, retrievals_batch = self.dataset.get_nearest_examples_batch(
        self.faiss_column, queries, k
    )
    return {
        'neighbors': [batch['text'] for batch in retrievals_batch],
        'scores': scores_batch
    }
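
Until this is addressed, a possible workaround is to carry the index to the mapped dataset manually, either by rebuilding it or by saving it to disk and loading it onto the new dataset. This is only a sketch based on the snippet above: it assumes the 'embeddings' column is still present in `dataset2`, and the index file name is arbitrary.

# Option 1: rebuild the index on the new dataset (recomputes it from the column)
dataset2.add_faiss_index('embeddings')

# Option 2: transfer the existing index via the save/load API, without rebuilding
self.dataset.save_faiss_index('embeddings', 'embeddings.faiss')
dataset2.load_faiss_index('embeddings', 'embeddings.faiss')

dataset2.list_indexes()
# ['embeddings']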

Expected results

`map` shouldn't drop the indexes; in other words, the indexes should be carried over to the generated dataset.

Actual results

map drops the indexes

Environment info

lhoestq commented 2 years ago

Hi! Indeed, `map` is dropping the index right now, because one can create a dataset with more or fewer rows using `map` (and therefore the index might not be relevant anymore).

I guess we could check the resulting dataset length, and if the user hasn't changed the dataset size, we could keep the index. What do you think?
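
The idea can be approximated today with a small user-side wrapper. This is just a sketch: `map_keeping_indexes` is a hypothetical helper, and it relies on the internal `_indexes` dict of `Dataset` mentioned further below, so it may break across versions.

def map_keeping_indexes(dataset, *args, **kwargs):
    # call the regular map()
    mapped = dataset.map(*args, **kwargs)
    # only carry the indexes over if the number of rows is unchanged,
    # otherwise the index positions may no longer match the data
    if len(mapped) == len(dataset):
        mapped._indexes = dict(dataset._indexes)
    return mapped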

cceyda commented 2 years ago

Doing `.add_column("x", x_data)` also removes the index. The new column might be irrelevant to the index, so I don't think it should be dropped.

Minimal example

from datasets import load_dataset
import numpy as np

data = load_dataset("ceyda/cats_vs_dogs_sample")  # just a test dataset
data = data["train"]
embd_data = data.map(lambda x: {"emb": np.random.uniform(-1, 0, 50).astype(np.float32)})
embd_data.add_faiss_index(column="emb")
print(embd_data.list_indexes())  # ['emb']
embd_data = embd_data.add_column("x", [0] * data.num_rows)
print(embd_data.list_indexes())  # []  (the index was dropped)

lhoestq commented 2 years ago

I agree, `add_column` shouldn't drop the index indeed! Is it something you'd like to contribute? I think it's just a matter of copying the `self._indexes` dictionary to the output dataset.
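
Until the fix lands, the same trick can be applied on the user side for `add_column`. A sketch only: `add_column_keeping_indexes` is a hypothetical helper that relies on the internal `_indexes` dict mentioned above, so it is version dependent.

def add_column_keeping_indexes(dataset, name, column):
    # add the column as usual, then copy the existing indexes over;
    # this is safe because add_column does not change the rows the index points to
    new_dataset = dataset.add_column(name, column)
    new_dataset._indexes = dict(dataset._indexes)
    return new_dataset

# usage, in place of the plain add_column call in the minimal example above:
embd_data = add_column_keeping_indexes(embd_data, "x", [0] * embd_data.num_rows)
print(embd_data.list_indexes())  # ['emb']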