Oaklight opened this issue 2 years ago
Hi! Indeed, `map` is dropping the index right now, because one can create a dataset with more or fewer rows using `map` (and therefore the index might not be relevant anymore).

I guess we could check the resulting dataset's length: if the user hasn't changed the dataset size, we could keep the index. What do you think?
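A minimal sketch of that idea as a user-side helper, assuming indexes are stored in the private `_indexes` dict (the helper name is illustrative, not a library API):

```python
from datasets import Dataset

def map_keeping_indexes(dataset: Dataset, function, **map_kwargs) -> Dataset:
    """Hypothetical helper: run map() and carry the indexes over
    whenever the transform leaves the row count unchanged."""
    mapped = dataset.map(function, **map_kwargs)
    if len(mapped) == len(dataset):
        # Same number of rows: every index entry still points at the right
        # row, so copying the private _indexes dict keeps the index usable.
        mapped._indexes = dict(dataset._indexes)
    return mapped
```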
Doing `.add_column("x", x_data)` also removes the index. The new column might be irrelevant to the index, so I don't think the index should be dropped.
Minimal example
```python
from datasets import load_dataset
import numpy as np

data = load_dataset("ceyda/cats_vs_dogs_sample")  # just a test dataset
data = data["train"]

# add a random embedding column and build a FAISS index on it
embd_data = data.map(lambda x: {"emb": np.random.uniform(-1, 0, 50).astype(np.float32)})
embd_data.add_faiss_index(column="emb")
print(embd_data.list_indexes())  # ['emb']

# add_column returns a new dataset, and the index is gone
embd_data = embd_data.add_column("x", [0] * data.num_rows)
print(embd_data.list_indexes())  # []
```
I agree, `add_column` shouldn't drop the index indeed! Is it something you'd like to contribute? I think it's just a matter of copying the `self._indexes` dictionary to the output dataset.
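In the meantime, a user-side workaround in the same spirit is possible; it touches the private `_indexes` attribute, so treat it as a sketch rather than a supported API. It starts from `embd_data` right after the `add_faiss_index` call in the minimal example above:

```python
# add_column never changes the number of rows, so the existing FAISS index
# still matches the data row-for-row; copy it onto the new dataset.
with_x = embd_data.add_column("x", [0] * embd_data.num_rows)
with_x._indexes = dict(embd_data._indexes)
print(with_x.list_indexes())  # ['emb']
```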
Describe the bug
Assigning the resulting dataset back to the original dataset variable causes the loss of the FAISS index.
Steps to reproduce the bug
`my_dataset` is a regular loaded dataset; it's part of a custom dataset structure. In case something is wrong with my `_get_nearest_examples_batch()`, it's like this:
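(A minimal sketch of the reported pattern; the `emb` column, the index name, and the surrounding code are illustrative assumptions, not the original snippet.)

```python
import numpy as np
from datasets import Dataset

my_dataset = Dataset.from_dict({"text": ["a", "b", "c"]})
my_dataset = my_dataset.map(lambda x: {"emb": np.random.rand(8).astype(np.float32)})
my_dataset.add_faiss_index(column="emb")
print(my_dataset.list_indexes())  # ['emb']

# Reassigning the mapped result to the same variable replaces the indexed
# dataset with a fresh one that carries no indexes:
my_dataset = my_dataset.map(lambda x: {"text_len": len(x["text"])})
print(my_dataset.list_indexes())  # []

# A later get_nearest_examples_batch("emb", ...) call inside the custom
# _get_nearest_examples_batch() then raises datasets.search.MissingIndex.
```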
Expected results
`map` shouldn't drop the indexes; in other words, indexes should be carried over to the generated dataset.

Actual results

`map` drops the indexes.
Environment info
`datasets` version: 1.18.3