pinecone-io / pinecone-datasets

An open-source dataset library for pre-embedded datasets: create your own data catalog, or use Pinecone's public datasets.
https://pinecone-io.github.io/pinecone-datasets/

[Bug] Unable to load yfcc-10M-filter-euclidean dataset #45

Open yudhiesh opened 6 months ago

yudhiesh commented 6 months ago

Is this a new bug?

Current Behavior

Loading the yfcc-10M-filter-euclidean dataset fails with the error FileNotFoundError: Dataset does not exist. Please check the path or dataset_id

Expected Behavior

The dataset should load, since it is listed by list_datasets().

Steps To Reproduce

from pinecone_datasets import list_datasets, load_dataset

# The dataset name appears in the public catalog listing...
datasets = list_datasets()
dataset_name = "yfcc-10M-filter-euclidean"
assert dataset_name in datasets, "Dataset does not exist!"
# ...but loading it raises FileNotFoundError.
dataset = load_dataset(dataset_name)

Relevant log output

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 load_dataset('yfcc-10M-filter-euclidean')

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/public.py:59, in load_dataset(dataset_id, **kwargs)
     57     raise FileNotFoundError(f"Dataset {dataset_id} not found in catalog")
     58 else:
---> 59     return Dataset.from_catalog(dataset_id, **kwargs)

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:89, in Dataset.from_catalog(cls, dataset_id, catalog_base_path, **kwargs)
     83 catalog_base_path = (
     84     catalog_base_path
     85     if catalog_base_path
     86     else os.environ.get("DATASETS_CATALOG_BASEPATH", cfg.Storage.endpoint)
     87 )
     88 dataset_path = os.path.join(catalog_base_path, f"{dataset_id}")
---> 89 return cls(dataset_path=dataset_path, **kwargs)

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:190, in Dataset.__init__(self, dataset_path, **kwargs)
    188     self._dataset_path = dataset_path
    189     if not self._fs.exists(self._dataset_path):
--> 190         raise FileNotFoundError(
    191             "Dataset does not exist. Please check the path or dataset_id"
    192         )
    193 else:
    194     self._fs = None

FileNotFoundError: Dataset does not exist. Please check the path or dataset_id
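
For reference, the traceback shows the dataset path being built from a base path plus the dataset ID before the existence check runs. Below is a minimal sketch of that resolution, mirroring lines 83-88 of dataset.py as quoted above. The function name `resolve_dataset_path` and the `default_endpoint` placeholder are hypothetical; the real default comes from `cfg.Storage.endpoint`, whose value is not shown in this report.

```python
import os


def resolve_dataset_path(dataset_id, catalog_base_path=None,
                         default_endpoint="gs://example-catalog/"):
    """Sketch of the path resolution seen in the traceback.

    default_endpoint is a placeholder (assumption), standing in for
    cfg.Storage.endpoint from the library.
    """
    # An explicit base path wins; otherwise fall back to the environment
    # variable, and finally to the default endpoint.
    base = (
        catalog_base_path
        if catalog_base_path
        else os.environ.get("DATASETS_CATALOG_BASEPATH", default_endpoint)
    )
    # The dataset ID is appended directly to the base path.
    return os.path.join(base, f"{dataset_id}")


# Example with a hypothetical bucket name.
print(resolve_dataset_path("yfcc-10M-filter-euclidean", "gs://my-bucket/"))
```

If the resolved path does not exist in storage, the existence check in `Dataset.__init__` raises the FileNotFoundError shown above, regardless of whether the catalog lists the dataset.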

Environment

- **OS**: macOS 14.4.1
- **Language version**: Python 3.10.10
- **Pinecone client version**: 0.7.0

Additional Context

Inspecting the catalog metadata for this dataset:

from pinecone_datasets import list_datasets

# Fetch the catalog as a DataFrame and select the row for this dataset.
datasets = list_datasets(as_df=True)
dataset_name = "yfcc-10M-filter-euclidean"
datasets.query('name == @dataset_name').to_dict()

The result shows no bucket for this entry (bucket is None), which suggests the data is not in the bucket:

{'name': {27: 'yfcc-10M-filter-euclidean'},
 'created_at': {27: '2023-08-24 13:51:29.136759'},
 'documents': {27: 10000000},
 'queries': {27: 100000},
 'source': {27: 'big-ann-challenge 2023'},
 'license': {27: None},
 'bucket': {27: None},
 'task': {27: None},
 'dense_model': {27: {'name': 'yfcc', 'tokenizer': None, 'dimension': 192}},
 'sparse_model': {27: None},
 'description': {27: 'Dataset from the 2023 big ann challenge - filter track. Distance: Euclidean. see https://big-ann-benchmarks.com/neurips23.html'},
 'tags': {27: None},
 'args': {27: None}}
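
To check whether other catalog entries are affected in the same way, something like the following could be used. This is only a sketch: `entries_missing_bucket` is a hypothetical helper, the sample records are made up apart from the dataset name from this report, and it assumes a missing bucket is what makes an entry unloadable, which has not been confirmed.

```python
def _is_missing(value):
    # Treat None and NaN as missing; NaN is the only value not equal to itself.
    return value is None or value != value


def entries_missing_bucket(records):
    """Return the names of catalog records whose 'bucket' field is missing."""
    return [r["name"] for r in records if _is_missing(r.get("bucket"))]


# Hand-built sample records (the second entry is hypothetical).
sample = [
    {"name": "yfcc-10M-filter-euclidean", "bucket": None},
    {"name": "hypothetical-dataset", "bucket": "gs://example-bucket/hypothetical-dataset"},
]
print(entries_missing_bucket(sample))
```

Against the real catalog, the records could come from `list_datasets(as_df=True).to_dict("records")`.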