pinecone-io / pinecone-datasets

An open-source dataset library for pre-embedded datasets: create your own data catalog, or use Pinecone's public datasets.
https://pinecone-io.github.io/pinecone-datasets/

[Bug] HttpError : Invalid bucket name: 'wikipedia-simple-text-embedding-ada-002-100K', 400 #35

Open David-GERARD opened 1 year ago

David-GERARD commented 1 year ago

Is this a new bug?

Current Behavior

Hi,

I used code from one of the example Colab notebooks on RAG with LangChain to build a lab on vector databases for students.

A minority of the students encountered the following error when importing the wikipedia-simple-text-embedding-ada-002-100K dataset from pinecone_datasets: [three screenshots of the error]

Expected Behavior

This cell is supposed to run and import the dataset (it works on my laptop and for most of the students).

Steps To Reproduce

In Python 3.11, with the package versions listed under Environment below, run pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K')

Relevant log output

No response

Environment

- **OS**: multiple (Windows and MacOS)
- **Language version**: python 3.11
- **Pinecone client version**: pinecone_datasets==0.6.2

Additional Context

None of our troubleshooting attempts worked, and we have not identified the common denominator that leads to this error. The list_datasets() method does list wikipedia-simple-text-embedding-ada-002-100K, so we suspected a server-side error.

martinohanlon commented 11 months ago

I have experienced the same issue.

This relates to https://community.pinecone.io/t/pinecone-datasets-httperror-invalid-bucket-name-wikipedia-simple-text-embedding-ada-002-100k-400/3715/3 .

The root cause is that the code uses os.path.join to build a gs:// file path, and on Windows that inserts a \ separator, e.g.

gs://catalog_base_path\dataset_id
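The separator behaviour can be reproduced on any operating system, because the standard library exposes the Windows and POSIX flavours of os.path as ntpath and posixpath. The snippet below is only an illustrative sketch: the base path gs://pinecone-datasets-dev is taken from the warning quoted later in this thread and is assumed to be the catalog base path.

```python
import ntpath      # the implementation behind os.path on Windows
import posixpath   # the implementation behind os.path on Linux/macOS

# Assumed catalog base path (taken from the warning quoted later in the thread).
base = "gs://pinecone-datasets-dev"
dataset_id = "wikipedia-simple-text-embedding-ada-002-100K"

# On Windows, os.path.join is ntpath.join and inserts a backslash.
windows_style = ntpath.join(base, dataset_id)

# On Linux/macOS, os.path.join is posixpath.join and inserts a forward slash.
posix_style = posixpath.join(base, dataset_id)

print(windows_style)
print(posix_style)
```

Only the second form is a valid gs:// URL, which would explain why the error shows up on some machines and not on others.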

The "dirty" fix is to modify this line of code https://github.com/pinecone-io/pinecone-datasets/blob/main/pinecone_datasets/dataset.py#L95

To

dataset_path = f"{catalog_base_path}/{dataset_id}"

But that won't work when the catalog_base_path is a local path.
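One way to cover both cases might be to branch on whether the base path looks like a URL. This is a minimal sketch, not the library's code: join_dataset_path is a hypothetical helper name, and treating any path containing "://" as remote is an assumption.

```python
import os


def join_dataset_path(catalog_base_path: str, dataset_id: str) -> str:
    """Hypothetical helper: join a catalog base path and a dataset id."""
    if "://" in catalog_base_path:
        # Remote URL (gs://, s3://, ...): always join with a forward slash,
        # regardless of the operating system.
        return catalog_base_path.rstrip("/") + "/" + dataset_id
    # Local filesystem path: let the OS choose the separator.
    return os.path.join(catalog_base_path, dataset_id)


remote = join_dataset_path(
    "gs://pinecone-datasets-dev",
    "wikipedia-simple-text-embedding-ada-002-100K",
)
local = join_dataset_path("my_catalog", "my_dataset")
print(remote)
print(local)
```

The remote branch never touches os.path, so it gives the same result on Windows, macOS, and Linux.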

David-GERARD commented 11 months ago

Thanks @martinohanlon !

martinohanlon commented 11 months ago

@David-GERARD I don't think the issue should be closed. It is a bug which should be fixed, imo.

captainkapnap commented 10 months ago

@martinohanlon your solution worked; however, another error pops up afterwards.

C:\Users\xxx\AppData\Roaming\Python\Python311\site-packages\pinecone_datasets\dataset.py:280: UserWarning: WARNING: No data found at: gs://pinecone-datasets-dev/youtube-transcripts-text-embedding-ada-002/documents/*.parquet. Returning empty DF
  warnings.warn(

Code in local Jupyter Notebook (Win10):

from pinecone_datasets import load_dataset, list_datasets
list_datasets()
dataset = load_dataset('youtube-transcripts-text-embedding-ada-002')
dataset.head()

^--- modified from: https://docs.pinecone.io/docs/using-public-datasets

The exact same code worked in a Google Colab notebook (@David-GERARD fyi)

pdebuyer commented 7 months ago

Hey. The dirtiest solution is to patch os.path.join at the beginning of dataset.py:

os.path.join = lambda *s: "/".join(s)

This should fix your issue @captainkapnap
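If editing the installed package is undesirable, the same patch could be applied temporarily from the calling code. The sketch below uses unittest.mock.patch from the standard library; forward_slash_join and slash_join are hypothetical names, and whether this also resolves the "No data found" warning above is not confirmed in this thread.

```python
import os
from contextlib import contextmanager
from unittest import mock


def slash_join(*parts):
    # Same behaviour as the lambda above: join every part with "/".
    return "/".join(parts)


@contextmanager
def forward_slash_join():
    """Hypothetical helper: replace os.path.join only inside the with-block."""
    with mock.patch("os.path.join", new=slash_join):
        yield


original_join = os.path.join
with forward_slash_join():
    # A call such as load_dataset(...) would go here instead.
    patched = os.path.join(
        "gs://pinecone-datasets-dev",
        "youtube-transcripts-text-embedding-ada-002",
    )

# The original function is restored once the block exits.
restored = os.path.join is original_join
print(patched, restored)
```

Note that a plain "/".join only accepts strings and ignores absolute-path semantics, so any other code that calls os.path.join while the patch is active may behave differently.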