pinecone-io / pinecone-datasets

An open-source dataset library for pre-embedded datasets: create your own data catalog, or use Pinecone's public datasets.
https://pinecone-io.github.io/pinecone-datasets/

Adding support for other Compatible S3 services #30

Closed HendrixString closed 1 year ago

HendrixString commented 1 year ago

Problem

Hello, maintainers (@miararoy @igiloh-pinecone). My name is Tomer Shalev.

It is highly desirable for organizations to use:

While I was experimenting with this library, I wanted to use my Cloudflare R2 buckets, but I couldn't, because the current code does not support general HTTP endpoints.

I made a small modification and now I can do the following:


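# Modified get_cloud_fs: `endpoint` points at the S3-compatible service
fs = get_cloud_fs(
    endpoint=f'https://{ACCOUNT_ID}.r2.cloudflarestorage.com/{BUCKET}',
    key=key, secret=secret, anon=False,
    # path-style addressing for the S3 client
    config_kwargs={'s3': {'addressing_style': 'path'}}
)

# Read an object through the returned filesystem
url = 's3://my-pincecone-folder/metadata.json'
with fs.open(url, mode='rb') as f:
    c = f.read()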

Solution

The change is small: modify get_cloud_fs so that it accepts HTTP endpoints. This would also let users work with a self-hosted MinIO server.
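
As a rough illustration of the idea (not the actual patch), the sketch below shows one way a custom endpoint could be passed through to s3fs. The helper names build_s3_kwargs and make_s3_fs are hypothetical and not part of pinecone-datasets. The sketch assumes the S3 backend is s3fs.S3FileSystem, which accepts client_kwargs (forwarded to the botocore client, where endpoint_url overrides the default AWS endpoint) and config_kwargs.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlparse


def build_s3_kwargs(
    endpoint: Optional[str] = None,
    key: Optional[str] = None,
    secret: Optional[str] = None,
    anon: bool = True,
    config_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for s3fs.S3FileSystem."""
    kwargs: Dict[str, Any] = {"anon": anon}
    if key is not None:
        kwargs["key"] = key
    if secret is not None:
        kwargs["secret"] = secret
    if config_kwargs:
        # Copy so the caller's dict is not shared with the filesystem object.
        kwargs["config_kwargs"] = dict(config_kwargs)
    if endpoint is not None:
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"expected an http(s) endpoint, got {endpoint!r}")
        # endpoint_url replaces the default AWS endpoint in the botocore client.
        kwargs["client_kwargs"] = {"endpoint_url": endpoint}
    # With no endpoint, the kwargs match a plain AWS S3 setup.
    return kwargs


def make_s3_fs(**options: Any):
    """Hypothetical wrapper: build the filesystem (requires s3fs to be installed)."""
    import s3fs  # imported lazily so build_s3_kwargs works without s3fs

    return s3fs.S3FileSystem(**build_s3_kwargs(**options))
```

When no endpoint is given, no client_kwargs entry is produced, so existing AWS S3 usage would be unaffected under these assumptions.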

Type of Change

Test Plan

This change is not meant to break previous functionality (but it should be tested).

Therefore: