Dataset to index - Githubissues

pinecone-io / pinecone-datasets

An open-source dataset library for pre-embedded dataset: create your own data catalog, or use Pinecone's public datasets.

https://pinecone-io.github.io/pinecone-datasets/

32 stars 12 forks source link

Dataset to index #16

Closed miararoy closed 1 year ago

miararoy commented 1 year ago

Problem

There is no simple way to take a pinecone dataset and upsert it into an index

Solution

to_index functionality that will use Pinecone SDK v3 async batch upsert to load data into pinecone, and additional utilities to help moving data from blob to metadata.

Type of Change

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] This change requires a documentation update
[ ] Infrastructure change (CI configs, etc)
[ ] Non-code change (docs, etc)
[ ] None of the above: (explain here)

Test Plan

Unittests, atm not covering the to_index functionality, which was separately tested. also note that it seems like beta environment is unstable

harsha2010 commented 1 year ago

why is this API this way? ie why ds.to_index(name) instead of index.upload(ds), or even better, pinecone.create_index(ds, index_config) ? ie, what were the tradeoffs made in coming up with this API?

igiloh-pinecone commented 1 year ago

why is this API this way? ie why ds.to_index(name) instead of index.upload(ds), or even better, pinecone.create_index(ds, index_config) ? ie, what were the tradeoffs made in coming up with this API?

@harsha2010 That is part of feedback we got when we presented Client V3 to the Engineering group.
According to the feedback, the client should only expose commands that translate directly to API calls. Any other functionalities (which usually require additional dependencies) should be deferred to pinecone-text and pinecone-datasets libraries.

@miararoy maybe a better name would be ds.upsert_to_index(), since you're not "converting" the ds, but rather upserting its contents.

igiloh-pinecone commented 1 year ago

That is part of feedback we got when we presented Client V3 to the Engineering group.
According to the feedback, the client should only expose commands that translate directly to API calls. Any other functionalities (which usually require additional dependencies) should be deferred to pinecone-text and pinecone-datasets libraries.

BTW, Client v2 does include an index.upsert_from_dataframe(ds.documents) method. The decision following that feedback was to deprecate it and instead migrate to ds.upsert_to_index() in pinecone-datasets (which is this PR).
@gdj0nes @jhamon is that still the case?

harsha2010 commented 1 year ago

thats still fine right? ie, nothing prevents you from having a pinecone_datasets.create_index(dataset, index_config) in fact, this would be preferred if you wanted to create an index from a dataset right? In fact, this way the dimension is read off the dataset and so is the index type (for example, if the dataset has a sparse column then it has to be a inner product metric index) Otherwise, ds.to_index() can allow for an index to have been created with some data that doesn;t match the configuration of the dataset etc. ie, depending on what you want to achieve, one could be a better API than the other.

harsha2010 commented 1 year ago

i think removing pandas/ numpy/ other dependencies we don;t need on the python client is the right thing to do. my question was more around implicit decisions being made in this API around whether an index is created from a dataset, or whether we are upserting into an existing index. In all cases, I think we;d be better off if we created an index from a dataset rather than upsert into an existing index. Since datasets are usually static and used as a mechanism to either bulk load an index or create a read only index in the first place

miararoy commented 1 year ago

right now the to_index support both upserting to existing and creating new index, do we want to only have the latter?

harsha2010 commented 1 year ago

right now the to_index support both upserting to existing and creating new index, do we want to only have the latter?

yes, I don;t see why it should be upserting into an existing index. Isn;t the dataset the full data for an index anyway?

miararoy commented 1 year ago

makes sense to me, one argument might be that this library is atm focused on new users, so you may want to show them how to set up an index. but makes sense to drop the 'existing index' path