Closed miararoy closed 1 year ago
why is this API this way? ie why ds.to_index(name) instead of index.upload(ds), or even better, pinecone.create_index(ds, index_config)? ie, what were the tradeoffs made in coming up with this API?
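For concreteness, here is a rough sketch of the three call shapes being compared. All classes and functions below are toy stand-ins for illustration, not the real Pinecone client or pinecone-datasets API:

```python
class Index:
    """Toy stand-in for a vector index (not the real client)."""
    def __init__(self, name):
        self.name = name
        self.vectors = {}

    def upsert(self, records):
        for r in records:
            self.vectors[r["id"]] = r["values"]
        return len(records)


class Dataset:
    """Toy stand-in for a pinecone-datasets Dataset."""
    def __init__(self, records):
        self.records = records

    # Shape 1 (dataset-centric, as in this PR): ds.to_index(name)
    def to_index(self, name):
        index = Index(name)
        index.upsert(self.records)
        return index


# Shape 2 (index-centric): index.upload(ds)
def upload(index, ds):
    return index.upsert(ds.records)


# Shape 3 (client-centric): pinecone.create_index(ds, index_config)
def create_index(ds, index_config):
    index = Index(index_config["name"])
    index.upsert(ds.records)
    return index
```

The shapes differ mainly in which object owns the operation: the dataset (shape 1), an already-existing index (shape 2), or the top-level client, which also creates the index (shape 3).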
@harsha2010
That is part of the feedback we got when we presented Client V3 to the Engineering group. According to the feedback, the client should only expose commands that translate directly to API calls. Any other functionality (which usually requires additional dependencies) should be deferred to the pinecone-text and pinecone-datasets libraries.
@miararoy maybe a better name would be ds.upsert_to_index(), since you're not "converting" the ds, but rather upserting its contents.
BTW, Client v2 does include an index.upsert_from_dataframe(ds.documents) method. The decision following that feedback was to deprecate it and instead migrate to ds.upsert_to_index() in pinecone-datasets (which is this PR).
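A minimal sketch of what a batched upsert helper like ds.upsert_to_index() might do internally. The function names, the injected upsert_fn callback, and the batch size are all illustrative, not the actual library implementation:

```python
def iter_batches(records, batch_size=100):
    """Yield successive fixed-size batches from a list of records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def upsert_to_index(records, upsert_fn, batch_size=100):
    """Send records to an index in batches.

    upsert_fn stands in for the client's index.upsert() call and is
    assumed to return the number of records it accepted.
    """
    total = 0
    for batch in iter_batches(records, batch_size):
        total += upsert_fn(batch)
    return total
```

The real implementation described in this PR uses the Pinecone SDK v3 async batch upsert; the sketch above only shows the batching logic, synchronously.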
@gdj0nes @jhamon is that still the case?
that's still fine, right? ie, nothing prevents you from having a pinecone_datasets.create_index(dataset, index_config). In fact, this would be preferred if you wanted to create an index from a dataset: this way the dimension is read off the dataset, and so is the index type (for example, if the dataset has a sparse column then it has to be an inner-product metric index). Otherwise, ds.to_index() can allow an index to have been created with some data that doesn't match the configuration of the dataset, etc. ie, depending on what you want to achieve, one could be a better API than the other.
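The inference being described could look roughly like this. This is a hypothetical sketch of how a pinecone_datasets.create_index(dataset, ...) might derive the index configuration from the dataset itself; the function name and record layout are assumptions, not the actual library API:

```python
def infer_index_config(records):
    """Derive index dimension and metric from a list of record dicts.

    Assumes each record has a dense vector under "values" and,
    optionally, a sparse vector under "sparse_values".
    """
    if not records:
        raise ValueError("cannot infer a config from an empty dataset")
    dimension = len(records[0]["values"])
    # Sanity-check that every dense vector matches the inferred dimension,
    # catching the "data doesn't match the index config" failure mode.
    for r in records:
        if len(r["values"]) != dimension:
            raise ValueError("inconsistent vector dimension in dataset")
    # If any record carries a sparse vector, the index has to use an
    # inner-product (dotproduct) metric.
    has_sparse = any(r.get("sparse_values") for r in records)
    metric = "dotproduct" if has_sparse else "cosine"
    return {"dimension": dimension, "metric": metric}
```

Defaulting to cosine for dense-only datasets is an assumption here; the point is only that both dimension and metric are recoverable from the data.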
i think removing pandas/numpy/other dependencies we don't need from the Python client is the right thing to do. My question was more around the implicit decisions being made in this API about whether an index is created from a dataset, or whether we are upserting into an existing index. In all cases, I think we'd be better off creating an index from a dataset rather than upserting into an existing index, since datasets are usually static and used as a mechanism to either bulk-load an index or create a read-only index in the first place.
right now, to_index supports both upserting to an existing index and creating a new one. Do we want to only have the latter?
yes, I don't see why it should be upserting into an existing index. Isn't the dataset the full data for an index anyway?
makes sense to me. One argument might be that this library is at the moment focused on new users, so you may want to show them how to set up an index. But it makes sense to drop the 'existing index' path.
Problem
There is no simple way to take a Pinecone dataset and upsert it into an index.
Solution
to_index functionality that will use the Pinecone SDK v3 async batch upsert to load data into Pinecone, plus additional utilities to help move data from blob to metadata.
Type of Change
Test Plan
Unit tests, at the moment not covering the to_index functionality, which was tested separately. Also note that the beta environment seems unstable.