pinecone-io / pinecone-datasets

An open-source dataset library for pre-embedded dataset: create your own data catalog, or use Pinecone's public datasets.

https://pinecone-io.github.io/pinecone-datasets/

32 stars 12 forks

support for dataset creation with sentence transformers #32

Closed HendrixString closed 1 year ago

HendrixString commented 1 year ago

Problem

@miararoy @igiloh-pinecone

It would be nice to do something like:

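from pinecone_datasets import Dataset

sentences = [
    "How do I get a replacement Medicare card?",
    "What is the monthly premium for Medicare Part B?",
]

# Proposed API: from_sentence_transformers is the interface requested in this issue
dataset = Dataset.from_sentence_transformers(
    'sentence-transformers/all-MiniLM-L6-v2',
    sentences,
)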

There are more than 124 embedding models that work with the Sentence Transformers library.

Solution

Create a simple static interface inside the Dataset class and connect it to the Sentence Transformers API.
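A minimal sketch of what such an interface might do internally, not the implementation from this change. The helper names `sentence_transformer_encoder` and `build_documents` are hypothetical, and the record layout (`id`, `values`, `metadata`) is an assumption about how the embedded sentences would be shaped before being handed to `Dataset`. The encoder is passed in as a callable so the record-building step can be exercised without downloading a model.

```python
from typing import Callable, Dict, List, Sequence

# An encoder maps a batch of sentences to one embedding vector per sentence.
Encoder = Callable[[Sequence[str]], List[List[float]]]


def sentence_transformer_encoder(model_name: str) -> Encoder:
    """Wrap a Sentence Transformers model as an Encoder (hypothetical helper)."""
    # Imported lazily so the rest of the module works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def encode(sentences: Sequence[str]) -> List[List[float]]:
        # encode() returns a NumPy array by default; convert to plain lists.
        return model.encode(list(sentences)).tolist()

    return encode


def build_documents(sentences: Sequence[str], encode: Encoder) -> List[Dict]:
    """Embed sentences and shape them as document records (assumed layout)."""
    vectors = encode(sentences)
    if len(vectors) != len(sentences):
        raise ValueError("encoder returned a different number of vectors than sentences")
    return [
        {"id": str(i), "values": list(vector), "metadata": {"text": sentence}}
        for i, (sentence, vector) in enumerate(zip(sentences, vectors))
    ]
```

With the library installed, `build_documents(sentences, sentence_transformer_encoder('sentence-transformers/all-MiniLM-L6-v2'))` would yield one record per sentence; a static method on `Dataset` could then wrap these records in a dataset object.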

Type of Change

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] This change requires a documentation update
[ ] Infrastructure change (CI configs, etc)
[ ] Non-code change (docs, etc)
[ ] None of the above: (explain here)

Test Plan

I have added a preliminary test that passes. I would love to add more tests to improve coverage for this.