pinecone-io / pinecone-datasets

An open-source dataset library for pre-embedded datasets: create your own data catalog, or use Pinecone's public datasets.
https://pinecone-io.github.io/pinecone-datasets/

support for dataset creation with sentence transformers #32

Closed HendrixString closed 1 year ago

HendrixString commented 1 year ago

Problem

@miararoy @igiloh-pinecone

It would be nice to do something like:

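from pinecone_datasets import Dataset

sentences = [
    "How do I get a replacement Medicare card?",
    "What is the monthly premium for Medicare Part B?",
]

# Proposed API: from_sentence_transformers is the method this issue asks for.
dataset = Dataset.from_sentence_transformers(
    "sentence-transformers/all-MiniLM-L6-v2",
    sentences,
)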

More than 124 embedding models work with the Sentence Transformers library.

Solution

Add a simple static method to the Dataset class that connects it to the Sentence Transformers API.
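As a rough illustration of what such a method might do internally, here is a minimal sketch and not the actual implementation. The helper name `records_from_sentences` is hypothetical, and the `id` / `values` / `metadata` record layout is an assumption about what the Dataset class would want as input. The encoder is passed in as a callable so the sketch does not depend on any particular model.

```python
from typing import Callable, Dict, List, Sequence


def records_from_sentences(
    sentences: Sequence[str],
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[Dict[str, object]]:
    """Embed each sentence and return one record per sentence.

    `encode` is any callable that maps a list of strings to one vector
    per string. Hypothetical helper, not part of pinecone-datasets.
    """
    texts = list(sentences)
    vectors = encode(texts)
    if len(vectors) != len(texts):
        raise ValueError("encoder must return one vector per sentence")
    return [
        {
            "id": str(i),                           # assumed: positional string ids
            "values": [float(x) for x in vector],   # dense embedding as plain floats
            "metadata": {"text": text},             # assumed: keep the source sentence
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]


# With sentence-transformers installed, the encoder could be supplied like this:
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
#   records = records_from_sentences(sentences, model.encode)
```

A static `Dataset.from_sentence_transformers` could then load the model by name, call a helper like this, and wrap the resulting records in a Dataset. Taking the encoder as a parameter also makes the logic easy to test with a fake encoder, without downloading a model.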

Type of Change

Test Plan

I have added a preliminary test that passes. I would love to add more tests to improve coverage.