Helper utilities to use Custom Embeddings

Glavin001 commented 1 year ago

Problem

The default embeddings (e.g. OpenAI's Ada-002) are great generalists, but they are not tailored to your specific use case.

Use Case

Write a full-stack TypeScript application that uses custom embeddings and can run on Edge Runtimes with limited features (1, 2).

Proposed Solution

🎉 Customizing Embeddings!

ℹ️ See my tutorial / lessons learned if you're interested in learning more, step-by-step, with screenshots and tips.

How it works

Embedding

flowchart LR

    subgraph "Similarity Search"
        direction LR

        CustomMatrix["Custom Embedding Matrix\n(e.g. custom-embedding.npy)"]
        Multiply["(Original Embedding) x (Matrix)"]
        CustomMatrix --> Multiply

        Text1["Original Texts #1, #2, #3..."]
        Raw1'["Original Embeddings #1, #2, #3, ..."]
        Custom1["Custom Embeddings #1, #2, #3, ..."]
        Text1-->Raw1'
        Raw1' --> Multiply
        Multiply --> Custom1

        DB["Vector Database"]
        Custom1 -->|Upsert| DB

        Search["Search Query"]
        EmbedSearch["Original Embedding for Search Query"]
        CustomEmbedSearch["Custom Embedding for Search Query"]

        Search-->EmbedSearch
        EmbedSearch-->Multiply
        Multiply-->CustomEmbedSearch

        SimilarFound["Similar Embeddings Found"]
        CustomEmbedSearch -->|Search| DB
        DB-->|Search Results|SimilarFound
    end
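
The flow in the diagram can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: the helper names (`applyMatrix`, `cosineSimilarity`) are hypothetical and not part of LangChain, the matrix is assumed to have shape [original dimension][custom dimension], the toy vectors stand in for real embeddings, and an in-memory sort stands in for the vector database.

```typescript
// Hypothetical helpers for illustration; none of these names come from LangChain.
type Vector = number[];
type Matrix = number[][]; // assumed shape: [originalDim][customDim]

/** Multiply a row vector by the matrix: out[j] = sum over i of v[i] * m[i][j]. */
function applyMatrix(v: Vector, m: Matrix): Vector {
  if (v.length !== m.length) {
    throw new Error(`Dimension mismatch: vector ${v.length}, matrix rows ${m.length}`);
  }
  const outDim = m.length > 0 ? m[0].length : 0;
  const out: number[] = new Array<number>(outDim).fill(0);
  for (let i = 0; i < v.length; i++) {
    for (let j = 0; j < outDim; j++) {
      out[j] += v[i] * m[i][j];
    }
  }
  return out;
}

/** Cosine similarity between two vectors of equal length. */
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy data: 3-dimensional "original" embeddings projected down to 2 dimensions.
const matrix: Matrix = [
  [1, 0],
  [0, 1],
  [0, 0],
];
const originalDocs: Vector[] = [
  [1, 0, 5],
  [0, 1, -3],
];

// The same matrix is applied to the documents (before upsert) and to the query.
const customDocs = originalDocs.map((d) => applyMatrix(d, matrix));
const customQuery = applyMatrix([0.9, 0.1, 7], matrix);

// Stand-in for the vector database search: rank documents by similarity.
const ranked = customDocs
  .map((d, index) => ({ index, score: cosineSimilarity(customQuery, d) }))
  .sort((a, b) => b.score - a.score);
```

The key point is that documents and queries must go through the same matrix, otherwise their custom embeddings are not comparable.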

Out of scope

Training the custom embedding matrix is out of scope, since it would require many other Python libraries. Instead, pre-train the matrix using LangChain for Python and store it as a NumPy .npy file.
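
Reading the stored matrix does not need Python or a filesystem API, which matters for Edge Runtimes. Below is a rough sketch of a minimal .npy reader. It is an assumption-laden illustration, not a full implementation: `parseNpy` and `NpyMatrix` are hypothetical names, and it only handles little-endian float32 (`<f4`) or float64 (`<f8`) arrays stored in C order on a little-endian host.

```typescript
// Hypothetical minimal reader; supports only '<f4' and '<f8' arrays in C order.
interface NpyMatrix {
  shape: number[];
  data: Float32Array | Float64Array;
}

function parseNpy(buffer: ArrayBuffer): NpyMatrix {
  const bytes = new Uint8Array(buffer);
  // Every .npy file starts with the magic bytes \x93NUMPY.
  const magic = [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59];
  for (let i = 0; i < magic.length; i++) {
    if (bytes[i] !== magic[i]) throw new Error("Not an .npy file");
  }

  // Byte 6 is the major format version. Version 1 stores the header length as
  // a little-endian uint16; versions 2 and later use a little-endian uint32.
  const view = new DataView(buffer);
  const major = bytes[6];
  const headerLen = major >= 2 ? view.getUint32(8, true) : view.getUint16(8, true);
  const headerStart = major >= 2 ? 12 : 10;

  // The header is a Python dict literal, e.g.
  // {'descr': '<f4', 'fortran_order': False, 'shape': (1536, 1536), }
  let header = "";
  for (let i = 0; i < headerLen; i++) {
    header += String.fromCharCode(bytes[headerStart + i]);
  }

  const descrMatch = /'descr':\s*'([^']+)'/.exec(header);
  const orderMatch = /'fortran_order':\s*(True|False)/.exec(header);
  const shapeMatch = /'shape':\s*\(([^)]*)\)/.exec(header);
  if (!descrMatch || !orderMatch || !shapeMatch) {
    throw new Error("Could not parse .npy header");
  }
  if (orderMatch[1] === "True") {
    throw new Error("Fortran-ordered arrays are not supported in this sketch");
  }

  const shape = shapeMatch[1]
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map(Number);
  const count = shape.reduce((a, b) => a * b, 1);

  // Copy the payload so the typed array starts at an aligned offset.
  const payload = buffer.slice(headerStart + headerLen);
  if (descrMatch[1] === "<f4") return { shape, data: new Float32Array(payload, 0, count) };
  if (descrMatch[1] === "<f8") return { shape, data: new Float64Array(payload, 0, count) };
  throw new Error(`Unsupported dtype: ${descrMatch[1]}`);
}
```

In an Edge Runtime the `ArrayBuffer` could come from `fetch(url).then((r) => r.arrayBuffer())`, and the flat `data` array can be read as rows of `shape[1]` values each.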

Example

import { CustomizeEmbeddings, OpenAIEmbeddings } from "langchain/embeddings";

/* === Generalized Embeddings === */
/* Embed queries */
const embeddings = new OpenAIEmbeddings();
const res = await embeddings.embedQuery("Hello world");

/* Embed documents */
const documentRes = await embeddings.embedDocuments(["Hello world", "Bye bye"]);

/* === Training Customized Embeddings === */
// Not supported in JavaScript/TypeScript. Use LangChain for Python: https://github.com/hwchase17/langchain/issues/1260

/* === Loading Customized Embeddings === */
// CustomizeEmbeddings is the helper proposed in this issue; it wraps an existing embeddings instance.
const customEmbeddings = new CustomizeEmbeddings(embeddings);
customEmbeddings.load("custom-embedding.npy");

/* Embed queries */
const customRes = await customEmbeddings.embedQuery("Hello world");

/* Embed documents */
const customDocumentRes = await customEmbeddings.embedDocuments(["Hello world", "Bye bye"]);
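
For discussion, here is one way the proposed wrapper could work internally. This is a sketch, not the eventual LangChain API: `EmbeddingsLike`, `MatrixProjectedEmbeddings`, and `fakeBase` are hypothetical names, the matrix is passed to the constructor instead of being loaded from a file, and the base embedder is a stand-in so the example runs without an API key.

```typescript
// Hypothetical structural interface covering only the two methods used in the example above.
interface EmbeddingsLike {
  embedQuery(text: string): Promise<number[]>;
  embedDocuments(texts: string[]): Promise<number[][]>;
}

// Hypothetical wrapper: delegates to a base embedder, then applies the matrix.
class MatrixProjectedEmbeddings implements EmbeddingsLike {
  private readonly base: EmbeddingsLike;
  private readonly matrix: number[][]; // assumed shape: [originalDim][customDim]

  constructor(base: EmbeddingsLike, matrix: number[][]) {
    this.base = base;
    this.matrix = matrix;
  }

  /** Multiply a row vector by the matrix: out[j] = sum over i of v[i] * matrix[i][j]. */
  project(v: number[]): number[] {
    if (v.length !== this.matrix.length) {
      throw new Error(`Dimension mismatch: vector ${v.length}, matrix rows ${this.matrix.length}`);
    }
    const outDim = this.matrix.length > 0 ? this.matrix[0].length : 0;
    const out: number[] = new Array<number>(outDim).fill(0);
    for (let i = 0; i < v.length; i++) {
      for (let j = 0; j < outDim; j++) {
        out[j] += v[i] * this.matrix[i][j];
      }
    }
    return out;
  }

  async embedQuery(text: string): Promise<number[]> {
    return this.project(await this.base.embedQuery(text));
  }

  async embedDocuments(texts: string[]): Promise<number[][]> {
    const originals = await this.base.embedDocuments(texts);
    return originals.map((v) => this.project(v));
  }
}

// Stand-in base embedder so the sketch runs offline; a real embeddings
// instance with the same two methods could be passed instead.
const fakeBase: EmbeddingsLike = {
  embedQuery: async (text) => [text.length, 1],
  embedDocuments: async (texts) => texts.map((t) => [t.length, 1]),
};

const projected = new MatrixProjectedEmbeddings(fakeBase, [
  [2, 0],
  [0, 3],
]);
```

Because the wrapper exposes the same two methods as the embedder it wraps, it could be dropped in wherever the original embeddings instance was used.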

Recommended Reading

P.S. I'd love to personally contribute this to the LangChain repo and community! Please let me know whether you think this is a valuable idea, and share any feedback on the proposed solution. Thank you!

rikuthinks commented 1 year ago

Thanks for bringing this up as an issue! +1

This would be an extremely valuable contribution to optimize for domain-specific use cases at scale, which is important when you're planning to have a high volume of queries. I would personally use this.

Here's one startup's feedback on why they decided to use custom embeddings: https://www.buildt.ai/blog/viral-ripout

Glavin001 commented 1 year ago

Thanks, @rikuthinks! Yes, I also highly recommend the Buildt blog post by @Pullerz. I included it in Recommended Reading above, along with others I wrote or read, so check them out.

I'm hoping to find some time this or next week to contribute a Pull Request 🤞

dosubot[bot] commented 1 year ago

Hi, @Glavin001! I'm here to help the LangChain team manage their backlog and I wanted to let you know that we are marking this issue as stale.

From what I understand, you opened this issue to discuss the need for helper utilities to use custom embeddings in a TypeScript application. The proposed solution involves customizing embeddings by multiplying the original embeddings with a custom matrix. It's great to see that rikuthinks and batmanscode have expressed their support for this idea, with rikuthinks mentioning the value of optimizing for domain-specific use cases. You thanked rikuthinks and mentioned your plan to contribute a pull request soon.

Before we mark this issue as stale, we wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution and we look forward to hearing from you soon!

Best, LangChain Team

langchain-ai / langchainjs