tmc / langchaingo

LangChain for Go, the easiest way to write LLM-based programs in Go
https://tmc.github.io/langchaingo/
MIT License

Question around chroma db embeder #764

Open XiaoConstantine opened 2 months ago

XiaoConstantine commented 2 months ago

This might be a stupid question, but when a client creates a Chroma store like this:

    store, errNs := chroma.New(
        chroma.WithChromaURL(localURL),
        chroma.WithEmbedder(embeder),
        chroma.WithDistanceFunction("cosine"),
        chroma.WithNameSpace(sessionString),
    )

and eventually:

    store.AddDocuments(ctx, docs)

Should I expect the rows in the created collection to contain embeddings? Right now the embeddings field is None, and I don't see embeder being used in the AddDocuments function either.

Here's the output I got when querying the Chroma collection:

{'ids': ['026fc10f-ee40-4247-97eb-18801ded699c' ....,
 'embeddings': None,
 'metadatas': [{'source': foo....},],
 'documents': ["blah.....]

devalexandre commented 2 months ago

You can take a look at the chroma-vectorstore-example example.

XiaoConstantine commented 2 months ago

@devalexandre Here's the collection I get after running the example you linked:

Out[6]:
{'ids': ['020afcd9-f07a-4e37-b742-013a58ddf722',
  '06868a94-e427-44a9-adff-cec70b00b035',
  '1f35898a-6660-4c21-b7f4-0929e115bb80',
  '2c2c73e9-0f66-4ac8-b8fb-56edda10e05f',
  '5ee1653b-89eb-4bfb-87a0-0d7a18708181',
  '61836373-f98b-426a-b9fb-49ebcce8587d',
  '790ee176-b7e3-4b51-b46f-6594afa1a364',
  '7abda7d1-c9ee-4f56-b56d-f34adc20530f',
  'b019a576-b104-4bc9-864f-47907b3fa0cb',
  'b17b8e05-0662-49dd-8bd4-587ee7c2206d',
  'c4279033-b568-456c-8ca3-d1e7eb7cffbc',
  'd0f0415f-fb22-4f6d-b848-47e6a92eab65',
  'd7884ce3-95a0-4389-8289-5cac5fd4f2d3'],
 'embeddings': None,
 'metadatas': [{'area': 1523,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 22.6},
  {'area': 707,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 0.04},
  {'area': 105,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 11},
  {'area': 341,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 1.59},
  {'area': 1572,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 9.5},
  {'area': 622,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 9.7},
  {'area': 918,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 0.42},
  {'area': 905,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 1.2},
  {'area': 326,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 2.3},
  {'area': 203,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 15.5},
  {'area': 641,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 6.9},
  {'area': 1200,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 13.7},
  {'area': 828,
   'nameSpace': 'ce41f18c-accd-4b98-8165-362f9406a2e0',
   'population': 1.46}],
 'documents': ['Sao Paulo',
  'Kazuno',
  'Paris',
  'Fukuoka',
  'London',
  'Tokyo',
  'Toyota',
  'Hiroshima',
  'Nagoya',
  'Buenos Aires',
  'Santiago',
  'Rio de Janeiro',
  'Kyoto'],
 'data': None,
 'uris': None}

The embeddings field is still empty, though my understanding was that, given an OpenAI API key, it would create an embedding function and generate embeddings from the documents. Is my understanding wrong here?

CrazyWr commented 2 months ago

@XiaoConstantine you should read the Chroma docs: https://docs.trychroma.com/usage-guide#adding-data-to-a-collection https://docs.trychroma.com/troubleshooting#using-get-or-query-embeddings-say-none

If Chroma is passed a list of documents, it will automatically tokenize and embed them with the collection's embedding function (the default will be used if none was supplied at collection creation). Chroma will also store the documents themselves. If the documents are too large to embed using the chosen embedding function, an exception will be raised.

Using .get or .query, embeddings say None

This is actually not an error. Embeddings are quite large and heavy to send back. Most applications don't use the underlying embeddings, so by default Chroma does not send them back. To send them back, add include=["embeddings", "documents", "metadatas", "distances"] to your query to return all information.

So you should query Chroma collections with include; the ChromaVector.SimilaritySearch function also supports the WithIncludes option.
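
To make the behaviour quoted above concrete, here is a minimal Python sketch of a toy in-memory collection. It is not Chroma's client or implementation: `ToyCollection` and `fake_embed` are hypothetical names invented for illustration. It only models the idea that vectors are computed and stored when documents are added, yet left out of results unless `include` asks for them.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (not Chroma's API)."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._ids: List[str] = []
        self._documents: List[str] = []
        self._embeddings: List[Vector] = []

    def add(self, ids: Sequence[str], documents: Sequence[str]) -> None:
        # Embeddings are computed and stored at add time.
        for doc_id, doc in zip(ids, documents):
            self._ids.append(doc_id)
            self._documents.append(doc)
            self._embeddings.append(self._embed(doc))

    def get(self, include: Sequence[str] = ("documents",)) -> Dict[str, Optional[list]]:
        # Fields that were not requested come back as None, even though
        # the underlying data is stored.
        wanted = set(include)
        return {
            "ids": list(self._ids),
            "embeddings": list(self._embeddings) if "embeddings" in wanted else None,
            "documents": list(self._documents) if "documents" in wanted else None,
        }


def fake_embed(text: str) -> Vector:
    # Hypothetical "embedding": [character count, vowel count].
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels)]


collection = ToyCollection(fake_embed)
collection.add(["id-1", "id-2"], ["Paris", "London"])

default_result = collection.get()
full_result = collection.get(include=["embeddings", "documents"])

print(default_result["embeddings"])  # None: stored, but not requested
print(full_result["embeddings"])     # the stored vectors
```

With the real Chroma Python client, the corresponding call on a collection would be along the lines of `collection.get(include=["embeddings", "documents", "metadatas"])`. As far as I know, "distances" is only meaningful for `.query`, so it is left out of the `.get` call here.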