langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License
11.79k stars 1.97k forks source link

Why is `id` not a field on the `Document` interfaces? #4395

Open omikader opened 4 months ago

omikader commented 4 months ago

I imagine most vectorstore providers have the concept of a unique identifier to represent a particular vector in the database. Given that, should that identifier explicitly be part of the definition for Document, e.g.

export interface DocumentInterface<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> {
  // Proposing we add this
  id: string;

  pageContent: string;

  metadata: Metadata;
}

This would make manipulating vectors by ID (e.g. update, delete) that were retrieved using the similaritySearch methods much simpler because we wouldn't need to index id as a separate metadata attribute

const results = await pineconeStore.similaritySearch(query, 2);

// Would be great if this worked
await pineconeStore.delete({ ids: results.map((document) => document.id) });

// But instead we have to index the document ID as a separate metadata attribute and do this
await pineconeStore.delete({ id: results.map((document) => document.metadata.id });

And in some cases, id is autogenerated by Langchain which means it is unavailable on metadata making having it available on retrieval more important

Linking this similar issue that was closed before: https://github.com/langchain-ai/langchainjs/issues/2704

dosubot[bot] commented 4 months ago

🤖

That sounds like a valuable addition to the Document interface. It's great to see you taking the initiative to propose a solution. Adding the id field to the Document interface would indeed simplify the manipulation of vectors by ID. Your example clearly illustrates the potential benefits of this change. I encourage you to go ahead and create a pull request with your proposed changes. Your contribution will be appreciated by the community. Thank you for your initiative!

#### About Dosu This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
jacoblee93 commented 4 months ago

This has come up a few times - the issue is not every provider implements the concept of an id. We've got plans to refactor things around retrieval though (possibly introducing a new "Indexing" abstraction) and I would be in favor of adding it.

dosubot[bot] commented 1 month ago

Hi, @omikader,

I'm helping the langchainjs team manage their backlog and am marking this issue as stale. From what I understand, you proposed adding the id field to the Document interface to simplify vector manipulation by ID. I encouraged you to create a pull request with the proposed changes, and jacoblee93 expressed support for the addition, mentioning plans to refactor retrieval and introducing a new "Indexing" abstraction.

Could you please confirm if this issue is still relevant to the latest version of the langchainjs repository? If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

omikader commented 1 month ago

Yes, I think this issue is still relevant 🙂

jacoblee93 commented 1 month ago

We are planning a refactor of the retrieval interfaces after 0.2!