mongodb / chatbot

MongoDB Chatbot Framework. Powered by MongoDB and Atlas Vector Search.
https://mongodb.github.io/chatbot/
Apache License 2.0

Real-time ingest feature #466

Open eric-gardyn opened 2 months ago

eric-gardyn commented 2 months ago

Hi,

Is there a way to use the Ingest package to be more "real-time", API driven? Use case: We have an FAQ which is updated quite often in a CMS. Goal would be to trigger an ingestion of the content on every Create/Update/Delete operation in the CMS.

Is that possible with little effort?

mongodben commented 2 months ago

hi eric, long time no talk 😄

currently, there is no support for that, though you could write some custom routes/endpoints that wrap the ingest logic in the mongodb-rag-ingest package.

relatedly, we're moving some of that logic to the mongodb-rag-core library soon - https://github.com/mongodb/chatbot/pull/455 (though you'll still be able to consume it from the mongodb-rag-ingest lib)

eric-gardyn commented 2 months ago

FWIW, I now have the 'ingest' running as an endpoint on an Azure function app (serverless function). Just had to tweak the 'loadConfig' method in WithConfig.ts (I am running the repo's TypeScript files for 'rag-ingest') to correctly load the config. Otherwise, it works; it even helped me find a "bug" in my config object ;)

Next step is wrapper code that takes the source of the modified content (in my case, an external CMS) and calls the serverless endpoint accordingly.
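A minimal sketch of what such wrapper code might look like, assuming the CMS can deliver change events (for example through a webhook) and that the serverless endpoint accepts a JSON array of source names in the POST body. The event shape, the `SOURCE_BY_CONTENT_TYPE` mapping, the `cms-faq` source name, and the function names are all hypothetical and not part of the mongodb-rag-ingest package.

```typescript
// Hypothetical shape of a change notification coming from the CMS.
type CmsOperation = "create" | "update" | "delete";

interface CmsChangeEvent {
  operation: CmsOperation;
  contentType: string;
  entryId: string;
}

// Hypothetical mapping from CMS content types to ingest data source names.
const SOURCE_BY_CONTENT_TYPE: Record<string, string> = {
  faq: "cms-faq",
};

// Collect the distinct source names touched by a batch of CMS events.
// Content types without a mapped source are ignored.
function sourceNamesForEvents(events: CmsChangeEvent[]): string[] {
  const seen: Record<string, boolean> = {};
  const names: string[] = [];
  for (const event of events) {
    const name = SOURCE_BY_CONTENT_TYPE[event.contentType];
    if (name !== undefined && !seen[name]) {
      seen[name] = true;
      names.push(name);
    }
  }
  return names;
}

// Sends a JSON body with POST and resolves to the HTTP status code.
type PostJson = (url: string, body: unknown) => Promise<number>;

// Default transport: the global fetch available in Node 18+ and browsers.
const defaultPost: PostJson = async (url, body) => {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return res.status;
};

// Calls the ingest endpoint once per batch of CMS events.
// Returns the source names that were sent (empty if nothing was relevant).
async function triggerIngest(
  endpointUrl: string,
  events: CmsChangeEvent[],
  post: PostJson = defaultPost
): Promise<string[]> {
  const names = sourceNamesForEvents(events);
  if (names.length === 0) return names; // nothing to ingest, skip the call
  const status = await post(endpointUrl, names);
  if (status < 200 || status >= 300) {
    throw new Error(`Ingest endpoint responded with status ${status}`);
  }
  return names;
}
```

Batching events before calling `triggerIngest` keeps a burst of CMS edits from producing one ingest run per edit, since each source name is sent only once per call.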

mongodben commented 2 months ago

> FWIW, I now have the 'ingest' running as an endpoint on an Azure function app (serverless function).

nice! just to clarify what you mean, did you create an endpoint that's like POST /ingest to trigger the ingestion process?

did you make separate pages/embed endpoints? are there path parameters to do it by data source, i.e. POST /ingest/pages/:sourceName?


somewhat related, i think it would be really neat to have embedding occur as an event-based process whenever a page is updated. would be pretty straightforward with MongoDB change streams. you'd just need to build some basic event queue to process the page creation/change/deletion events to take into account rate limit issues with the embedding models.
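As a rough illustration of the "basic event queue" idea, here is a sketch of an in-memory queue that coalesces events per page and processes them in small batches with a pause in between, so that a burst of changes does not exceed an embedding model's rate limit. The `PageEvent` shape, the `PageEventQueue` class, and its parameters are assumptions made for this example. Wiring it to a real change stream (for example via the Node.js driver's `collection.watch()`) and to an actual embedding call is left out.

```typescript
// Hypothetical event shape derived from a page create/change/delete.
type PageEventType = "created" | "updated" | "deleted";

interface PageEvent {
  pageId: string;
  type: PageEventType;
}

// Whatever re-embeds (or removes embeddings for) a single page.
type PageEventHandler = (event: PageEvent) => Promise<void>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

class PageEventQueue {
  // Keyed by page ID so that only the latest event per page is kept.
  private pending = new Map<string, PageEvent>();
  private handler: PageEventHandler;
  private maxPerBatch: number;
  private delayMs: number;

  constructor(handler: PageEventHandler, maxPerBatch: number, delayMs: number) {
    this.handler = handler;
    this.maxPerBatch = Math.max(1, maxPerBatch); // guard against a zero-size batch
    this.delayMs = delayMs;
  }

  // A newer event for the same page replaces the older one.
  enqueue(event: PageEvent): void {
    this.pending.set(event.pageId, event);
  }

  get size(): number {
    return this.pending.size;
  }

  // Processes everything currently queued, at most maxPerBatch events
  // per batch, waiting delayMs between batches. Returns the number handled.
  // If the handler throws, the failing event is dropped and the rest stay queued.
  async drain(): Promise<number> {
    let processed = 0;
    while (this.pending.size > 0) {
      const pageIds = Array.from(this.pending.keys()).slice(0, this.maxPerBatch);
      for (const pageId of pageIds) {
        // Re-read the entry so an event enqueued mid-batch is not lost.
        const event = this.pending.get(pageId);
        if (event === undefined) continue;
        this.pending.delete(pageId);
        await this.handler(event);
        processed += 1;
      }
      if (this.pending.size > 0) await sleep(this.delayMs);
    }
    return processed;
  }
}
```

Coalescing by page ID means a page edited several times in quick succession is embedded once, using its latest state. An in-memory queue loses its contents on restart, so a durable queue would be the safer choice in production.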

eric-gardyn commented 2 months ago

yes, POST /ingest, which takes an array of strings (the source names) in the request body. it basically just uses withConfig like so:

    const resp = await withConfig(doAllCommand, { doPagesCommand, config, sourceNames })

changed doAllCommand args to

    type DoAllCommandArgs = {
      doPagesCommand: typeof standardDoPagesCommand
      sourceNames?: string[]
    }

and updated doAllCommand to call

    await doPagesCommand(config, { source: sourceNames })

    await doEmbedCommand(config, {
      since: lastSuccessfulRunDate ?? new Date('2023-01-01'),
      source: sourceNames,
    })

doPagesCommand and doEmbedCommand already took 'source' as string[]
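For readers who want to try something similar, the request-handling side of such a POST /ingest endpoint could be sketched as below. It is kept framework-agnostic and is not the code from this thread: `handleIngestRequest`, `parseSourceNames`, and the `IngestRunner` type are hypothetical names. The function that actually runs ingestion is injected, so in practice it would wrap a call like the `withConfig(doAllCommand, ...)` line above.

```typescript
// Hypothetical stand-in for whatever actually runs ingestion,
// e.g. a function wrapping withConfig(doAllCommand, { ... }).
type IngestRunner = (sourceNames: string[]) => Promise<void>;

interface HandlerResult {
  status: number;
  body: { message: string; sourceNames?: string[] };
}

// Returns the trimmed source names if the body is a non-empty array
// of non-empty strings, otherwise null.
function parseSourceNames(body: unknown): string[] | null {
  if (!Array.isArray(body) || body.length === 0) return null;
  const names: string[] = [];
  for (const item of body) {
    if (typeof item !== "string" || item.trim() === "") return null;
    names.push(item.trim());
  }
  return names;
}

// Validates the request body, runs ingestion for the requested sources,
// and maps the outcome to an HTTP-style status and JSON body.
async function handleIngestRequest(
  body: unknown,
  runIngest: IngestRunner
): Promise<HandlerResult> {
  const sourceNames = parseSourceNames(body);
  if (sourceNames === null) {
    return {
      status: 400,
      body: { message: "Expected a non-empty JSON array of source names" },
    };
  }
  try {
    await runIngest(sourceNames);
    return { status: 200, body: { message: "Ingest completed", sourceNames } };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: 500, body: { message: `Ingest failed: ${reason}` } };
  }
}
```

Keeping the handler independent of the hosting platform means the same function can be called from an Azure Functions HTTP trigger, an Express route, or a test, with only a thin adapter around it.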

mongodben commented 2 months ago

nice. this is great feedback. realistically, i don't think we'll create an ingest API anytime soon since we don't have a need for it on our end. however, i would like to cleanly expose the ingestion methods so you or others can do something like what you've done w/o having to do anything hacky. like a "MongoDB RAG Ingest SDK".