mongodb / chatbot

MongoDB Chatbot Framework. Powered by MongoDB and Atlas Vector Search.
https://mongodb.github.io/chatbot/
Apache License 2.0

Refactor Snooty/Docs data sources to config file and import the data. #168

Closed: jhyearsley closed this issue 1 year ago

jhyearsley commented 1 year ago

Moved the static JSON to a config file. I don't think it's the perfect solution, but it's an incremental step toward the right one.

jhyearsley commented 1 year ago

@mongodben interesting point you make about not being convinced the ingest refactor is the optimal design; I have some similar hazy intuition. It sounds like with the design we're discussing, adding a new source would require:

  1. adding new source to projectSources.ts
  2. adding the sources to ingest.config.js
  3. adding the new config file e.g. customer-success.ingest.config.js

It does seem like we should be able to do it in fewer steps.

cbush commented 1 year ago

Here's how I see it going:

  • We implement plugin loader a la Bluehawk
  • You npm install ingest as a dependency, it provides the CLI too
  • Implement your data source(s) in the plugin
  • CLI gets data sources from the plugin

Will need to flesh it out, but this is my basic thinking.

mongodben commented 1 year ago

Here's how I see it going:

  • We implement plugin loader a la Bluehawk

makes sense to me that other data sources can be plugins

  • You npm install ingest as a dependency, it provides the CLI too

👍

  • Implement your data source(s) in the plugin

imo, the plugin should provide a data source that can then be added via a config file.

this should be done with an approach similar to other config-file-driven JS ecosystem tools/frameworks.

we can also add other things to this config file (things like chunk size, pre-processor, etc)

example:

```js
// ingest.config.js
import { makeSourceFunc } from "data-source-package";
import { makeSource2Func } from "./local-data-source";

const sources = {
  "source1": async () => makeSourceFunc(),
  "source2": async () => makeSource2Func(),
  // ...
};

export default { sources /* ...any other config we add */ };
```
  • CLI gets data sources from the plugin

see above for how i think the CLI should get data sources from the plugin

cbush commented 1 year ago

That looks great - config.js is then actually responsible for "loading plugins" rather than the CLI. CLI just looks for a config.js. Perfect!
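A minimal sketch of that division of labor, with hypothetical names (`IngestConfig`, `resolveSources` are illustrative, not the actual CLI API): the config module does the plugin imports as a side effect of being loaded, and the CLI only has to instantiate what the config exports.

```typescript
// Hypothetical sketch: the CLI never knows about plugins directly.
// The config module imports plugin packages; the CLI just consumes
// the factory functions the config exports.

type Page = { url: string; body: string };

type DataSource = {
  name: string;
  fetchPages(): Promise<Page[]>;
};

interface IngestConfig {
  // Each entry lazily constructs a DataSource, so a plugin is only
  // instantiated when the CLI actually needs that source.
  sources: Record<string, () => DataSource | Promise<DataSource>>;
}

// Instantiate every configured data source, or a named subset
// (e.g. `ingest snooty-docs` vs `ingest all`).
async function resolveSources(
  config: IngestConfig,
  names?: string[]
): Promise<DataSource[]> {
  const entries = Object.entries(config.sources).filter(
    ([name]) => !names || names.includes(name)
  );
  return Promise.all(entries.map(([, make]) => make()));
}
```

The CLI itself would then only need to dynamically import `ingest.config.js` from the working directory and call `resolveSources` on its default export.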

jhyearsley commented 1 year ago

@mongodben you mentioned

we can also add other things to this config file (things like chunk size, pre-processor, etc)

In that case would you expect that the other things would be shared across all data sources? I'd think they should be configurable per data source.

I'm not familiar with Bluehawk and am a bit hazy on when the "npm install ingest" part would happen... is the idea that a project not currently in the monorepo could install the ingest tool and configure it for their own project? I feel like there are two problems we are talking about:

  1. changing the current structure of the monorepo
  2. and making it easy for other people to independently use / customize the ingest library

I think the problems are def related but 2 is what adds most of the "what-ifs" complexity in my mind

cbush commented 1 year ago

Yes, the idea is that any project could add their own data sources without actually being a part of the monorepo. But we're not talking about changing the structure of the monorepo. Lerna allows us to publish the ingest package independently. (See for example how we do it for the UI: https://github.com/mongodb/chatbot/blob/main/package.json#L20)

In effect, ingest is just a tool you can use. It comes with some "standard" data sources and you can add your own custom ones. You can also configure it for your use case.

jhyearsley commented 1 year ago

@cbush that makes sense, thanks for clarifying. And by "changing the structure of the monorepo" I just meant adding new config files and refactoring how the data sources are stored / accessed.

The complexity I'm thinking of, in the context of extending the tool so other projects can use it, is having a schema that is standardized but also flexible (e.g. maybe some sources are in a database and some are scraped from the web). The reason I'm focusing on the schema is that I'm trying to avoid breaking the existing types in the project, and it looks to me like there are some assumptions about injecting data into the data source config, e.g. baseUrl with Snooty but not with Dev Center. This is related to my comment on the PR about importing the LocallySpecifiedSnootyProjectConfig.
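One way to get "standardized but flexible" would be a discriminated union over source types, so each source can carry its own fields (like baseUrl) while the tool only depends on the shared ones. A hypothetical sketch (these type names and fields are illustrative, not from the project):

```typescript
// Hypothetical sketch of a standardized-but-flexible config schema.
// Shared fields (`type`, `name`) are common to every source; the
// discriminant narrows to source-specific fields.

type SnootySourceConfig = {
  type: "snooty";
  name: string;
  baseUrl: string; // Snooty needs a baseUrl injected
};

type DevCenterSourceConfig = {
  type: "devCenter";
  name: string;
  databaseName: string; // Dev Center reads from a database, no baseUrl
};

type SourceConfig = SnootySourceConfig | DevCenterSourceConfig;

// Tooling can switch on the discriminant to handle each shape safely.
function describe(config: SourceConfig): string {
  switch (config.type) {
    case "snooty":
      return `${config.name} @ ${config.baseUrl}`;
    case "devCenter":
      return `${config.name} (db: ${config.databaseName})`;
  }
}
```

TypeScript then enforces that each source type only exposes the fields it actually has, which avoids the "inject baseUrl everywhere" assumption.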

mongodben commented 1 year ago

it looks to me like there are some assumptions on injecting data into the data source config e.g. baseUrl with Snooty but not with Dev Center

these aspects are data source specific, so that's ok. different sources have different needs. for example, one might need us to provide baseUrl, while another may not.

you can think of DataSource as an abstract interface that is constructed with some arbitrary configuration, and then has the following shape:

```ts
/**
  Represents a source of page data.
 */
export type DataSource = {
  /**
    The unique name among registered data sources.
   */
  name: string;

  /**
    Fetches pages in the data source.
   */
  fetchPages(): Promise<Page[]>;
};
```

in our project we use makeDataSourceType(sourceSpecificConfig) functions to construct our DataSource instances, but you could also use a class to the same effect, such as new DataSourceType(sourceSpecificConfig).
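a minimal sketch of such a maker function, with a hypothetical source type and config (not from the actual codebase):

```typescript
// Hypothetical sketch of a makeDataSourceType(sourceSpecificConfig)
// style constructor. The config shape and fetch logic are illustrative.

type Page = { url: string; body: string; sourceName: string };

type DataSource = {
  name: string;
  fetchPages(): Promise<Page[]>;
};

interface ExampleSourceConfig {
  name: string;
  baseUrl: string; // source-specific: another source might not need this
}

function makeExampleDataSource(config: ExampleSourceConfig): DataSource {
  return {
    name: config.name,
    async fetchPages() {
      // A real implementation would fetch or scrape here; this stub
      // just shows how source-specific config flows into the instance.
      return [
        {
          url: `${config.baseUrl}/index`,
          body: "...",
          sourceName: config.name,
        },
      ];
    },
  };
}
```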

if you look at the latest version of the code on main, you can see some more code examples of different DataSource types.

cbush commented 1 year ago

As Ben said, the DataSource interface is actually extremely flexible. You just need to provide "pages" from the source. How that works exactly depends on your specific use case. We're providing builders like a GitDataSource that fetches a repo and does something, but you can start from scratch.

The major limitation that I am very conscious of and would probably like to address post-launch is that a data source has to return all pages at once, which means an entire data source has to be loaded into memory at once. We can change the interface to allow returning an async iterator so that HUGE data sources can return one page at a time.
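A sketch of what that change could look like, assuming we swap `Promise<Page[]>` for an `AsyncIterable<Page>` (hypothetical, not the current interface):

```typescript
// Hypothetical streaming variant of DataSource: fetchPages returns an
// async iterable, so huge sources can yield one page at a time instead
// of materializing everything in memory.

type Page = { url: string; body: string };

type StreamingDataSource = {
  name: string;
  fetchPages(): AsyncIterable<Page>;
};

// An async generator keeps only one page in memory at a time.
const hugeSource: StreamingDataSource = {
  name: "huge-source",
  async *fetchPages() {
    for (let i = 0; i < 3; i++) {
      // A real source would fetch the next page from disk or the
      // network here before yielding it.
      yield { url: `https://example.com/page-${i}`, body: `page ${i}` };
    }
  },
};
```

A consumer would then `for await` over the pages instead of awaiting one big array.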

mongodben commented 1 year ago

a couple final thoughts on this topic before i head off for wedding/honeymoon for the next few weeks. trust y'all to take it from here!

project structure

for project structure in the repo we can do something like:

```
.
|_ ingest
|  |_ all the general CLI stuff used by other projects. exports CLI executable and also
|     typescript types/some data sources. slight refactor of what we have there now
|_ docs-chatbot-ingest (more on this below)
|  |_ ingest.config.js // project specific config
|  |_ package.json
|  |_ src/
|  |  |_ LocallyDefinedDataSource.ts
|  |_ build/
|_ ...other projects that use ingest service. customer success, TSE, etc.
```

docs-chatbot-ingest package

ingest.config.js

```js
import { makeSnootyDataSource } from "ingest";
import { makeLocallyDefinedDataSource } from "./build/LocallyDefinedDataSource";
// we don't have any plans for publishing data sources to npm,
// but it'd be possible given this architecture
import { makeNpmPackagedDataSource } from "npm-packaged-data-source";

export default {
  dataSources: {
    "snooty-docs": () => makeSnootyDataSource(/* some specific config */),
    "other-source": () => makeLocallyDefinedDataSource(/* some specific config */),
    "other-source2": () => makeNpmPackagedDataSource(/* some specific config */),
  },
};
```

package.json

```jsonc
{
  // other config
  "scripts": {
    "ingest-data": "ingest all"
    // other scripts
  },
  "dependencies": {
    "ingest": "file:../ingest"
  }
}
```