Closed jhyearsley closed 1 year ago
@mongodben interesting point you make about not being convinced that the ingest refactor is the optimal design; I have some similar hazy intuition. It sounds like with the design we're discussing, adding a new source would require touching:

- `projectSources.ts`
- `ingest.config.js`
- `customer-success.ingest.config.js`

It does seem like we should be able to do it in fewer steps.
Here's how I see it going — will need to flesh it out, but this is my basic thinking:
- We implement plugin loader a la Bluehawk
makes sense to me that other data sources can be plugins
- You npm install ingest as a dependency, it provides the CLI too
👍
- Implement your data source(s) in the plugin
imo, the plugin should provide a data source that can then be added via a config file.
this should follow a similar approach to JS ecosystem tools/frameworks like:

- `eslint.config.js` (new flat config version)
- `gatsby-config.js`
- `next.config.js`
we can also add other things to this config file (things like chunk size, pre-processor, etc)
example:
```js
// ingest.config.js
import { makeSourceFunc } from "data-source-package";
import { makeSource2Func } from "./local-data-source";

const sources = {
  "source1": async () => makeSourceFunc(),
  "source2": async () => makeSource2Func(),
  // ...
};

export default { sources, /* ...any other config we add */ };
```
- CLI gets data sources from the plugin
see above for how i think CLI should get data source from plugin
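To make the "CLI gets data sources from the config" idea concrete, here's a rough sketch of what the lookup could look like. All names here (`resolveSources`, `IngestConfig`, the simplified `fetchPages` return type) are assumptions for illustration, not the real API:

```ts
// Hypothetical sketch: the CLI receives the loaded config object and
// resolves the requested data sources from its `sources` map.
type DataSource = {
  name: string;
  fetchPages(): Promise<string[]>; // Page type simplified for illustration
};

type IngestConfig = {
  sources: Record<string, () => Promise<DataSource>>;
};

// Resolve the named sources (or all of them when no names are given).
export async function resolveSources(
  config: IngestConfig,
  names?: string[]
): Promise<DataSource[]> {
  const requested = names ?? Object.keys(config.sources);
  return Promise.all(
    requested.map((name) => {
      const makeSource = config.sources[name];
      if (!makeSource) {
        throw new Error(`Unknown data source: ${name}`);
      }
      return makeSource();
    })
  );
}
```

With this shape, `ingest all` would call `resolveSources(config)` and a per-source command would pass a name list — the CLI never needs to know how any individual source is constructed.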
That looks great - config.js is then actually responsible for "loading plugins" rather than the CLI. CLI just looks for a config.js. Perfect!
@mongodben you mentioned
we can also add other things to this config file (things like chunk size, pre-processor, etc)
In that case would you expect that the other things would be shared across all data sources? I'd think they should be configurable per data source.
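One hypothetical way per-source options could look — the shape and names (`SourceOptions`, `chunkSize`, `preProcessor`) are illustrative, not a real schema:

```ts
// Hypothetical: each config entry pairs a source constructor with its
// own options, instead of sharing one chunkSize/preProcessor globally.
type SourceOptions = {
  chunkSize?: number;
  preProcessor?: (text: string) => string;
};

type SourceEntry = {
  makeSource: () => Promise<unknown>;
  options?: SourceOptions;
};

const sources: Record<string, SourceEntry> = {
  "source1": {
    makeSource: async () => ({ name: "source1" }),
    options: { chunkSize: 500, preProcessor: (text) => text.trim() },
  },
  "source2": {
    // No options: falls back to tool-wide defaults.
    makeSource: async () => ({ name: "source2" }),
  },
};

export default { sources };
```

Sources that omit `options` would pick up tool-wide defaults, so shared and per-source configuration aren't mutually exclusive.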
I'm not familiar with Bluehawk and am a bit hazy on when the "npm install ingest" part would happen... is the idea that a project not currently in the monorepo could install the ingest tool and configure for their own project? I feel like there are two problems we are talking about
I think the problems are def related but 2 is what adds most of the "what-ifs" complexity in my mind
Yes, the idea is that any project could add their own data sources without actually being a part of the monorepo. But we're not talking about changing the structure of the monorepo. Lerna allows us to publish the ingest package independently. (See for example how we do it for the UI: https://github.com/mongodb/chatbot/blob/main/package.json#L20)
In effect, ingest is just a tool you can use. It comes with some "standard" data sources and you can add your own custom ones. You can also configure it for your use case.
@cbush that makes sense, thanks for clarifying. And by "changing the structure of the monorepo" I just meant adding new config files and refactoring how the data sources are stored / accessed.
The complexity I'm thinking of in the context of extending so other projects can use the tool is having a schema that is standardized but also flexible (e.g. maybe some sources are in a database and some sources are scraped from the web). The reason I'm focusing on the schema is because I'm trying to avoid breaking the existing types in the project and it looks to me like there are some assumptions on injecting data into the data source config e.g. baseUrl
with Snooty but not with Dev Center. This is related to my comment on the PR about importing the LocallySpecifiedSnootyProjectConfig
it looks to me like there are some assumptions on injecting data into the data source config e.g. baseUrl with Snooty but not with Dev Center
these aspects are data source specific, so that's ok. different sources have different needs. for example, 1 might need us to provide baseUrl, while another may not.
you can think of `DataSource` as an abstract interface that is constructed with some arbitrary configuration, and then has the following shape:
```ts
/**
  Represents a source of page data.
 */
export type DataSource = {
  /**
    The unique name among registered data sources.
   */
  name: string;

  /**
    Fetches pages in the data source.
   */
  fetchPages(): Promise<Page[]>;
};
```
in our project we use `makeDataSourceType(sourceSpecificConfig)` functions to construct our `DataSource` instances, but you could also use a class to the same effect, such as `new DataSourceType(sourceSpecificConfig)`.

if you look at the latest version of the code on main, you can see some more code examples of different `DataSource` types.
As Ben said, the DataSource interface is actually extremely flexible. You just need to provide "pages" from the source. How that works exactly depends on your specific use case. We're providing builders like a GitDataSource that fetches a repo and does something, but you can start from scratch.
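To show how little is required to "start from scratch," here's a minimal sketch — `makeStaticDataSource` and the simplified `Page` type are hypothetical, not from the repo:

```ts
// Hypothetical "from scratch" data source. Page is simplified here
// for illustration.
type Page = { url: string; body: string };

type DataSource = {
  name: string;
  fetchPages(): Promise<Page[]>;
};

// Anything that produces the DataSource shape works; here the "source"
// is just a static in-memory list of pages.
export const makeStaticDataSource = (pages: Page[]): DataSource => ({
  name: "static-source",
  fetchPages: async () => pages,
});
```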
The major limitation that I am very conscious of and would probably like to address post-launch is that a data source has to return all pages at once, which means an entire data source has to be loaded into memory at once. We can change the interface to allow returning an async iterator so that HUGE data sources can return one page at a time.
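A sketch of what that async-iterator variant might look like — this is a hypothetical future interface, not current code, and the names (`StreamingDataSource`, `collectUrls`) are made up for illustration:

```ts
// Hypothetical streaming variant of the DataSource interface.
// Instead of materializing every page in memory, fetchPages()
// returns an AsyncIterable that yields one page at a time.
type Page = { url: string; body: string };

export type StreamingDataSource = {
  name: string;
  fetchPages(): AsyncIterable<Page>;
};

// An async generator lets a huge source yield pages lazily rather
// than returning one big array.
export const makeExampleStreamingSource = (): StreamingDataSource => ({
  name: "example-streaming-source",
  async *fetchPages() {
    for (let i = 0; i < 3; i++) {
      yield { url: `https://example.com/page-${i}`, body: `page ${i}` };
    }
  },
});

// Consumers iterate with for-await-of, processing each page as it
// arrives instead of holding the whole source in memory.
export async function collectUrls(source: StreamingDataSource): Promise<string[]> {
  const urls: string[] = [];
  for await (const page of source.fetchPages()) {
    urls.push(page.url);
  }
  return urls;
}
```

A nice property of this shape is that an array-returning source could be wrapped in a trivial generator, so the migration wouldn't have to break existing sources all at once.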
a couple final thoughts on this topic before i head off for wedding/honeymoon for the next few weeks. trust y'all to take it from here!
for project structure in the repo we can do like:
```
.
|_ ingest
|  |_ all the general CLI stuff used by other projects. exports CLI executable and also
|     typescript types/some data sources. slight refactor of what we have there now
|_ docs-chatbot-ingest (more on this below)
|  |_ ingest.config.js // project specific config
|  |_ package.json
|  |_ src/
|  |  |_ LocallyDefinedDataSource.ts
|  |_ build/
|_ ...other projects that use ingest service. customer success, TSE, etc.
```
`docs-chatbot-ingest` package

`ingest.config.js`:
```js
import { makeSnootyDataSource } from 'ingest';
import { makeLocallyDefinedDataSource } from './build/LocallyDefinedDataSource';
// we don't have any plans for publishing data sources to npm,
// but it'd be possible given this architecture
import { makeNpmPackagedDataSource } from 'npm-packaged-data-source';

export default {
  dataSources: {
    "snooty-docs": () => makeSnootyDataSource(/* some specific config */),
    "other-source": () => makeLocallyDefinedDataSource(/* some specific config */),
    "other-source2": () => makeNpmPackagedDataSource(/* some specific config */),
  },
};
```
`package.json`:

```jsonc
{
  // other config
  "scripts": {
    "ingest-data": "ingest all",
    // other scripts
  },
  "dependencies": {
    "ingest": "file:../ingest"
  }
}
```
Moved the static JSON to a config file. I don't think it's the perfect solution, but I think it's an incremental step toward the right one.