llm-tools / embedJs

A NodeJS RAG framework to easily work with LLMs and embeddings
https://llm-tools.mintlify.app/get-started/introduction
Apache License 2.0

Better way to manage dependencies? #57

Closed adhityan closed 1 month ago

adhityan commented 6 months ago

I have been considering this for a while now. The more loaders, LLMs and embedding models we add, the more important this question becomes. Currently the library already separates out the dependencies for the vector databases using clever package.json config, leveraging peer and optional dependencies.

This does not scale well, although it is not a problem today. At the same time, it is becoming important that we break out the various parts even more. I don't want a lot of dependencies being added in by default, since that also makes vulnerabilities harder to address. Ultimately people will only use some of the choices and they should only have those relevant modules.
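For readers unfamiliar with the peer/optional approach, here is a minimal sketch of the idea, written as a TypeScript object that mirrors npm's `peerDependencies` and `peerDependenciesMeta` fields. The package names and versions are placeholders, not the library's real manifest, and `optionalPeers` is a hypothetical helper added only for illustration.

```typescript
// Shape of the relevant package.json fields (npm's peerDependencies / peerDependenciesMeta).
interface Manifest {
    peerDependencies?: { [name: string]: string };
    peerDependenciesMeta?: { [name: string]: { optional?: boolean } | undefined };
}

// Hypothetical manifest: package names and versions are placeholders.
const manifest: Manifest = {
    peerDependencies: {
        'example-vector-db-a': '^1.0.0',
        'example-vector-db-b': '^2.0.0',
        'example-required-peer': '^3.0.0',
    },
    peerDependenciesMeta: {
        // Marked optional, so consumers can leave these out if they do not use them.
        'example-vector-db-a': { optional: true },
        'example-vector-db-b': { optional: true },
    },
};

// Returns the peer dependencies that a consumer may skip installing.
function optionalPeers(m: Manifest): string[] {
    const meta = m.peerDependenciesMeta ?? {};
    return Object.keys(m.peerDependencies ?? {}).filter((name) => meta[name]?.optional === true);
}

console.log(optionalPeers(manifest));
```

Every new integration adds another entry to both maps, which is where the scaling concern comes from.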

Here are our options -

  1. Continue with the current package.json based solution. We will have more optional dependencies and we will address this by more detailed documentation.

  2. Switch to a monorepo. A core package (embedJs) and several sub-packages (like embedjs-openai, embedjs-slack-loader, etc)

Clearly option 2 is the long-term direction, but it might be more work than necessary today.
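To make option 2 concrete, here is a rough sketch of how the split could look in code: the core package owns the interfaces and orchestration, and each sub-package ships one implementation along with its own third-party dependencies. All names here (`Chunk`, `Loader`, `Pipeline`, `TextLoader`) are hypothetical and are not the library's actual API.

```typescript
// --- core package (hypothetical API, not embedJs's actual interfaces) ---
interface Chunk {
    content: string;
    source: string;
}

interface Loader {
    load(): Chunk[];
}

class Pipeline {
    private readonly loaders: Loader[] = [];

    addLoader(loader: Loader): this {
        this.loaders.push(loader);
        return this;
    }

    // Collects chunks from every registered loader, in registration order.
    collect(): Chunk[] {
        let chunks: Chunk[] = [];
        for (const loader of this.loaders) {
            chunks = chunks.concat(loader.load());
        }
        return chunks;
    }
}

// --- sub-package (would depend on core and bundle its own third-party deps) ---
class TextLoader implements Loader {
    constructor(
        private readonly text: string,
        private readonly chunkSize: number = 20,
    ) {}

    // Splits the text into fixed-size chunks.
    load(): Chunk[] {
        const chunks: Chunk[] = [];
        for (let i = 0; i < this.text.length; i += this.chunkSize) {
            chunks.push({ content: this.text.slice(i, i + this.chunkSize), source: 'text' });
        }
        return chunks;
    }
}

const chunks = new Pipeline().addLoader(new TextLoader('abcdefghij', 4)).collect();
console.log(chunks.map((c) => c.content));
```

With this shape, the core only depends on the `Loader` interface, so adding a loader never touches the core's dependency list.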

What do you all think?

parzival418 commented 6 months ago

I can't recommend NX enough, and getting started with it incrementally is faster than you may think. Their ecosystem enables you to spin up new packages and dependencies rapidly and easily, breaking down your architecture into easily manageable small pieces.

I would be interested in contributing to help this endeavour. I am looking to use EmbedJS right now for my application and migrating off from a kind of hacky use of Embedchain running with JsPyBridge. There are some functionalities I need from EmbedJS which I will have to add and am happy to contribute.

adhityan commented 6 months ago

Yes, I have been considering switching to NX primarily. My only concern is asking people to install even more packages per loader. But this removes the need to install 3rd party packages (like for vector databases) as those dependencies will be packaged into the corresponding sub-package published.

I haven't used NX before (the last time I had to deal with Node monorepos, it was the era of lerna), but let me take a first stab at it. You can review the PR and let me know if there are best practices that I missed.

Do let me know which missing functionality would be useful for you.

parzival418 commented 6 months ago

Sounds good. Make heavy use of generators for all the packages, as they make it dead simple to spin up new ones with all configs working. Also the VS Code NX extension is phenomenal. I switched over to NX a few years ago after having used Lerna for a while. I discovered NX because Nrwl took over management of Lerna. It is excellent.

As to functionality, there are a couple things. Searching and querying via metadata filters is huge. I also tend to use this kind of abstraction a bit like a vector ORM, so being able to raw search and get all the documents back is huge rather than running through the query method and having it use the internal LLM.
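As an illustration of what is being asked for here, the following is a minimal in-memory sketch of a raw similarity search with an exact-match metadata filter and no LLM involved. `InMemoryStore`, `StoredDoc` and the filter semantics are assumptions made for the example, not the library's API.

```typescript
type Metadata = { [key: string]: string | number | boolean };

interface StoredDoc {
    id: string;
    text: string;
    vector: number[];
    metadata: Metadata;
}

interface Hit {
    doc: StoredDoc;
    score: number;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hypothetical store: raw search returns documents directly, bypassing any LLM.
class InMemoryStore {
    private readonly docs: StoredDoc[] = [];

    add(doc: StoredDoc): void {
        this.docs.push(doc);
    }

    // Keeps documents whose metadata equals every key in `filter`,
    // then returns the top `k` by cosine similarity.
    search(queryVector: number[], k: number, filter: Metadata = {}): Hit[] {
        const keys = Object.keys(filter);
        return this.docs
            .filter((doc) => keys.every((key) => doc.metadata[key] === filter[key]))
            .map((doc) => ({ doc, score: cosine(queryVector, doc.vector) }))
            .sort((x, y) => y.score - x.score)
            .slice(0, k);
    }
}
```

A call such as `store.search(queryVector, 5, { lang: 'en' })` would then return the matching documents and their scores, which the caller can use like rows from an ORM.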

Also, auto-detecting the file format for adding would be huge. Embedchain in Python has some good code there which could easily be ported to TS using an LLM.

Tracking the individual data sources and storing them would be great too. Embedchain keeps a running DB of the files and data sources so you can query for the files, or for the documents.
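One possible reading of this request, sketched below, is a small registry that records each data source as it is added, together with the ids of the chunks it produced. The `SourceRegistry` class and its fields are hypothetical; nothing here reflects how Embedchain or this library actually store the information.

```typescript
interface SourceRecord {
    id: string;
    kind: string;
    location: string;
    chunkIds: string[];
}

// Hypothetical registry of data sources added so far.
class SourceRegistry {
    private records: SourceRecord[] = [];

    // Registers a source; adding the same kind and location again replaces the old record.
    register(kind: string, location: string, chunkIds: string[]): SourceRecord {
        const id = `${kind}:${location}`;
        const record: SourceRecord = { id, kind, location, chunkIds: chunkIds.slice() };
        this.records = this.records.filter((r) => r.id !== id);
        this.records.push(record);
        return record;
    }

    // Lists every registered source.
    list(): SourceRecord[] {
        return this.records.slice();
    }

    // Returns the chunk ids produced by one source, or an empty list if it is unknown.
    chunksFor(id: string): string[] {
        const match = this.records.filter((r) => r.id === id)[0];
        return match ? match.chunkIds.slice() : [];
    }
}
```

In a real setup the records would be persisted next to the vector database rather than kept in memory.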

adhityan commented 6 months ago

I came across NX from their takeover of lerna as well. However, moving this lib to NX will take some time as I am not familiar with NX at the moment. I do think the monorepo route is the long term direction to go in but it will come up after some core functional items are added in.

Some of the functionalities you mention are aligned with the roadmap I have.

I am currently adding support for auto-detecting files and URLs to auto load using relevant loaders. So in principle, you should be able to pass a path/URL/a json (etc) and the library should be able to auto load the content identifying the correct loader. You should also be able to pass a directory path and load all files in it.
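A rough sketch of the detection step described above, under simplifying assumptions: URLs are recognised by their scheme, inline JSON by attempting to parse it, and files by extension. The `SourceKind` values and the extension table are hypothetical and do not reflect what the PR implements; directory handling is left out because it needs a filesystem check.

```typescript
type SourceKind = 'web' | 'json' | 'pdf' | 'csv' | 'text' | 'unknown';

// Hypothetical mapping from file extension to the kind of loader to use.
const EXTENSION_KINDS: { [ext: string]: SourceKind | undefined } = {
    '.pdf': 'pdf',
    '.csv': 'csv',
    '.txt': 'text',
    '.md': 'text',
    '.json': 'json',
};

// True when the input itself is a JSON object or array literal.
function isJsonText(input: string): boolean {
    const trimmed = input.trim();
    if (!/^[\[{]/.test(trimmed)) return false;
    try {
        JSON.parse(trimmed);
        return true;
    } catch {
        return false;
    }
}

// Picks a source kind: URL scheme first, then inline JSON, then file extension.
function detectSourceKind(input: string): SourceKind {
    if (/^https?:\/\//i.test(input)) return 'web';
    if (isJsonText(input)) return 'json';
    const dot = input.lastIndexOf('.');
    const ext = dot >= 0 ? input.slice(dot).toLowerCase() : '';
    return EXTENSION_KINDS[ext] ?? 'unknown';
}
```

The order of the checks matters: a URL ending in `.pdf` is treated as a web source here, and a real implementation might instead inspect the response content type.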

This is being added in this PR https://github.com/llm-tools/embedJs/pull/66 and I hope to get it out over the next few days after more testing.

The next major feature to add in is going to be searching via metadata.

Could you tell me more about the last feature? Is that just a list of files / data sources that were added so far to the vector database?

davetaz commented 5 months ago

All for moving to monorepo.

(Bias declaration: My only experience with NX, on another project, is that people gave up.)

Do what you know as you are the primary maintainer.

I'm currently working on a project that uses embedJS heavily and will be contributing back, so knowing how best to do this would be great, e.g. when we add loaders or embedders.

Currently the biggest thing we have added is an abstraction of the whole conversations layer, so that conversation history can also be stored in memory or in a database and persist. So far this has required changes across the initialiser and base-model, so it wouldn't be as easy to split.
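For context, an abstraction of this kind could look roughly like the sketch below: one interface for conversation history, with an in-memory implementation that a database-backed one could later replace. The names (`ConversationStore`, `InMemoryConversationStore`) are hypothetical and this is not the code described in the comment. The methods are synchronous to keep the example short; a database-backed version would most likely return Promises.

```typescript
interface Message {
    role: 'user' | 'assistant';
    content: string;
}

// Hypothetical storage contract for conversation history.
interface ConversationStore {
    append(conversationId: string, message: Message): void;
    history(conversationId: string): Message[];
    clear(conversationId: string): void;
}

// Keeps history in process memory; it is lost when the process exits.
class InMemoryConversationStore implements ConversationStore {
    private readonly conversations: { [id: string]: Message[] | undefined } = {};

    append(conversationId: string, message: Message): void {
        const existing = this.conversations[conversationId] ?? [];
        existing.push(message);
        this.conversations[conversationId] = existing;
    }

    // Returns a copy so callers cannot mutate the stored history.
    history(conversationId: string): Message[] {
        return (this.conversations[conversationId] ?? []).slice();
    }

    clear(conversationId: string): void {
        delete this.conversations[conversationId];
    }
}
```

Code that depends only on `ConversationStore` would not need to change when the storage backend does, which is the part that could live in its own package.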

adhityan commented 5 months ago

Yes; I explored NX over the past few days. I am beginning to feel that Turborepo is easier to get started with for this project compared to NX, and they have similar feature sets. I will likely move in that direction.

You can contribute in a number of ways -

  1. If you add a new loader / embedder that you want to contribute back - you can send in a PR for that. Contributing back helps the community but also helps you get bugfixes and support for your code.
  2. If you find bugs somewhere, you can contribute by creating an issue with details that will help identify and address the issue. You can also send in a PR for a bugfix and get it merged upstream on priority.
  3. You can also suggest new features to me (like allowing conversation history to be stored in memory/database). We will need to scope these out a bit more to make sure they are broadly useful. After that, the community (me and anyone else, including you) can create PRs to add support for them. This gets us built-in support for your needs right in the library.

adhityan commented 1 month ago

The library has now switched over to NX. Closing this thread.