The scope of this repo is far beyond what can be imagined

This is an amazing repo, and I got so excited to see it because I have been thinking about similar ideas lately. I took a quick look through the repo, and it seems the main aim (for now) is to support scikit-learn datasets, fuse them with GPT prompts, and have GPT models do the work. But I think there is more scope than just this.

scikit-learn has been a huge ecosystem for ML, and it is (and will continue to be) used by most organizations. When it comes to large tabular data (100M+ rows), I do not think scikit-llm can work. Yet many real-world problems involve this kind of tabular data with varying patterns, and those problems might not be solved with 'just' an LLM.

Also, scikit-learn models are much less of a black box than these LLMs and are therefore easier to interpret. With a black-box model we might not be able to show locally why it produces a given behavior for some data. These models are also not deterministic: change one small part of the input and the whole black box can head in a different direction.

However, think of it like this, in terms of a BFF (Backend for Frontend) design: instead of making GPT the backend of the computation, make it the front end. Provide the dataset link, a sample of the dataset, the problem statement, and the metadata. Then, using LangChain-like tools and the existing scikit-learn ecosystem, we can tell these models to run the computation on bare scikit-learn and come back with the predictions. If something shows anomalous behaviour, for example, we can use explainable-AI tools like LIME/SHAP with GPT on top of them, and maybe generate interpretable reports with the local 'fit' curves and graphs. In this way we can automate a lot of the process while keeping reliability in check.

This could then be used and 'deployed' in real-world systems, because the heavy lifting is still done by scikit-learn, just with a GPT front end. It all boils down to fitting the right information and the right instructions into the right place to get the results we want.

Some examples.

I provide a dataset link, the problem statement, and the metadata; it builds the ML model, stores it somewhere, and provides a training report for each stage of ML training.

Then I provide new data, for example: I have a user (not present in the training data) with these 'unseen' features, what will the prediction be for that user? On the backend it might run the prediction pipeline, and then we can ask a lot of follow-ups like:

Why did you predict this?
What if the input had been this instead, and how would the output have changed?
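As a rough illustration of how those two follow-ups could be answered by the backend rather than by the LLM, here is a minimal sketch. It assumes the trained model is linear, and it uses a hypothetical `LinearModel` class with made-up coefficients as a stand-in for the `coef_` and `intercept_` of a fitted scikit-learn linear model. The feature names and numbers are invented for the example; a non-linear model would need a tool like LIME or SHAP instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """Hypothetical stand-in for a fitted linear model (its coef_ and intercept_)."""
    feature_names: tuple
    coef: tuple
    intercept: float

    def predict(self, row: dict) -> float:
        # Linear prediction: intercept + sum of coefficient * feature value.
        return self.intercept + sum(
            c * row[name] for name, c in zip(self.feature_names, self.coef)
        )


def explain(model: LinearModel, row: dict) -> dict:
    """'Why did you predict this?': each feature's additive contribution."""
    return {name: c * row[name] for name, c in zip(model.feature_names, model.coef)}


def what_if(model: LinearModel, row: dict, **changes) -> float:
    """'What if the input had been this?': change in prediction after edits."""
    altered = {**row, **changes}
    return model.predict(altered) - model.predict(row)


# Made-up coefficients and an unseen user, purely for illustration.
model = LinearModel(("age", "monthly_visits"), (0.5, 2.0), 1.0)
user = {"age": 30, "monthly_visits": 4}

print(model.predict(user))                     # 24.0
print(explain(model, user))                    # {'age': 15.0, 'monthly_visits': 8.0}
print(what_if(model, user, monthly_visits=6))  # 4.0
```

The numbers returned here are computed deterministically by the backend; the GPT front end would only need to turn them into a readable answer.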

Even systems that use real-time ML could incorporate this, because real-time ML depends heavily on interpretable, lightweight models, as speed and reliability are both important. GPT as a query-engine interface on top could be used for enhanced telemetry or something else, like automatically describing user behavior from data drift. All we might have to do is query in natural language.

That way we do not use a huge number of tokens, we get less black-box results, and we have a use case for safe, less-hallucinating AI. Please share your thoughts in general. I know this description got really long, but let me know; I am always up to discuss this more if my thinking is aligned with yours.

Thanks

Hi @Anindyadeep

Thank you for your suggestion. It looks very interesting, but I have to admit this goes well beyond the scope of scikit-llm. First of all, scikit-llm does not (and is not meant to) support arbitrary tabular datasets. Instead we are mainly focusing on text classification/transformation.

From your description it seems like you’re suggesting something more similar to a scikit-learn version of pandas-ai. We had some thoughts in this direction, but discarded them for 2 reasons: 1) as already mentioned, this goes beyond the scope; 2) these kinds of pipelines are more prone to failures and much more difficult to maintain, especially when the APIs of the libraries change and the LLM is unaware of the changes. For example, this is the primary reason pandas-ai is unable to support pandas>=2.0.0 for now.

In general, we are open to the idea of extending the scope of scikit-llm, but maybe there are other, more suitable places for implementing such functionality. In addition to scikit-llm we have 2 other projects:

Falcon AutoML: this was meant as a general ML library that automates the pipeline from data ingestion to producing a production-ready model in ONNX format. Falcon can be extended with both custom training pipelines (e.g. LLM guided pipeline) as well as custom integrations (e.g. LLM based explainability tool). I have to admit that we did not have sufficient amount of time to work on this library, so for now only a skeleton with the most basic functionality is implemented, but we have lots of updates planned (hopefully before Christmas). Overall, the scope of the library is virtually unrestricted.
AgentDingo: an extremely small library for building GPT based agents. We are not planning to add too many things to the “core” as we want to keep it both minimalistic and unopinionated. However, we can still add lots of extra functionality in addition to the core submodule.

Hence, I think for the functionality you are describing it might make sense to leverage the other libraries a bit more while not changing the original scope of scikit-llm too much. Also, if we decide to implement something in this direction, I would not want to let the LLM generate any code for direct execution (both for the reasons mentioned in the second paragraph and potential security threats if the library is not used locally but let’s say integrated into a web app), but rather come up with some more abstract grammar that could be used, which is a challenging task on its own.
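To make the "abstract grammar" idea a bit more concrete, here is a minimal sketch, not an actual scikit-llm, Falcon, or AgentDingo API. It assumes the LLM is asked to emit a JSON list of steps, and that a hypothetical whitelist (`ALLOWED_STEPS`, with invented step and parameter names) defines everything it may request. The plan is only validated as data; nothing the LLM writes is ever executed, and mapping validated steps to real estimators is left out.

```python
import json

# Hypothetical whitelist: the only steps and typed parameters a plan may use.
ALLOWED_STEPS = {
    "standard_scaler": {},
    "logistic_regression": {"C": float, "max_iter": int},
    "random_forest": {"n_estimators": int, "max_depth": int},
}


def _coerce(value, expected):
    """Return value as `expected` type, or raise ValueError (bools are rejected)."""
    if isinstance(value, bool):
        raise ValueError(f"boolean not allowed where {expected.__name__} is expected")
    if expected is float and isinstance(value, (int, float)):
        return float(value)
    if expected is int and isinstance(value, int):
        return value
    raise ValueError(f"expected {expected.__name__}, got {value!r}")


def parse_plan(raw: str) -> list:
    """Validate an LLM-produced JSON plan; reject anything outside the whitelist."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    if not isinstance(plan, list) or not plan:
        raise ValueError("plan must be a non-empty list of steps")

    validated = []
    for step in plan:
        if not isinstance(step, dict) or "step" not in step or set(step) - {"step", "params"}:
            raise ValueError(f"malformed step: {step!r}")
        name = step["step"]
        if not isinstance(name, str) or name not in ALLOWED_STEPS:
            raise ValueError(f"unknown step: {name!r}")
        params = step.get("params", {})
        if not isinstance(params, dict):
            raise ValueError(f"params must be an object: {params!r}")
        clean = {}
        for key, value in params.items():
            if key not in ALLOWED_STEPS[name]:
                raise ValueError(f"unknown parameter {key!r} for step {name!r}")
            clean[key] = _coerce(value, ALLOWED_STEPS[name][key])
        validated.append((name, clean))
    return validated


good = '[{"step": "standard_scaler"}, {"step": "logistic_regression", "params": {"C": 1, "max_iter": 200}}]'
print(parse_plan(good))

try:
    parse_plan('[{"step": "__import__(\'os\').system(\'ls\')"}]')
except ValueError as err:
    print("rejected:", err)
```

Because the whitelist lives in the library rather than in the prompt, an upstream API change would only require updating the mapping from step names to estimators, not retraining or re-prompting the LLM.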

As this might be relevant beyond scikit-llm, I suggest continuing further discussion of this topic on Discord.

iryna-kondr / scikit-llm

The scope of this repo is far beyond than it can be imagined #44
