robertvacareanu / llm4regression

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update
105 stars 14 forks

Would it be also possible to use a LLM for doing Factor Analysis (FA) instead of Regression ? #2

Open jbdatascience opened 2 months ago

jbdatascience commented 2 months ago

After seeing your use of LLMs for doing regression, I can not stop wondering about this question:

Would it be also possible to use a LLM for doing Factor Analysis (FA) instead of Regression ?

Of course, Factor Analysis is a dimensionality reduction algorithm and not a regression algorithm, but I do not see a particular reason why FA could not take the place of regression in your setup with an LLM!

So my question is: is this possible? And could the performance even be better than traditional FA algorithms?

robertvacareanu commented 2 months ago

Hi, thanks for reaching out!

Interesting question! I hadn't thought about it until your message.

One issue I can foresee is context length, as you would need to give (input, output) pairs where the input is a matrix instead of a vector.

I briefly tried today to see whether LLMs could predict the largest eigenvalue of a given matrix. I created random 5x5 matrices with given (random) eigenvalues, then gave examples in-context to Claude 3 Opus. I gave only 20 examples as input and asked it to predict the output for the 21st. The results did not look terrible, but not especially good either. I only tried it with 5 seeds.
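For readers who want to try something similar, here is a minimal sketch of how such a setup could be generated. It is not the code used for the test above: the symmetric construction (a random orthogonal matrix Q applied as Q diag(λ) Qᵀ), the eigenvalue range, the prompt layout, and the helper names `random_matrix_with_eigenvalues`, `make_examples`, and `format_prompt` are all assumptions made for illustration.

```python
import numpy as np


def random_matrix_with_eigenvalues(eigenvalues, rng):
    """Return a symmetric matrix whose eigenvalues are `eigenvalues`.

    Assumption: a symmetric construction Q diag(lambda) Q^T; the original
    test may have built its matrices differently.
    """
    n = len(eigenvalues)
    # The Q factor of a Gaussian matrix is a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q @ np.diag(eigenvalues) @ q.T


def make_examples(num_examples, size, rng, low=-5.0, high=10.0):
    """Build (matrix, largest eigenvalue) pairs; the range is an assumption."""
    examples = []
    for _ in range(num_examples):
        eigenvalues = rng.uniform(low, high, size=size)
        matrix = random_matrix_with_eigenvalues(eigenvalues, rng)
        examples.append((matrix, float(eigenvalues.max())))
    return examples


def format_prompt(examples, query_matrix):
    """Lay out in-context examples followed by the query (hypothetical format)."""
    lines = []
    for matrix, target in examples:
        lines.append(f"Input: {np.round(matrix, 2).tolist()}")
        lines.append(f"Output: {target:.2f}")
    lines.append(f"Input: {np.round(query_matrix, 2).tolist()}")
    lines.append("Output:")
    return "\n".join(lines)


rng = np.random.default_rng(0)
examples = make_examples(21, 5, rng)
# 20 in-context examples, plus a 21st whose output the model must predict.
context, (query_matrix, gold) = examples[:20], examples[20]
prompt = format_prompt(context, query_matrix)
```

The resulting `prompt` string would then be sent to the model, and the model's answer compared against `gold`.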

Below are some results:

Eigenvalues of the input matrices: [4.42, 7.18, 8.94, 10.0, 7.41, 7.41, 7.36, 5.63, 7.28, 9.26, 9.74, 9.12, 7.41, 5.6, 6.34, 9.03, 8.89, 8.89, 6.41]
Claude 3 Opus predicted: 8.89
Gold is: 5.77

Eigenvalues of the input matrices: [3.53, 9.01, 2.17, 6.7, 9.78, 5.67, 9.5, 7.83, 8.53, -0.1, 3.51, 6.46, 6.94, 0.37, 9.44, 9.68, 1.32, 1.44, 8.48]
Claude 3 Opus predicted: 4.01
Gold is: 3.31

Eigenvalues of the input matrices: [9.3, 6.92, 1.08, 1.91, -3.94, 3.58, 7.18, 2.22, 6.19, 8.55, 8.33, 3.69, 2.86, 7.2, 8.59, 6.38, 8.39, 9.31, 6.41]
Claude 3 Opus predicted: 6.64
Gold is: 8.16

Eigenvalues of the input matrices: [8.66, 5.57, 6.55, 9.81, 3.08, 5.09, 9.43, 7.92, 1.29, 9.71, -3.34, 7.71, 9.24, 4.39, 7.73, 8.2, 9.41, 5.62, 6.51]
Claude 3 Opus predicted: 6.51
Gold is: 6.89



(Note: This was just a preliminary test)

Regarding whether it can perform better than traditional FA algorithms, I think it is hard to say. 

LLMs seem very good at finding the underlying pattern, so maybe there is a way to use them, directly or indirectly.

Do you have a specific use case in mind where you would like to apply FA?

jbdatascience commented 2 months ago

I have no particular use-case in mind, but I will search for one, that hopefully makes it possible to make comparisons with traditional FA algorithms.

The criterion for deciding which FA algorithm is best would be the highest percentage of explained variance.

Perhaps we can construct a synthetic data set for which we know the ground truth Factor Analysis result?
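As a rough illustration of what such a synthetic dataset could look like, the sketch below samples data from a linear factor model with known loadings, so the true share of variance explained by the common factors is available by construction. All sizes and parameter values are hypothetical, and the baseline shown is the variance captured by the top principal components of the sample covariance (PCA, not a factor analysis fit), included only as a simple reference point.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes chosen only for illustration.
n_samples, n_features, n_factors = 5000, 6, 2

# Ground truth: loadings W and per-feature noise variances psi.
loadings = rng.normal(size=(n_features, n_factors))
noise_var = rng.uniform(0.2, 0.5, size=n_features)

# Factor model: x = W z + e, with z ~ N(0, I) and e ~ N(0, diag(psi)).
factors = rng.standard_normal((n_samples, n_factors))
noise = rng.standard_normal((n_samples, n_features)) * np.sqrt(noise_var)
data = factors @ loadings.T + noise

# Known share of total variance that is due to the common factors.
common_var = np.trace(loadings @ loadings.T)
true_explained = common_var / (common_var + noise_var.sum())

# Baseline (PCA, not FA): share of sample variance captured by the
# top n_factors eigenvalues of the sample covariance matrix.
sample_cov = np.cov(data, rowvar=False)
eigvals = np.linalg.eigvalsh(sample_cov)[::-1]  # sorted descending
pca_explained = eigvals[:n_factors].sum() / eigvals.sum()

print(f"true explained variance: {true_explained:.3f}")
print(f"PCA baseline explained variance: {pca_explained:.3f}")
```

Any candidate method, whether a traditional FA algorithm or an LLM-based one, could then be scored against `true_explained` on the same `data`.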

A frequently used existing dataset for FA is the Boston Housing set: https://www.kaggle.com/datasets/altavish/boston-housing-dataset. It is better not to use the “chas” variable in this dataset, because it is highly imbalanced! Many FA results are known for this dataset, so we can compare them with potentially better algorithms!

A question about your examples, for example the first one:

Eigenvalues of the input matrices: [4.31, 4.41, 0.99, 7.87, 9.45, 8.37, 7.86, 9.56, 9.37, 0.04, 5.42, 4.49, 4.8, 9.45, 7.41, 6.98, 1.01, 5.74, 7.58, 5.22]
Claude 3 Opus predicted: 9.39
Gold is: 7.95

When I look at that list of eigenvalues, the largest one has the value 9.56. But then I do not understand the Gold value of 7.95! Am I missing something here?