pjaselin / Cubist

A Python package for fitting Quinlan's Cubist regression model
GNU General Public License v3.0

Inconsistency between `model.rules_` and `model.coeff_` output #155

Open Paulnkk opened 1 month ago

Paulnkk commented 1 month ago

Hey,

I was generating samples and training Cubist() on a 20-dimensional Rosenbrock function. Here I encountered a case where model.rules_ returned:

committee  rule variable dir  value category        type  percentile
        1     1       18  <=   1.84          continuous        0.91
        1     2       18   >   1.84          continuous        0.09
        3     1       12   >   1.07          continuous        0.26
        3     2       12  <=   1.07          continuous        0.74
        5     1       12   >   1.07          continuous        0.26
        5     2       12  <=   1.07          continuous        0.74

so we just have (committee, rule) = {(1, 1), (1, 2), (3, 1), (3, 2), (5, 1), (5, 2)}, but when we take a look at model.coeff_:

(Intercept)   0      1      2   3  ...  17     18  19  committee  rule 
[0     9216.804 NaN  164.0 -230.0 NaN  ... NaN -462.0 NaN          1     1] 
[1    12447.445 NaN    NaN    NaN NaN  ... NaN    NaN NaN          1     2] 
[2     9673.881 NaN    NaN    NaN ..............                                2    1]
[3     ......................                                     3    1]
[4     ......................                                     3    2]
[5     ......................                                     4    1]
[6     ......................                                     5    1]
[7     ......................                                     5    2]

It seems that (2, 1) and (4, 1) are omitted (probably just because they contain a single rule?), but since a linear model is created for them, shouldn't there be rules? Maybe they are pruned? Unfortunately, I'm not sure what's happening here.
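For reference, the mismatch can be computed directly by comparing the (committee, rule) pairs of the two frames. A sketch with toy stand-ins for the tables above (the stand-in frames and variable names are illustrative, not from the package):

```python
import pandas as pd

# Toy stand-ins for model.rules_ and model.coeff_ with the (committee, rule)
# pairs shown in the tables above
rules = pd.DataFrame({
    "committee": [1, 1, 3, 3, 5, 5],
    "rule":      [1, 2, 1, 2, 1, 2],
})
coeff = pd.DataFrame({
    "committee": [1, 1, 2, 3, 3, 4, 5, 5],
    "rule":      [1, 2, 1, 1, 2, 1, 1, 2],
})

# (committee, rule) pairs that have a linear model but no rule conditions
rule_pairs  = set(map(tuple, rules[["committee", "rule"]].itertuples(index=False)))
coeff_pairs = set(map(tuple, coeff[["committee", "rule"]].itertuples(index=False)))
missing = sorted(coeff_pairs - rule_pairs)
print(missing)  # → [(2, 1), (4, 1)]
```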

Thanks and best regards!

pjaselin commented 1 month ago

Hi @Paulnkk, could you try pulling the https://github.com/pjaselin/Cubist/tree/feature/viz branch, use it with your code, and see if that fixes this? I'm also going to make a change such that `rules` becomes `splits`, per the R library (backing up on that).

Paulnkk commented 1 month ago

@pjaselin Thanks for the response! Can the changes be integrated into the latest version of Cubist? Please let me know if you need support for the newest release, I am happy to assist you

pjaselin commented 1 month ago

Did you check to see if those changes work for you? I’m wrapping up some new visualization utilities


Paulnkk commented 3 weeks ago

@pjaselin thanks for the response. I checked with the current version on the branch, and there still seems to be the issue that committees with a single rule occur in model.coeffs_ but not in model.splits_ (which refers to the renamed rules and committees).

pjaselin commented 3 weeks ago

I think the reason is that a single rule has no splits right? The rule should be true for all values.

Paulnkk commented 3 weeks ago

@pjaselin thanks for the fast response and this makes sense to me

Paulnkk commented 3 weeks ago

@pjaselin I used the branch you suggested and now I receive the following error message:

raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: This Cubist instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

when calling m.predict(X). I am sure I fitted the model to train data correctly with m.fit(X_train, y_train) since attributes like m.coeffs_ and m.splits_ are available.

Edit: Found the error; in cubist.py, we have:

# make sure the model has been fitted
        check_is_fitted(
            self, attributes=["model_", "splits_", "coeff_", "feature_importances_"]
        )

but with the changes you made in https://github.com/pjaselin/Cubist/tree/feature/viz, we need:

# make sure the model has been fitted
        check_is_fitted(
            self, attributes=["model_", "splits_", "coeffs_", "feature_importances_"]
        )
pjaselin commented 3 weeks ago

Could you share a reproducible example? I'm not sure what's wrong here (especially since this is the first report of such an error) or whether it's a bug in your code.

Paulnkk commented 3 weeks ago

@pjaselin I marked the error related to the problem above.

pjaselin commented 3 weeks ago

Got it, and I also realized I need to use that in my test cases every time I use check_is_fitted, so thanks for catching this! I have some more updates coming, but I've fixed that in the current branch.

pjaselin commented 3 weeks ago

It wasn't a big change, but it did fix an issue where one of my tests was technically incorrect, so it's definitely an improvement.

Paulnkk commented 6 days ago

@pjaselin one question: I currently train Cubist on a 20-dimensional function. With 50 samples I encounter the situation that model.coeffs_ returns a table of models:

(Intercept)   0   1   2   3   4  ...  16  17  18      19  committee  rule
0    9647.322 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN          1     1
1    9797.073 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN          2     1
2    9182.491 NaN NaN NaN NaN NaN  ... NaN NaN NaN -1542.0          3     1
3    9797.073 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN          4     1
4    9182.491 NaN NaN NaN NaN NaN  ... NaN NaN NaN -1542.0          5     1
5    9797.073 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN          6     1
6    9182.491 NaN NaN NaN NaN NaN  ... NaN NaN NaN -1542.0          7     1
7    9797.073 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN          8     1
8    9182.491 NaN NaN NaN NaN NaN  ... NaN NaN NaN -1542.0          9     1
9    9797.073 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN         10     1
10   9182.491 NaN NaN NaN NaN NaN  ... NaN NaN NaN -1542.0         11     1
11   9797.073 NaN NaN NaN NaN NaN  ... NaN NaN NaN     NaN    ...

but model.splits_ is empty. I am not sure if this is the desired behaviour; I remember we discussed this before. Since the model actually generates models, there must be rules that were generated. Currently I am artificially generating rules, but this could lead to inconsistencies. What should the actual output be here?
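The case I'm describing can be detected with a simple guard. A sketch, where `splits` and `coeffs` are toy stand-ins for model.splits_ and model.coeffs_ (the column layouts follow the tables in this thread):

```python
import pandas as pd

# Stand-ins for model.splits_ / model.coeffs_ in the no-splits case
splits = pd.DataFrame(columns=["committee", "rule", "variable", "dir", "value"])
coeffs = pd.DataFrame({"(Intercept)": [9647.322, 9797.073],
                       "committee": [1, 2], "rule": [1, 1]})

# Each committee has exactly one rule and there are no split conditions,
# so every rule applies to the whole input space
no_splits = splits.empty and (coeffs.groupby("committee")["rule"].nunique() == 1).all()
print(no_splits)  # → True
```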

Edit:

I get this output, so there must be rules which apply to all cases:

    Target attribute `outcome'
Read 50 cases (21 attributes)
Model 1:
  Rule 1/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2885.358]
    outcome = 9647.322 - 685 13 - 545 5
Model 2:
  Rule 2/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 3:
  Rule 3/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 4:
  Rule 4/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 5:
  Rule 5/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 6:
  Rule 6/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 7:
  Rule 7/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 8:
  Rule 8/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 9:
  Rule 9/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 10:
  Rule 10/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 11:
  Rule 11/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 12:
  Rule 12/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 13:
  Rule 13/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 14:
  Rule 14/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 15:
  Rule 15/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 16:
  Rule 16/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 17:
  Rule 17/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 18:
  Rule 18/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Model 19:
  Rule 19/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
    outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 20:
  Rule 20/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
    outcome = 9797.073
Evaluation on training data (50 cases):
    Average  |error|           2525.645
    Relative |error|               0.88
    Correlation coefficient        0.40
    Attribute usage:
      Conds  Model
              45%    8
              45%    10
              45%    19
               5%    5
               5%    13
Time: 0.0 secs

We also discussed this concerning a similar topic in https://github.com/pjaselin/Cubist/issues/148

pjaselin commented 6 days ago

Hi @Paulnkk, I'm certainly open to figuring out a better way of representing Cubist's output, so we can continue discussing that. Can you share a minimal reproducible example? That is, can you provide code that reaches this state, not just the output itself? It looks like you got 20 committees with no splits.

Paulnkk commented 5 days ago

Thanks and sure I will share the code with you; I am generating datapoints with the following:

import numpy as np
import pandas as pd

# Run parameters (values given below in this thread: 50 samples, seed 100)
number_of_samples = 50
random_seed = 100

# Creating the data
d = 20
lb_array = np.full(d, -2.048)
ub_array = np.full(d, 2.048)

np.random.seed(random_seed)
X = np.random.uniform(lb_array, ub_array, (number_of_samples, d))  # Random values for x

# Creating a d-dimensional Rosenbrock function
y = np.sum(100 * (X[:, 1:] - X[:, :-1] ** 2) ** 2 + (X[:, :-1] - 1) ** 2, axis=1)

data = {f'x_{i + 1}': X[:, i] for i in range(d)}
data['y'] = y

# Creating DataFrame
df = pd.DataFrame(data)

# Splitting into features and labels
X = df.drop('y', axis=1)
X = transform_feature_header(X)
n = X.shape[1]
y = df['y']

Please note that transform_feature_header() is a function to transform the column names of the dataset to strings. I am using random_seed = 100 and set the Cubist parameters to: 'n_committees': 20, 'n_rules': 100, 'unbiased': False, 'extrapolation': 0.05, 'sample': None, 'cv': None

I am not using the random seed for random_state as a Cubist parameter but plan to do this in the future.

pjaselin commented 5 days ago

Can you share your more complete code, including the Cubist model training steps? Again, there are no splits since each committee comprises a single rule with no splits. If you look at the original R package, this is also how it was architected, but I'm open to rearchitecting the package to better communicate the relationship between rules, splits, and coefficients if you have suggestions. Also, FYI, Cubist sets a default random_state that you can change.

Paulnkk commented 5 days ago

Yes, sure; I generate the data as above and then hand the data over to:

model = Cubist(verbose=True, n_committees=n_committees, n_rules=n_rules, unbiased=unbiased,
                   extrapolation=extrapolation, sample=sample, cv=cv)
model.fit(X_train, y_train)

with the parameters from the previous message. After that, I access the Cubist data via model.splits_ and model.coeffs_. A note on the implementation: transform_feature_header(X) looks like:

def transform_feature_header(X):
    # TODO: We assume that headers for features are strings,
    #  make adjustment here that we just order for numerical feature references
    new_columns = [str(i) for i in range(len(X.columns))]
    X.columns = new_columns
    return X

Also another note based on the training data generation, I referenced above: I always assume that feature x is between a lower and upper bound (in the example above -2.048 and 2.048, respectively).

Another note on how I handle the situation where models are returned with a single rule per committee and no splits (since I assume that no actual rules are returned here); currently in this situation I use the following logic (note that this is the function I use to train the Cubist model when receiving model parameters):

def train_cubist(X_train, y_train, params_cubo, lb, ub):
    # Show output
    cubist_train_params = params_cubo['cubist_train_params']

    n_committees = cubist_train_params['n_committees']
    n_rules = cubist_train_params['n_rules']
    unbiased = cubist_train_params['unbiased']
    extrapolation = cubist_train_params['extrapolation']
    sample = cubist_train_params['sample']
    cv = cubist_train_params['cv']
    model = Cubist(verbose=True, n_committees=n_committees, n_rules=n_rules, unbiased=unbiased,
                   extrapolation=extrapolation, sample=sample, cv=cv)
    model.fit(X_train, y_train)

    if model.coeffs_['rule'].nunique() == 1 and model.coeffs_['rule'].unique()[0] == 1:
        rules_data, coeff_data = generate_rules_data_for_small_n_datapoints(model.coeffs_, X_train, lb, ub)
        model.splits_ = rules_data
        model.coeffs_ = coeff_data
        return model.splits_, model.coeffs_, model

    return model.splits_, model.coeffs_, model

with the function generate_rules_data_for_small_n_datapoints():

def generate_rules_data_for_small_n_datapoints(coeff_data, X, lb, ub):
    # NOTE: IMPORTANT; we use this function to generate a rule structure for a small set of datapoints
    if coeff_data['rule'].nunique() == 1 and coeff_data['rule'].unique()[0] == 1:
        # Initialize the new DataFrame
        rules_data = pd.DataFrame(columns=['committee', 'rule', 'variable', 'dir', 'value'])

        # Loop over the 'committee' column values in coeff_data
        for c in coeff_data['committee'].unique():
            # NOTE: IMPORTANT; we assume here that column names of dataset X
            #  can be used to access associated indices in the lower and upper bound arrays.
            #  If we use external function data, we might need pre-processing to ensure this structure
            for col_index, col in enumerate(X.columns):
                min_val = lb[col_index]
                max_val = ub[col_index]

                # Construct the rule data using separate rows for '>' and '<='
                rule_gt = {'committee': c, 'rule': 1, 'variable': col, 'dir': '>', 'value': min_val}
                rule_le = {'committee': c, 'rule': 1, 'variable': col, 'dir': '<=', 'value': max_val}

                # Append both rows to the DataFrame
                rules_data = pd.concat([rules_data, pd.DataFrame([rule_gt, rule_le])], ignore_index=True)

        return rules_data, coeff_data
    else:
        # If the condition is not met, return an empty rules DataFrame together with the
        # unchanged coefficients so the caller's tuple unpacking still works
        print("Something went wrong when generating rules and coeff data for single regression model")
        return pd.DataFrame(columns=['committee', 'rule', 'variable', 'dir', 'value']), coeff_data

That is, on each feature dimension of x I impose one rule that the feature lies between the lower and upper bound splits (which is consistent with the fact that every datapoint satisfies these rules). This does not seem to be the best way, since there must be differences in the rules, because the models do not all look the same either.
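For concreteness, here is a tiny standalone illustration of that bound-based splits idea (a sketch with hypothetical bounds and two features rather than a call into the function above):

```python
import pandas as pd

# Minimal standalone version of the bound-based splits idea:
# one '>' lower-bound row and one '<=' upper-bound row per feature, per committee
lb, ub = [-2.048, -2.048], [2.048, 2.048]
columns = ["0", "1"]

rows = []
for c in [1, 2]:  # two committees, each with rule 1
    for i, col in enumerate(columns):
        rows.append({"committee": c, "rule": 1, "variable": col, "dir": ">",  "value": lb[i]})
        rows.append({"committee": c, "rule": 1, "variable": col, "dir": "<=", "value": ub[i]})

rules_data = pd.DataFrame(rows)
print(len(rules_data))  # → 8 (2 committees x 2 features x 2 directions)
```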

pjaselin commented 4 days ago

I guess I'm not sure what the problem is. Do we need to develop a richer way to communicate what Cubist generates? Otherwise it seems to behave as expected.

Paulnkk commented 4 days ago

@pjaselin thanks for checking! I would be interested in what the rules look like that relate to the generated models, for instance: rule 1 = {x_1 <= 1, x_2 > 4, ...}, rule 2 = {x_1 > 2, x_3 <= 10, ...}. Since the array/dataframe behind `model.splits_` is empty, I do not have access to this information. Or does Cubist simply not generate actual rules in those cases?

Edit: I should mention that I need specific information about what the rules look like when I want to use my framework. Therefore, I 'artificially' created rules with the approach mentioned above, setting rules which by definition are satisfied for each datapoint.

pjaselin commented 4 days ago

I guess I'm confused; the verbose output shows the rules, and the model attributes show the splits and coefficients. You requested 20 committees, and the verbose output shows there are 20 committees, each containing one rule, where each rule has no splits (matching the verbose output). It makes sense that the splits attribute is empty while the coeffs attribute has values. There's no "other" data being hidden. I'm open to exploring a new way to communicate the models, but you could also explore/compile the code if you want to see this in action.

Paulnkk commented 4 days ago

Thanks, so this may be a misunderstanding on my side; what does it mean mathematically that a rule has no splits? Is each rule an infinitely large set which contains all elements from the samples? From the perspective of my application, I need an algebraic representation of the rules (in terms of splits).

Edit: If this is generally how the model is constructed, so that models are generated but no algebraic representation of the rules is available, I need to think of another way to deal with the situation.

pjaselin commented 4 days ago

If you take a look at the verbose output, you'll see no splits, so the linear equation for a given rule is true for the whole dataset. You might also want to make sure you understand the relationship between committees and rules.
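In other words, a rule's conditions can be read as a boolean mask over the data, and a rule with no splits yields an all-True mask. A hypothetical sketch (the `rule_mask` helper is illustrative, not part of the package; it assumes the splits_ columns committee/rule/variable/dir/value shown earlier):

```python
import pandas as pd

def rule_mask(splits: pd.DataFrame, committee: int, rule: int, X: pd.DataFrame) -> pd.Series:
    """Boolean mask of the rows in X that satisfy the given rule's split conditions.
    A rule with no splits covers every row."""
    conds = splits[(splits["committee"] == committee) & (splits["rule"] == rule)]
    mask = pd.Series(True, index=X.index)
    for _, s in conds.iterrows():
        col = X[s["variable"]]
        mask &= (col <= s["value"]) if s["dir"] == "<=" else (col > s["value"])
    return mask

X = pd.DataFrame({"18": [1.0, 2.0, 3.0]})
splits = pd.DataFrame({"committee": [1], "rule": [1],
                       "variable": ["18"], "dir": ["<="], "value": [1.84]})
print(rule_mask(splits, 1, 1, X).tolist())  # rule with a split → [True, False, False]
print(rule_mask(splits, 2, 1, X).tolist())  # rule with no splits → [True, True, True]
```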

Paulnkk commented 4 days ago

If I understand correctly, committees include several rules and are somewhat similar to ensembles of regular decision trees; by using committees we can achieve robustness, in theory. But I think I see the problem: my framework depends on receiving algebraic expressions for the rules at each iteration, so I can find a way to make adjustments in those situations.

pjaselin commented 4 days ago

This is the "outcome = ..." part of the verbose output: the model returns either an expression or potentially a constant that's true for all values.

Paulnkk commented 4 days ago

Ok, I understand. I plan to check manually, but I assume that in those situations the average over all models in a committee is returned as the model's forecast, and this is then averaged across committees as well.
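If that's right, the averaging I have in mind can be sketched as follows (a hedged illustration of the aggregation, not the package's internal code; the intercepts are taken from the verbose output above, and in the no-splits case each committee's single rule simply predicts its constant/linear model):

```python
import numpy as np

# Per-committee predictions for one sample; each committee here has a single
# rule with no splits, so its prediction is just that rule's model output
committee_preds = np.array([9647.322, 9797.073, 9182.491])

# The final forecast averages across the committees
forecast = committee_preds.mean()
print(round(forecast, 3))  # → 9542.295
```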

pjaselin commented 4 days ago

Ok sounds good, I'm still open to coming up with a better way to communicate this though

Paulnkk commented 3 days ago

I think the correct output is communicated. Personally, what would help is a note in the README that this kind of situation can occur, so that people using the library aren't confused when they are interested in accessing the rules output.

pjaselin commented 3 days ago

Ok I'll add that note in the README.md