Paulnkk opened this issue 1 month ago
Hi @Paulnkk, could you try pulling the https://github.com/pjaselin/Cubist/tree/feature/viz branch, use that with your code, and see if that fixes this? I'm going to make a change such that `rules` becomes `splits` per the R library (backing up on that).
@pjaselin Thanks for the response! Can the changes be integrated into the latest version of Cubist? Please let me know if you need support for the newest release; I am happy to assist you.
Did you check to see if those changes work for you? I'm wrapping up some new visualization utilities.
@pjaselin thanks for the response. I checked with the current version on your branch, and there still seems to be the issue that single rules in committees occur in `model.coeffs_` but not in `model.splits_` (both referring to the updated rules and committees).
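For reference, here is a minimal way to list the mismatching pairs (a sketch, assuming both attributes are non-empty pandas DataFrames with `committee` and `rule` columns):

```python
coeff_pairs = set(map(tuple, model.coeffs_[["committee", "rule"]].to_numpy()))
split_pairs = set(map(tuple, model.splits_[["committee", "rule"]].to_numpy()))
# (committee, rule) pairs that have coefficients but no splits
print(coeff_pairs - split_pairs)
```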
I think the reason is that a single rule has no splits, right? The rule should be true for all values.
@pjaselin thanks for the fast response and this makes sense to me
@pjaselin I used the branch you suggested and now I receive the following error message when calling `m.predict(X)`:

```
raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: This Cubist instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
```

I am sure I fitted the model to the training data correctly with `m.fit(X_train, y_train)`, since attributes like `m.coeffs_` and `m.splits_` are available.
Edit: Found the error; in `cubist.py`, we have:

```python
# make sure the model has been fitted
check_is_fitted(
    self, attributes=["model_", "splits_", "coeff_", "feature_importances_"]
)
```

but with the changes you made in https://github.com/pjaselin/Cubist/tree/feature/viz, we need:

```python
# make sure the model has been fitted
check_is_fitted(
    self, attributes=["model_", "splits_", "coeffs_", "feature_importances_"]
)
```
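For context, `check_is_fitted` raises `NotFittedError` whenever any of the listed attributes is missing from the estimator, so the stale `coeff_` name makes a correctly fitted model look unfitted. A minimal sketch (assuming `model` is a `Cubist` instance fitted on the `feature/viz` branch):

```python
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

try:
    # "coeff_" no longer exists on the fitted estimator (renamed to "coeffs_"),
    # so this raises even though the model is fitted
    check_is_fitted(model, attributes=["model_", "splits_", "coeff_", "feature_importances_"])
except NotFittedError as e:
    print(e)
```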
Could you share a reproducible example? I'm not sure what's wrong here (especially since this is the first example of such an error) or whether it's a bug in your code.
@pjaselin I marked the error related to the problem above in my edit.
Got it, and I also realized I need to use that in my test cases every time I use `check_is_fitted`, so thanks for catching this! I have some more updates coming, but I've fixed that in the current branch. It wasn't a big change, but it did fix an issue where one of my tests was technically incorrect, so it's definitely an improvement.
@pjaselin one question: I currently train Cubist on a 20-dimensional function. For 50 samples, I encounter the situation that `model.coeffs_` returns a table of models:

```
    (Intercept)   0   1   2   3   4 ...  16  17  18      19  committee  rule
0      9647.322 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN          1     1
1      9797.073 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN          2     1
2      9182.491 NaN NaN NaN NaN NaN ... NaN NaN NaN -1542.0          3     1
3      9797.073 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN          4     1
4      9182.491 NaN NaN NaN NaN NaN ... NaN NaN NaN -1542.0          5     1
5      9797.073 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN          6     1
6      9182.491 NaN NaN NaN NaN NaN ... NaN NaN NaN -1542.0          7     1
7      9797.073 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN          8     1
8      9182.491 NaN NaN NaN NaN NaN ... NaN NaN NaN -1542.0          9     1
9      9797.073 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN         10     1
10     9182.491 NaN NaN NaN NaN NaN ... NaN NaN NaN -1542.0         11     1
11     9797.073 NaN NaN NaN NaN NaN ... NaN NaN NaN     NaN        ...   ...
```

but `model.splits_` is empty. I am not sure whether this is the desired behaviour; I remember we discussed this before. Since models are actually generated, there must be rules being generated as well. Currently I am artificially generating rules, but this could lead to inconsistencies. What should the actual output be here?
Edit: I get this output, so there must be rules that apply to all cases:

```
Target attribute `outcome'
Read 50 cases (21 attributes)
Model 1:
Rule 1/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2885.358]
outcome = 9647.322 - 685 13 - 545 5
Model 2:
Rule 2/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 3:
Rule 3/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 4:
Rule 4/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 5:
Rule 5/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 6:
Rule 6/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 7:
Rule 7/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 8:
Rule 8/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 9:
Rule 9/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 10:
Rule 10/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 11:
Rule 11/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 12:
Rule 12/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 13:
Rule 13/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 14:
Rule 14/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 15:
Rule 15/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 16:
Rule 16/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 17:
Rule 17/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 18:
Rule 18/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Model 19:
Rule 19/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 3233.515]
outcome = 9182.491 - 1578 8 - 1283 10 - 1542 19
Model 20:
Rule 20/1: [50 cases, mean 9533.450, range 3676.637 to 17543.27, est err 2995.696]
outcome = 9797.073
Evaluation on training data (50 cases):
Average |error| 2525.645
Relative |error| 0.88
Correlation coefficient 0.40
Attribute usage:
Conds Model
45% 8
45% 10
45% 19
5% 5
5% 13
Time: 0.0 secs
```
We also discussed this in the context of a similar topic: https://github.com/pjaselin/Cubist/issues/148
Hi @Paulnkk, I'm certainly open to figuring out a better way of representing Cubist's output, so we can continue discussing that. Can you share a minimal reproducible example? That is, can you provide code that reaches this state, not just the output itself? It looks like you returned 20 committees with no splits.
Thanks, and sure, I will share the code with you. I am generating datapoints with the following:

```python
import numpy as np
import pandas as pd

# Creating the data
d = 20
lb_array = np.full(d, -2.048)
ub_array = np.full(d, 2.048)
np.random.seed(random_seed)  # random_seed = 100 in this run (see below)
X = np.random.uniform(lb_array, ub_array, (number_of_samples, d))  # Random values for x; number_of_samples = 50 here

# Creating a d-dimensional Rosenbrock function
y = np.sum(100 * (X[:, 1:] - X[:, :-1] ** 2) ** 2 + (X[:, :-1] - 1) ** 2, axis=1)

data = {f'x_{i + 1}': X[:, i] for i in range(d)}
data['y'] = y

# Creating DataFrame
df = pd.DataFrame(data)

# Splitting into features and labels
X = df.drop('y', axis=1)
X = transform_feature_header(X)
n = X.shape[1]
y = df['y']
```
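For reference, the target values computed above correspond to the d-dimensional Rosenbrock function:

$$y = \sum_{i=1}^{d-1} \left[ 100\,\bigl(x_{i+1} - x_i^2\bigr)^2 + \bigl(x_i - 1\bigr)^2 \right]$$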
Please note that `transform_feature_header()` is a function that transforms the column names of the dataset to strings. I am using `random_seed = 100` and set the `Cubist` parameters to:

```python
{'n_committees': 20, 'n_rules': 100, 'unbiased': False, 'extrapolation': 0.05, 'sample': None, 'cv': None}
```

I am not using the random seed for `random_state` as a `Cubist` parameter but plan to do so in the future.
Can you share your more complete code, including the Cubist model training steps? Again, there are no splits since each committee is composed of one rule with no splits. If you look at the original R package, this is also how it was architected, but I'm open to rearchitecting the package to better communicate the relationship between rules, splits, and coefficients if you have suggestions. Also FYI, Cubist sets a default `random_state` that you can change.
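For example (the seed value here is just illustrative):

```python
model = Cubist(n_committees=20, random_state=100)
```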
Yes, sure; I generate the data as above and then hand the data over to:

```python
model = Cubist(verbose=True, n_committees=n_committees, n_rules=n_rules, unbiased=unbiased,
               extrapolation=extrapolation, sample=sample, cv=cv)
model.fit(X_train, y_train)
```

with the parameters from the previous message. After that, I access the Cubist data via `model.splits_` and `model.coeffs_`.
A note on the implementation: `transform_feature_header(X)` looks like:

```python
def transform_feature_header(X):
    # TODO: We assume that headers for features are strings;
    # adjust here so that we just order by numerical feature references
    new_columns = [str(i) for i in range(len(X.columns))]
    X.columns = new_columns
    return X
```
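For example, a hypothetical two-column frame gets renamed to positional string labels:

```python
import pandas as pd

df = pd.DataFrame({"x_1": [0.1], "x_2": [0.2]})
print(transform_feature_header(df).columns.tolist())  # ['0', '1']
```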
Another note on the training data generation referenced above: I always assume that each feature `x` lies between a lower and an upper bound (in the example above, `-2.048` and `2.048`, respectively).
Another note on how I handle the situation where models are returned with a single rule (and no splits) per committee, since I assume that no actual rules are returned here. Currently, I use the following logic (please note that this is the function I use to train the Cubist model when receiving model parameters):
```python
def train_cubist(X_train, y_train, params_cubo, lb, ub):
    # Unpack the Cubist training parameters
    cubist_train_params = params_cubo['cubist_train_params']
    n_committees = cubist_train_params['n_committees']
    n_rules = cubist_train_params['n_rules']
    unbiased = cubist_train_params['unbiased']
    extrapolation = cubist_train_params['extrapolation']
    sample = cubist_train_params['sample']
    cv = cubist_train_params['cv']

    model = Cubist(verbose=True, n_committees=n_committees, n_rules=n_rules, unbiased=unbiased,
                   extrapolation=extrapolation, sample=sample, cv=cv)
    model.fit(X_train, y_train)

    # If every committee collapsed to a single rule (rule == 1 everywhere),
    # replace the empty splits with artificially generated bound rules
    if model.coeffs_['rule'].nunique() == 1 and model.coeffs_['rule'].unique()[0] == 1:
        rules_data, coeff_data = generate_rules_data_for_small_n_datapoints(model.coeffs_, X_train, lb, ub)
        model.splits_ = rules_data
        model.coeffs_ = coeff_data

    return model.splits_, model.coeffs_, model
```
with the function `generate_rules_data_for_small_n_datapoints()`:
```python
def generate_rules_data_for_small_n_datapoints(coeff_data, X, lb, ub):
    # NOTE: IMPORTANT; we use this function to generate a rule structure for a small set of datapoints
    if coeff_data['rule'].nunique() == 1 and coeff_data['rule'].unique()[0] == 1:
        # Initialize the new DataFrame
        rules_data = pd.DataFrame(columns=['committee', 'rule', 'variable', 'dir', 'value'])
        # Loop over the 'committee' column values in coeff_data
        for c in coeff_data['committee'].unique():
            # NOTE: IMPORTANT; we assume here that column names of dataset X
            # can be used to access associated indices in the lower and upper bound arrays.
            # If we use external function data, we might need pre-processing to ensure this structure
            for col_index, col in enumerate(X.columns):
                min_val = lb[col_index]
                max_val = ub[col_index]
                # Construct the rule data using separate rows for '>' and '<='
                rule_gt = {'committee': c, 'rule': 1, 'variable': col, 'dir': '>', 'value': min_val}
                rule_le = {'committee': c, 'rule': 1, 'variable': col, 'dir': '<=', 'value': max_val}
                # Append the rules to the DataFrame
                rules_data = pd.concat([rules_data, pd.DataFrame([rule_gt, rule_le])], ignore_index=True)
        return rules_data, coeff_data
    else:
        # If the condition is not met, return an empty rules DataFrame alongside the
        # unchanged coefficients, so callers can still unpack two values
        print("Something went wrong when generating rules and coeff data for single regression model")
        return pd.DataFrame(columns=['committee', 'rule', 'variable', 'dir', 'value']), coeff_data
```
That is, for each feature dimension of `x`, I impose one rule stating that the feature lies between its lower and upper bound splits (which is consistent with the fact that every datapoint satisfies these splits). This does not seem to be the best approach, since there must be differences between the rules, given that the models do not all look the same.
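For completeness, a hypothetical call using the parameters and bounds from the snippets above would look like:

```python
params_cubo = {"cubist_train_params": {
    "n_committees": 20, "n_rules": 100, "unbiased": False,
    "extrapolation": 0.05, "sample": None, "cv": None,
}}
splits, coeffs, model = train_cubist(X_train, y_train, params_cubo, lb_array, ub_array)
```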
I guess I'm not sure what the problem is. Do we need to develop a richer way to communicate what Cubist generates? Otherwise it seems to behave as expected.
@pjaselin thanks for checking! I would be interested in what the rules that relate to the generated models look like, for instance: rule 1 = {x_1 <= 1, x_2 > 4, ...}, rule 2 = {x_1 > 2, x_3 <= 10, ...}. Since the array/dataframe related to `model.splits_` is empty, I do not have access to this information. Or does Cubist simply not generate actual rules in those cases?
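For context, when splits do exist, I would expect `model.splits_` rows of the following shape (column layout as in my `generate_rules_data_for_small_n_datapoints()` above), mapping directly to such set notation:

```python
import pandas as pd

# Hypothetical illustration: rule 1 = {x_1 <= 1, x_2 > 4} as one row per condition
pd.DataFrame([
    {"committee": 1, "rule": 1, "variable": "x_1", "dir": "<=", "value": 1.0},
    {"committee": 1, "rule": 1, "variable": "x_2", "dir": ">", "value": 4.0},
])
```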
Edit: I should mention that I need specific information about what the rules look like in order to use my framework. Therefore, I 'artificially' created rules with the approach mentioned above, setting rules that are by definition satisfied for every datapoint.
I guess I'm confused: the verbose output shows the rules, and the model attributes show the splits and coefficients. You requested 20 committees, and the verbose output shows there are 20 committees, each containing one rule with no splits (corresponding to the verbose output). It makes sense that the splits attribute is empty while the coeffs attribute has values. There's no "other" data being hidden. I'm open to exploring a new way to communicate the models, but you could also explore/compile the code if you want to see this in action.
Thanks; this may be a misunderstanding on my side. What does it mean mathematically for a rule to have no splits? Is each rule an unbounded set that contains all elements of the sample? From the perspective of my application, I need an algebraic representation of the rules (in terms of splits).

Edit: If this is generally how the model is constructed, so that models are generated without an algebraic representation of the rules being available, I need to think of another way to deal with the situation.
If you take a look at the verbose output, you'll see no splits, so the linear equation for a given rule is true for the whole dataset. You might also want to make sure you understand the relationship between committees and rules.
If I understand correctly, committees include several rules and are somewhat similar to ensembles of regular decision trees, and by using committees we can in theory achieve robustness. But I think I see the problem: my framework depends on receiving algebraic expressions for the rules in each iteration. I can find a way to make adjustments in those situations.
This is the "outcome = ..." part of the verbose output; the model returns either an expression or potentially a constant that is true for all values.
Ok, I understand. I plan to check this manually, but I assume that in those situations the average over all models in a committee is returned as the forecast, and this is then also averaged across committees.
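A sketch of that manual check for the no-splits case (assuming the feature columns of `model.coeffs_` match the columns of `X`, that NaN means a feature is unused, and that each committee holds exactly one rule):

```python
import numpy as np

def predict_single_rule_committees(coeffs, X):
    # One linear model per committee; the forecast is the average over committees
    preds = []
    for _, row in coeffs.iterrows():
        # Keep only the feature coefficients; unused features (NaN) contribute 0
        coef = row.drop(["(Intercept)", "committee", "rule"]).fillna(0.0)
        preds.append(row["(Intercept)"] + X[coef.index].to_numpy() @ coef.to_numpy().astype(float))
    return np.mean(preds, axis=0)
```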
Ok sounds good, I'm still open to coming up with a better way to communicate this though
I think the correct output is communicated. Personally, what could help is a note in the README that this kind of situation can occur, so that people using the library are not confused when they want to access the rules output.
Ok I'll add that note in the README.md
Hey,

I was generating samples and trained `Cubist()` on a 20-dimensional Rosenbrock function. Here I encountered the case that `model.rules_` returned a table in which we just have (committee, rule) = {(1, 1), (1, 2), (3, 1), (3, 2), (5, 1), (5, 2)}, but when we take a look at `model.coeff_`, it seems that (2, 1) and (4, 1) are omitted (probably just because they include one rule?). But since a linear model is created for them, shouldn't there be rules? Perhaps they are pruned? Unfortunately, I'm not sure what's happening here.

Thanks and best regards!