lelaboratoire / tpot-fss

Application of TPOT's new operator, DatasetSelector

Doubts on Feature Subset Selector #1

Closed alexgcsa closed 4 years ago

alexgcsa commented 4 years ago

Hi @trang1618,

Section 2.1.1 (first paragraph) of the paper states: "From pre-defined subsets of features, the FSS operator allows TPOT to select the best subset that maximizes average accuracy in k-fold cross validation (5-fold by default)."

I was wondering how this is done...I mean, which classifier/regressor do you use to define the best subset of features to be used.

Given Figure 1 in the paper, I imagine that TPOT-FSS selects only one subset to continue. But I haven't understood how this selection is performed.

Can you provide an explanation for this part of the paper?

Cheers,

Alex

trangdata commented 4 years ago

Hi @alexgcsa, thank you for the questions!

> which classifier/regressor do you use to define the best subset of features to be used.

By default, TPOT uses all the classifiers here, and likewise for regressors. You can also specify which operators to include (or remove time-consuming ones) by modifying the config_dict parameter of TPOTClassifier or TPOTRegressor. More details here.
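As a minimal sketch of what a restricted search space could look like, assuming the classic TPOT API (where config_dict maps operator import paths to candidate hyperparameter values): the operators and values below are illustrative choices, not the defaults, and build_tpot is a hypothetical helper name.

```python
# Restricted search space in TPOT's config_dict format: each key is the import
# path of an operator, each value maps a hyperparameter to its candidate values.
# The operators and ranges here are illustrative, not TPOT's defaults.
restricted_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": range(1, 11),
    },
    "sklearn.preprocessing.StandardScaler": {},
}


def build_tpot(config):
    """Hypothetical helper: a TPOTClassifier limited to the operators in `config`."""
    from tpot import TPOTClassifier  # imported lazily; requires TPOT to be installed

    return TPOTClassifier(
        generations=5,
        population_size=20,
        cv=5,  # k-fold cross-validation used to score each pipeline
        config_dict=config,
        random_state=42,
        verbosity=2,
    )
```

Calling build_tpot(restricted_config).fit(X_train, y_train) would then search only over pipelines built from these three operators.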

> I imagine that TPOT-FSS selects only one subset to continue.

You can allow TPOT to select more than one subset by using template = 'DatasetSelector-CombineDFs-Transformer-Classifier' and specifying sel_subset in your configuration argument, as described here.
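A rough sketch of how the sel_subset candidates could be written down, under several assumptions: the operator path tpot.builtins.FeatureSetSelector and its subset_list parameter follow the released TPOT documentation (this thread calls the operator DatasetSelector), the file name subsets.csv is hypothetical, and three subsets are assumed.

```python
from itertools import combinations

N_SUBSETS = 3  # assumption: the subset file lists three feature subsets

# Candidate values when the selector may pick exactly one subset.
single_choices = list(range(N_SUBSETS))

# Candidate values when the selector may pick two subsets at once;
# each tuple is one candidate value of sel_subset.
pair_choices = list(combinations(range(N_SUBSETS), 2))

fss_config = {
    # Assumed operator path (released TPOT); the thread uses the name DatasetSelector.
    "tpot.builtins.FeatureSetSelector": {
        "subset_list": ["subsets.csv"],  # hypothetical path to the subset definitions
        "sel_subset": pair_choices,
    },
    "sklearn.preprocessing.StandardScaler": {},
    "sklearn.naive_bayes.GaussianNB": {},
}

print(pair_choices)  # [(0, 1), (0, 2), (1, 2)]
```

A dictionary like fss_config would be passed as config_dict together with the template, so that the choice of subset is searched alongside the transformer and classifier.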

> But I haven't understood how this selection is performed.

The selection is optimized just as the transformers and classifiers are: by maximizing cross-validated performance on whichever metric you specify (default: balanced accuracy).

I hope that helps!

alexgcsa commented 4 years ago

Hi @trang1618, thank you for your answers!

As far as I understand, the Feature Subset Selector selects the features before the optimization done by TPOT. In this case, the features would be selected first. Thereafter, you continue with TPOT to search for the best classifier, etc. Am I correct?

My question is about how the features are selected by the Feature Subset Selector (FSS). Because, a priori, you don't have any classifier/regressor chosen. So, how are the subsets of features evaluated?

Suppose you have 3 subsets: subset_0, subset_1, subset_2.

How does FSS decide among them to continue the optimization?

trangdata commented 4 years ago

Each subset is treated as an "operator" (along with the transformer and classifier) within a pipeline "individual". These individuals are randomly initialized in the first generation and then mutated in later generations, so a subset is never scored on its own: it is evaluated as part of the whole pipeline it belongs to.
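To illustrate the idea, here is a toy sketch in plain Python. It is not TPOT's implementation (TPOT evolves tree-based pipelines with genetic programming); the scores in toy_cv_score are made-up stand-ins for a cross-validated metric, and all names are hypothetical. It only shows that the subset is one "gene" of an individual, scored jointly with the classifier.

```python
import random

# Three candidate subsets (column indices) and three candidate classifiers.
SUBSETS = {"subset_0": [0, 1], "subset_1": [2, 3], "subset_2": [4, 5]}
CLASSIFIERS = ["GaussianNB", "DecisionTree", "LogisticRegression"]


def toy_cv_score(individual):
    """Stand-in for the cross-validated score of the whole pipeline (made-up numbers)."""
    subset_bonus = {"subset_0": 0.10, "subset_1": 0.25, "subset_2": 0.05}
    clf_bonus = {"GaussianNB": 0.05, "DecisionTree": 0.10, "LogisticRegression": 0.15}
    return 0.5 + subset_bonus[individual["subset"]] + clf_bonus[individual["classifier"]]


def random_individual(rng):
    """First generation: subset and classifier are both drawn at random."""
    return {"subset": rng.choice(sorted(SUBSETS)), "classifier": rng.choice(CLASSIFIERS)}


def mutate(individual, rng):
    """Change exactly one gene (subset or classifier) to a different value."""
    child = dict(individual)
    gene = rng.choice(["subset", "classifier"])
    options = sorted(SUBSETS) if gene == "subset" else CLASSIFIERS
    child[gene] = rng.choice([o for o in options if o != individual[gene]])
    return child


def evolve(generations=10, population_size=8, seed=0):
    """Keep the best-scoring individuals among parents and mutated offspring."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(generations):
        offspring = [mutate(ind, rng) for ind in population]
        ranked = sorted(population + offspring, key=toy_cv_score, reverse=True)
        population = ranked[:population_size]
    return population[0]


best = evolve()
print(best, toy_cv_score(best))
```

In this sketch, subset_0, subset_1, and subset_2 compete only through the scores of the pipelines that contain them, which is why no classifier needs to be fixed in advance.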

alexgcsa commented 4 years ago

Oh, I see.

Thank you so much. It helped a lot :)