scipy-conference / scipy_proceedings

Tools used to generate the SciPy conference proceedings

Paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions #947

Open AmadiGabriel opened 1 month ago

AmadiGabriel commented 1 month ago

If you are creating this PR in order to submit a draft of your paper, please name your PR with Paper: <title>. An editor will then add a paper label and GitHub Actions will be run to check and build your paper.

See the project readme for more information.

Editor: Chris Calloway @cbcunc

Reviewers:

github-actions[bot] commented 1 month ago

Curvenote Preview

| Directory | Preview | Checks | Updated (UTC) |
| --- | --- | --- | --- |
| papers/amadi_udu | 🔍 Inspect | ✅ 80 checks passed (4 optional) | Jul 10, 2024, 3:11 PM |
ejm714 commented 1 month ago

@AmadiGabriel could you please be mindful of which GitHub Actions runs you cancel? It looks like you've cancelled two of mine. I'm just trying to get my paper preview updated with a typo fix.

apaleyes commented 1 month ago

Hello @AmadiGabriel ! I am one of the reviewers of your paper. Looking forward to reading it, please watch out for some comments here.

Also, a question to chairs (@cbcunc I guess). This PR is coming with lots of data and code in addition to the paper itself. Shall we be reviewing all of that material, or only the manuscript itself?

AmadiGabriel commented 1 month ago

> Hello @AmadiGabriel ! I am one of the reviewers of your paper. Looking forward to reading it, please watch out for some comments here.
>
> Also, a question to chairs (@cbcunc I guess). This PR is coming with lots of data and code in addition to the paper itself. Shall we be reviewing all of that material, or only the manuscript itself?

@apaleyes Good to know. We look forward to your comments.

AmadiGabriel commented 3 weeks ago

> Very nicely written paper, the final render (which I was able to fish out of the build log) is beautiful and easy to read.
>
> Many minor comments are below. I also have a general question for the authors. I struggle a bit to understand the main contribution or takeaway from this paper. Let's consider some ideas:
>
> • Is it the method itself? The method appears to be a combination of several well-established techniques, without too much customisation.
> • Is it the Python implementation of the method? If so, the emphasis should probably be more on the implementation detail, software design and reuse.
> • Is it the analysis of feature selection effects on datasets with class imbalance? If so, the paper needs more unified conclusions, general observations that span beyond a single dataset. It probably also needs to analyse larger datasets.
> • Is it the analysis of the three selected models? If so, the selection of the models either needs to be expanded, or needs a solid justification (e.g. we used these models because they are most popular in a field X).
>
> I think the paper could be improved a lot if it had this single (or multiple) contribution(s) strongly emphasised, motivated, and supported. Very happy to discuss this question more here, in this PR, as part of the review process.

Thank you @apaleyes for the commendation and for the comments and observations raised; they are quite thorough and valuable. We will attempt to address all points raised and make clarifications where possible. I have drawn the attention of the co-authors accordingly. Do stand by for our responses within the week.

AmadiGabriel commented 3 weeks ago

@rowanc1 could you help check why the check is failing?

rowanc1 commented 3 weeks ago

@AmadiGabriel, we are working through a change on the action. @fwkoch will be fixing this shortly and we will rerun it for you!

cbcunc commented 3 weeks ago

Review reminder sent to @janeadams

janeadams commented 2 weeks ago

A succinct and interesting read on evaluating permutation feature importance (PFI) impacts on three different classification models (Random Forest, LightGBM, and SVM) with varying proportions of subsampled data featuring unbalanced classes. I have minor comments, but overall I think this is a great contribution.

I particularly appreciated the pre-filtering step of using hierarchical clustering of features to account for potential collinearities. I also appreciated that the authors used multiple data sets and evaluated at a range of sample proportions. This is a nice example of how a lot of scientific computing Python libraries can come together in a single interesting experiment.
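
For readers following the thread, here is a minimal, hypothetical sketch of the kind of pipeline described above: features are pre-filtered with hierarchical clustering on Spearman rank correlations to reduce collinearity, and the survivors are then ranked with permutation feature importance scored on AUC. This is not the authors' code; the toy dataset, distance threshold, and model settings are placeholders.

```python
# Hypothetical sketch, not the paper's implementation.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder imbalanced dataset (roughly 90/10 class split).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 1. Cluster features on Spearman rank correlation and keep one feature per cluster.
corr = spearmanr(X_train).correlation
corr = (corr + corr.T) / 2          # enforce symmetry
np.fill_diagonal(corr, 1.0)
dist = squareform(1 - np.abs(corr), checks=False)
linkage = hierarchy.ward(dist)
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")  # arbitrary threshold
keep = [np.flatnonzero(cluster_ids == c)[0] for c in np.unique(cluster_ids)]

# 2. Fit a model on the retained features and compute permutation importance on AUC.
clf = RandomForestClassifier(n_jobs=-1, random_state=0)
clf.fit(X_train[:, keep], y_train)
pfi = permutation_importance(clf, X_test[:, keep], y_test,
                             scoring="roc_auc", n_repeats=10, random_state=0)
for score, idx in sorted(zip(pfi.importances_mean, keep), reverse=True):
    print(f"feature {idx}: mean decrease in AUC = {score:.4f}")
```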

AmadiGabriel commented 2 weeks ago

> A succinct and interesting read on evaluating permutation feature importance (PFI) impacts on three different classification models (Random Forest, LightGBM, and SVM) with varying proportions of subsampled data featuring unbalanced classes. I have minor comments but overall I think this is a great contribution.
>
> • The dual axes in the processing time figure were odd to me at first; it might be valuable to explain that SVM's poor performance relative to the other two methods is likely due to its poor parallelizability (if that's a word).
> • The "decrease in AUC" figures are confusing in that negative x-axis values must therefore indicate an increase in AUC? (Correct me if I am misunderstanding.) This forces the reader to think about a "double negative makes a positive", which adds possibly unnecessary complexity to interpretation. I would recommend either 1) changing the axis/measure to just be "change in AUC" and/or 2) adding annotations directly onto the white space with an arrow indicating "poorer performance this direction" or similar.
>
> I particularly appreciated the pre-filtering step of using hierarchical clustering of features to account for potential collinearities. I also appreciated that the authors used multiple data sets and evaluated at a range of sample proportions. This is a nice example of how a lot of scientific computing Python libraries can come together in a single interesting experiment.

Thank you for the encouraging comments and observations on the paper @janeadams. We are currently addressing some of the comments raised by @apaleyes. Hopefully, all observations will be responded to early next week and the paper updated accordingly.

AmadiGabriel commented 1 week ago

> Very nicely written paper, the final render (which I was able to fish out of the build log) is beautiful and easy to read.
>
> Many minor comments are below. I also have a general question for the authors. I struggle a bit to understand the main contribution or takeaway from this paper. Let's consider some ideas:
>
> • Is it the method itself? The method appears to be a combination of several well-established techniques, without too much customisation.
> • Is it the Python implementation of the method? If so, the emphasis should probably be more on the implementation detail, software design and reuse.
> • Is it the analysis of feature selection effects on datasets with class imbalance? If so, the paper needs more unified conclusions, general observations that span beyond a single dataset. It probably also needs to analyse larger datasets.
> • Is it the analysis of the three selected models? If so, the selection of the models either needs to be expanded, or needs a solid justification (e.g. we used these models because they are most popular in a field X).
>
> I think the paper could be improved a lot if it had this single (or multiple) contribution(s) strongly emphasised, motivated, and supported. Very happy to discuss this question more here, in this PR, as part of the review process.

Thank you @apaleyes for your insightful comments, which have enhanced the content and quality of the paper. Our contribution relates most closely to the latter two questions you asked.

We have provided a more unified conclusion clarifying that the paper is a preliminary study of five datasets with substantial sample sizes, characterised by class imbalance. A justification for the selection of the models has been included in the text; this also informed the choice of PFI for the feature selection process, owing to its advantage of being model-agnostic. An expansion to other models and much larger datasets has been noted in the conclusion for further study.

As this is a preliminary investigation, we have also included as part of future work an expansion to introduce a quantitative measure of the variability across models and feature selection methods. The other comments have also been addressed.
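
As a side note on the model-agnostic point above, the sketch below (hypothetical, not the paper's code) applies the same scikit-learn `permutation_importance` call unchanged to the three model families discussed: Random Forest, LightGBM, and SVM. The dataset and hyperparameters are placeholders, and the `lightgbm` package is assumed to be installed.

```python
# Hypothetical illustration of PFI being model-agnostic; data and settings are placeholders.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=15, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = [
    RandomForestClassifier(n_jobs=-1, random_state=0),
    LGBMClassifier(n_jobs=-1, random_state=0),
    SVC(random_state=0),  # SVC exposes decision_function, which the AUC scorer can use
]

for model in models:
    model.fit(X_tr, y_tr)
    # The same call works for every estimator, regardless of model internals.
    pfi = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                 n_repeats=5, random_state=0)
    print(type(model).__name__, pfi.importances_mean.round(3))
```

One design note relevant to the processing-time discussion earlier in the thread: `RandomForestClassifier` and `LGBMClassifier` accept `n_jobs` for parallel fitting, while `SVC` trains on a single core.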

AmadiGabriel commented 1 week ago

> A succinct and interesting read on evaluating permutation feature importance (PFI) impacts on three different classification models (Random Forest, LightGBM, and SVM) with varying proportions of subsampled data featuring unbalanced classes. I have minor comments but overall I think this is a great contribution.
>
> • The dual axes in the processing time figure were odd to me at first; it might be valuable to explain that SVM's poor performance relative to the other two methods is likely due to its poor parallelizability (if that's a word).
> • The "decrease in AUC" figures are confusing in that negative x-axis values must therefore indicate an increase in AUC? (Correct me if I am misunderstanding.) This forces the reader to think about a "double negative makes a positive", which adds possibly unnecessary complexity to interpretation. I would recommend either 1) changing the axis/measure to just be "change in AUC" and/or 2) adding annotations directly onto the white space with an arrow indicating "poorer performance this direction" or similar.
>
> I particularly appreciated the pre-filtering step of using hierarchical clustering of features to account for potential collinearities. I also appreciated that the authors used multiple data sets and evaluated at a range of sample proportions. This is a nice example of how a lot of scientific computing Python libraries can come together in a single interesting experiment.

Thank you @janeadams for the observations and review of the paper. They have brought clarity to aspects of the data visualisation and improved the deductions on model performance.

An explanation for SVM's poorer performance has been included in the text. The axis has been changed to "change in AUC", and further explanation has been added to clarify positive and negative PFI results.
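
For illustration, a small hypothetical matplotlib sketch of the kind of annotation suggested above: a horizontal bar chart on a "change in AUC" axis with an arrow marking the direction of poorer performance. The feature names and values are made up and are not results from the paper.

```python
# Hypothetical plot; the values below are invented for illustration only.
import matplotlib.pyplot as plt
import numpy as np

features = ["feature A", "feature B", "feature C", "feature D"]
change_in_auc = np.array([-0.040, -0.012, 0.002, 0.015])

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(features, change_in_auc)
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Change in AUC when the feature is permuted")
ax.set_ylim(-0.5, 4.5)  # leave head-room above the bars for the annotation

# Arrow pointing toward negative values, i.e. poorer performance after permutation.
ax.annotate("poorer performance", xy=(-0.038, 4.0), xytext=(-0.015, 4.0),
            arrowprops=dict(arrowstyle="->"), va="center")

plt.tight_layout()
plt.show()
```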

apaleyes commented 6 days ago

Lovely, thanks for all the work on updating the paper @AmadiGabriel ! I'll have another look shortly

cbcunc commented 1 day ago

@AmadiGabriel Good to meet you at SciPy. I am inviting you to review this paper. You were sent an invitation from GitHub to be a collaborator on this repository. Please accept the invitation. Your review should be in the form of GitHub review comments: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/commenting-on-a-pull-request