My notes about possible improvements from Euroscipy tutorial

lesteve commented 5 years ago

This is not very structured, so feel free to edit, comment, open other issues for bigger chunks of work:

Content

have a TOC per notebook?
tinyurl (or huit.re) with link for easier access to the github repo (README)
first notebook with TOC so that the binder goes directly to this notebook
Too-wide code: the numerical_columns and categorical_columns lists should wrap at 'capital-loss' and 'marital-status'. I think we should use a dedicated formatter, maybe black (though I feel it takes too much vertical space) or yapf with some nice settings.
Say that education-num is not the number of years of education (I say that we could expect this, but I did not say this was not true)
Should we get rid of education-num everywhere, since this is the same as education?
Young people work part-time. Say that non-working people (students) are not part of the survey.
Put solutions in different folder? They interfere with the notebooks, you have to say: open the 02 notebook but not the one with exercise ...
Naming: harmonize df vs adult_census; maybe data is good enough.
Harmonize the way to get categorical_columns vs numerical_columns. Some code uses the dtype, some code uses explicit column names.
Side-comments about the train-test split: the goal is not to memorize. Should there be more details for the MOOC? Or links to the first part about overfitting vs underfitting.
02 exercise 01, not cross_val_score but use train_test_split
Different kind of preprocessing, add a link to user guide. Question was: what happens if the data is not gaussian.
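
One possible way to harmonize the column selection noted above is to always select by dtype. This is only a sketch: the miniature DataFrame is made up for illustration (it is not the adult census data), and make_column_selector is just one option among several.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical miniature frame that only borrows a few column names
# from the adult census dataset; the values are invented.
data = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "capital-loss": [0, 0, 1902],
        "education": ["HS-grad", "Bachelors", "Masters"],
        "marital-status": ["Never-married", "Married-civ-spouse", "Divorced"],
    }
)

# Select by dtype instead of typing (and line-wrapping) long lists of names.
numerical_columns = selector(dtype_include="number")(data)
categorical_columns = selector(dtype_exclude="number")(data)

print(numerical_columns)
print(categorical_columns)
```

Selecting by dtype also sidesteps the too-wide lists of column names, at the cost of being less explicit about which columns are used.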

n_iter_ is a list for some reason ...

print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
)
The accuracy using a Pipeline is 0.818 with a fitting time of 0.809 seconds in [13] iterations
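
A likely explanation for the [13] above: LogisticRegression stores n_iter_ as a NumPy array rather than a scalar, so the f-string prints it with brackets. The sketch below uses synthetic data from make_classification (an assumption, not the notebook's dataset) and indexes the first entry for display.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# n_iter_ is an array; for a binary problem it holds a single entry.
n_iter = model[-1].n_iter_
print(type(n_iter), n_iter.shape)

# Index the array so the message shows a plain integer.
print(f"Fitted in {n_iter[0]} iterations")
```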

Cross-validation explanation plot: add legend for blue vs red. It looks like there might be better images from scikit-learn documentation.
minor: sparse=False in OneHotEncoder just for visualization purposes (to see the numpy array).
handle_unknown='ignore': explain more the reason: to put 0 in the categories if at test time, a category has not been seen in the train data.
For exercise, have a link to the similar example, e.g. OrdinalEncoder put a link to what we did with OneHotEncoder.
Can we have links in notebooks to another notebook that work locally, on binder, etc.?
Question about the pipeline with the scaler: does it compute the mean on the training data? So you have to explain how the Pipeline works and when it calls .fit and .transform. Maybe you don't have to explain; you can just say the parameters are modified only in .fit (so not in .predict).
Question about pipeline, why is it useful rather than just writing the code yourself? You have to explain .fit and .fit_transform. Hmmm, maybe you can just add a comment about why the Pipeline is useful in general.
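
To make the handle_unknown='ignore' note above concrete, here is a minimal sketch with made-up category values. It keeps the encoder's default sparse output and calls .toarray() only to look at the result, which has the same visualization purpose as sparse=False in the notebook (recent scikit-learn releases rename that parameter to sparse_output).

```python
from sklearn.preprocessing import OneHotEncoder

# Toy single-column data (hypothetical values, not the real dataset).
train = [["Private"], ["State-gov"], ["Private"]]
test = [["Private"], ["Never-worked"]]  # "Never-worked" is absent from train

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Densify only to inspect the values.
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_)
print(encoded_test)  # the unseen category becomes a row of zeros
```

Without handle_unknown='ignore', the transform call on the unseen category would raise an error instead.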

Miscellaneous

Timings are very slow on binder: 0.7 s for the LogisticRegression fit vs 5.6 s on binder, and 2 minutes (rather than ~10 s on Olivier's machine) for "Reference pipeline (no numerical scaling and integer-coded categories)" in 02_basic_preprocessing_exercise_03_solution.ipynb

lucyleeow commented 4 years ago

Good points. My 2 cents:

handle_unknown='ignore': explain more the reason: to put 0 in the categories if at test time, a category has not been seen in the train data.

I almost included this in my suggestions. I agree, and would add that you should mention that OrdinalEncoder doesn't have a handle_unknown argument atm.

Question about the pipeline with the scaler: does it compute the mean on the training data? So you have to explain how the Pipeline works and when it calls .fit and .transform. Maybe you don't have to explain; you can just say the parameters are modified only in .fit (so not in .predict).

I was confused about fit, transform and fit_transform in preprocessing functions and thought it was useful to understand this. It was good to learn that fit doesn't literally mean 'fit' in a preprocessing function: it just calculates the required parameters and saves them as attributes on self. The name 'fit' is used for sklearn API purposes. I understand it as: fit is performed only on the training data and can use both X and y, whereas transform is performed on both training and test data (similar to predict).
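
A toy illustration of that reading, in plain Python: TinyScaler is a hypothetical stand-in written for this comment, not scikit-learn's StandardScaler, and it handles a single feature only. fit stores the statistics of the training data; transform reuses them unchanged on any data.

```python
from statistics import mean, pstdev


class TinyScaler:
    """Hypothetical one-feature scaler that mimics the fit/transform split."""

    def fit(self, X, y=None):
        # "Fitting" only computes and stores parameters from the training data.
        self.mean_ = mean(X)
        self.scale_ = pstdev(X)
        return self

    def transform(self, X):
        # Transforming never updates the stored parameters.
        return [(x - self.mean_) / self.scale_ for x in X]


train = [1.0, 2.0, 3.0]
test = [4.0, 5.0]

scaler = TinyScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # uses the training mean and scale

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```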

lesteve commented 4 years ago

Moved to https://github.com/INRIA/scikit-learn-mooc/issues/4.

lesteve / scikit-learn-tutorial

My notes about possible improvements from Euroscipy tutorial #3
