scipy-lectures / scientific-python-lectures

Tutorial material on the scientific Python ecosystem
https://lectures.scientific-python.org

Replace Boston housing example with California housing example or Ames housing example #501

Closed masmangan closed 2 years ago

masmangan commented 2 years ago

The Boston housing dataset is deprecated and will be removed from scikit-learn's datasets. This would break the example in "3.6. scikit-learn: machine learning in Python", section "Supervised Learning: Regression of Housing Data".

Reason: "DEPRECATED: load_boston is deprecated in 1.0 and will be removed in 1.2." "The Boston housing prices dataset has an ethical problem. You can refer to the documentation of this function for further details." "The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning." Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
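The deprecation timeline quoted above (deprecated in 1.0, removed in 1.2) can be sketched as a small version check. This is only an illustration: `load_boston_status` is a hypothetical helper, not part of scikit-learn, and it assumes a plain numeric "major.minor[.patch]" version string.

```python
def load_boston_status(sklearn_version: str) -> str:
    """Classify load_boston availability for a scikit-learn version string.

    Hypothetical helper, based only on the quoted deprecation notice:
    deprecated in 1.0, removed in 1.2. Assumes the first two dot-separated
    components are plain integers (e.g. "1.1.3"), not pre-release tags.
    """
    major, minor = (int(part) for part in sklearn_version.split(".")[:2])
    if (major, minor) >= (1, 2):
        return "removed"
    if (major, minor) >= (1, 0):
        return "deprecated"
    return "available"


print(load_boston_status("1.1.3"))  # deprecated
```

An example script could use such a check to decide whether it still needs to be ported away from the Boston dataset.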

Related discussion in the scikit-learn issue tracker:
https://github.com/scikit-learn/scikit-learn/issues/16155
https://github.com/scikit-learn/scikit-learn/pull/20729

This issue affects the following files:
https://github.com/scipy-lectures/scipy-lecture-notes/blob/master/packages/scikit-learn/examples/plot_boston_prediction.py
https://github.com/scipy-lectures/scipy-lecture-notes/blob/master/packages/scikit-learn/index.rst

Recommendation:
a) Develop a plot_california_prediction.py or a plot_ames_prediction.py based on plot_boston_prediction.py.
b) Update index.rst to present the plots from this new example.
c) Either keep plot_boston_prediction.py and update it to fetch the original data from CMU, or remove it.

Alternative: update index.rst to address the ethical issues with the Boston housing dataset.

tbb1984 commented 2 years ago

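This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.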

pdebuyl commented 2 years ago

Thank you for the notice, @masmangan. We'll switch to the CA example, following scikit-learn. If you wish to contribute the change as a pull request, let me know; otherwise it'll be done in the coming weeks.

masmangan commented 2 years ago

@pdebuyl I expect to have at least a partial solution for this issue by next week.

masmangan commented 2 years ago

Original Boston example is here: https://colab.research.google.com/drive/1PMos6Zy97IIil8X0de4zinFpuDjRskAd?usp=sharing

A working draft of the new CA example is here: https://colab.research.google.com/drive/1KQcOTjKJRUnzAlNpBgTdJMBZHULiVeso?usp=sharing

Not done yet:
a) Plotting the histogram is not working.
b) The features differ, so the labels need review.

Also, CA RMS is lower (0.5) than Boston RMS (2.5).

pdebuyl commented 2 years ago

Thanks for the work!

plt.scatter(data.data[feature_name], data.target)

did the plot in my test (with your notebook).

The RMS is lower, but so is the average target value :-) So the ratio RMS / "mean" is probably higher in CA, judging by the plots.

The data is less "clean-looking" on the CA side, so the source might not have been processed in the same way. The "MedInc" plot is all over the place, with plenty of near-zero income entries.
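The point about RMS versus the mean can be made concrete with a minimal sketch. The numbers below are toy values invented for illustration, not taken from either dataset, and the function names are hypothetical. They only show how a dataset with a smaller absolute RMS can still have a larger RMS relative to its mean target.

```python
import math


def rms_error(y_true, y_pred):
    """Root-mean-square of the residuals (same units as the target)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_rms(y_true, y_pred):
    """RMS error divided by the mean of the true targets (unitless)."""
    mean_target = sum(y_true) / len(y_true)
    return rms_error(y_true, y_pred) / mean_target


# Toy data, invented for illustration only: targets on a small scale ...
small_true = [1.0, 2.0, 3.0]
small_pred = [1.5, 2.5, 2.5]
# ... and targets on a scale ten times larger.
large_true = [10.0, 20.0, 30.0]
large_pred = [12.0, 22.0, 28.0]

# The small-scale data has the lower absolute RMS ...
print(rms_error(small_true, small_pred), rms_error(large_true, large_pred))
# ... but the higher RMS relative to its mean target.
print(relative_rms(small_true, small_pred), relative_rms(large_true, large_pred))
```

Comparing the unitless ratio rather than the raw RMS avoids being misled when the two targets are expressed in different units or scales.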

masmangan commented 2 years ago

Great! Plot is working now! Thanks!

masmangan commented 2 years ago

@pdebuyl This is only a partial solution. Hope it helps!

pdebuyl commented 2 years ago

It does, thank you. I am testing the code now. I have had to update scikit-learn and fix some related deprecations. I'll get it fully fixed later though (in the coming weeks, over the summer).

pdebuyl commented 2 years ago

fixed by #502