Incorrect data manipulation

GrahamMThomas commented 2 years ago

Regarding this page: https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression/3-Linear#prepare-your-data-for-regression

This step has the user convert string columns to integers. In practice, however, this imposes an artificial numerical ordering on the different package types, so the correlation value between the encoded package types and price is purely coincidental.

This error caused the Linear Regression model to reach an accuracy score of only 0.35. This is a problem because the task is fairly linear in nature, and a Linear Regression model should perform well. The reason it performed poorly is obvious in this image.

The random package string that got assigned 4 is much smaller than package 2 or 3.

I'm no expert, but I believe the way you are supposed to tackle this problem is by using One Hot Encoding. For example:

new_pumpkins = new_pumpkins.join(pd.get_dummies(new_pumpkins['Package']))

Then drop all other columns except for price. After doing this and fitting/predicting, the model accuracy is 0.85.

Finally, the example at the end that says to try a hypothetical example using 2.75 needs to be removed, as it reinforces the erroneous assumption that the integer values assigned to the packages somehow have a midpoint.
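
For readers following along, here is a minimal sketch of the difference between integer codes and one-hot encoding described above. The DataFrame, its package names, and its prices are invented for illustration and are not the lesson's pumpkin dataset; the `r_squared` helper is likewise hypothetical and uses NumPy least squares rather than the lesson's scikit-learn code.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (NOT the lesson's pumpkin dataset).
df = pd.DataFrame({
    "Package": ["bin", "bushel", "carton", "bin", "bushel", "carton"],
    "Price": [200.0, 20.0, 60.0, 210.0, 22.0, 58.0],
})


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - residual.dot(residual) / ((y - y.mean()) ** 2).sum()


y = df["Price"].to_numpy()

# Integer codes: categories are numbered alphabetically, so the ordering
# (bin=0, bushel=1, carton=2) is arbitrary with respect to price.
codes = df["Package"].astype("category").cat.codes.to_numpy(dtype=float)
r2_codes = r_squared(codes.reshape(-1, 1), y)

# One-hot: one 0/1 column per package type; drop_first removes the column
# that is redundant once an intercept is included.
dummies = pd.get_dummies(df["Package"], drop_first=True).to_numpy(dtype=float)
r2_onehot = r_squared(dummies, y)

print(f"integer codes R^2: {r2_codes:.2f}")
print(f"one-hot R^2:       {r2_onehot:.2f}")
```

With integer codes the model can only fit a single slope across an arbitrary ordering, whereas one-hot columns let each package type get its own price level. A value such as 2.75 has no meaning in either encoding.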

Kamalesh3112 commented 2 years ago

Hi, this is Kamalesh. I would like to work on and solve this issue. Can you assign the issue to me and assist me in solving it?

GrahamMThomas commented 2 years ago

I am unable to assign the issue to you, as I don't have the permissions. But I'm down to help as much as I can!

Kamalesh3112 commented 2 years ago

Oh, that's fine then. Could you help me with resolving this issue, and can I start working on it?

jlooper commented 2 years ago

hi, @GrahamMThomas and @Kamalesh3112 - I've been at a conference and wasn't able to reply earlier, but thank you, @GrahamMThomas for raising this issue. It's a very important fix to be made to the lesson! I also spoke to @shwars about it and we would indeed like to fix it as soon as possible. I'd recommend that we start by raising a PR to fix both the lesson text in English and the solution notebook. Then, I need to contact all translators to get the edits done across the board.

One thing to note is that it's ok to show the 'wrong way' of going about doing things as an example (in this case, converting to numeric values without using One Hot Encoding). But we should follow on immediately by showing the better way to solve the problem. So in this case we can add a lesson element about using this technique for much better results, so that the lesson can flow well.

We also can rethink the part at the bottom to give a better exercise.

thank you again!

Stapan17 commented 2 years ago

I would like to work on this issue if it's not assigned, should I start working on this issue?

Kamalesh3112 commented 2 years ago

@jlooper So shall I start solving this issue ?

jlooper commented 2 years ago

hi, I'm not assigning the issue to anyone, if someone would like to raise a PR I will review it

shwars commented 2 years ago

@Kamalesh3112 @Stapan17 @GrahamMThomas We had a discussion with @jlooper and agreed to change the lesson plan as follows:

- Building a linear regression based on date/month - this is likely to give low accuracy, because the relationship is inherently non-linear. This will heavily build on code and graphs discussed in the previous lesson.
- Building a polynomial regression based on date/month - this should give better accuracy.
- Showing how to add non-numeric features using one-hot encoding.
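
As a rough sketch of why a polynomial fit can beat a straight line on a non-linear relationship, the following compares the two on synthetic data. The month/price values are invented (a price that peaks mid-year), and `np.polyfit` is swapped in here for brevity; it is not the scikit-learn code used in the lesson.

```python
import numpy as np

# Synthetic, invented data: price rises, peaks mid-year, then falls.
# This is NOT the lesson's pumpkin data.
month = np.arange(1, 13, dtype=float)
price = 50.0 - (month - 6.5) ** 2


def r_squared(x, y, degree):
    """Fit a polynomial of the given degree and return R^2 on the same data."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()


r2_linear = r_squared(month, price, 1)     # straight line
r2_quadratic = r_squared(month, price, 2)  # degree-2 polynomial

print(f"linear R^2:    {r2_linear:.3f}")
print(f"quadratic R^2: {r2_quadratic:.3f}")
```

On real data the gap will be far less clean than in this toy case, and how much a polynomial helps depends on how much of the price variation the date actually explains.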

I was planning to start working on this next week, but I guess if someone wants to start working right away and make a PR - you are more than welcome to!

jlooper commented 2 years ago

Maybe we can break up the work and start on the third bullet point Dmitry notes above, to make it easier on everyone (I'm concerned about propagating this to translations in particular so it will take some organization), then address the date/month topic

booleans-oss commented 2 years ago

Building a linear regression based on date/month - this is likely to give low accuracy, because relationship is inherently non-linear. This will heavily build on code and graphs discussed in the previous lesson.

Building a polynomial regression based on date/month - this should give better accuracy

I implemented the first two bullet points on my side to see how much of an improvement that change can bring. I don't know whether the model's accuracy can depend on my computer or some other variable, but here is what I got:

Screenshot

For the Linear Regression, I obtained a Model Accuracy of 0.019, which in my opinion makes sense because, as you said, the relationship is non-linear. However, I expected a much higher accuracy with the Polynomial Regression. I got a Model Accuracy of 0.028 (an increase of about 47%), but I think that this accuracy is too low. As seen in @GrahamMThomas's message, even the linear regression Package/Price gives a better model accuracy (0.35).

There is still a possibility that my code is incorrect so here is a Gist: https://gist.github.com/booleans-oss/ead3513b89c505132732c63975eea86c

After seeing the results, I tried to reproduce the lesson and do the regression for Package/Price. I got the following results:

Screenshots

This result suggests that something may be different in either my code or my computer, since there is a bias: my results are ~50/60 smaller than the expected value. But this bias should not contradict the results of the Month/Price regression, as it would not be a drastic change (due to the small nature of the values).

abetpal commented 2 years ago

@jlooper Can I work on this issue? It would be very helpful if you could point out which files need to be changed.

shwars commented 2 years ago

@abetpal thanks for offering help! A fix for this issue has been proposed already, you can have a look once it's merged and see if you have any further suggestions.

jlooper commented 2 years ago

I'm going to close this issue as Dmitry has posted a fix which I will be merging shortly. thank you everyone!

microsoft / ML-For-Beginners

Incorrect data manipulation #543