numbats / wages-refresh

This will be a repo for re-building, and documenting the process of the wages data from the NLSY
MIT License

Result of Robust Linear Model #3

Closed Dewi-Amaliah closed 3 years ago

Dewi-Amaliah commented 3 years ago

Hi Professor Di and Emi,

I have fitted a robust linear model (using the rlm function from MASS) separately for each ID in the dataset, and it is much faster and more computationally feasible than the robust linear mixed model. Further, if an observation's weight is < 1, I imputed that observation by replacing its value with the model's predicted value.
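For readers following along, the per-ID fit and imputation described above could be sketched roughly as follows. This is only an illustration, not the actual code used in this repo: the toy data, the column names (id, year, wage, weight, wage_rlm), the formula wage ~ year, and the helper impute_one are all assumptions made for the example. It relies on rlm's default Huber weighting, under which an observation gets a weight below 1 when its scaled residual is large.

```r
library(MASS)

# Hypothetical long-format data: one row per person-year (not the NLSY data).
set.seed(1)
wages <- data.frame(
  id   = rep(1:2, each = 8),
  year = rep(1:8, times = 2)
)
wages$wage <- 5 + 0.5 * wages$year + rnorm(16, sd = 0.2)
wages$wage[5] <- 60  # inject one implausible spike for ID 1

# Fit a robust linear model for a single ID and replace any observation
# whose final IWLS weight is below 1 with the model's fitted value.
impute_one <- function(d) {
  fit <- rlm(wage ~ year, data = d)
  d$weight   <- fit$w
  d$wage_rlm <- ifelse(fit$w < 1, fitted(fit), d$wage)
  d
}

# Apply the helper to each ID separately and stack the results.
cleaned <- do.call(rbind, lapply(split(wages, wages$id), impute_one))

# Compare the distributions before and after imputation.
summary(cleaned$wage)
summary(cleaned$wage_rlm)
```

Note that with a threshold of weight < 1, mildly unusual observations can be replaced as well as extreme spikes, so it may be worth checking how many values change per ID.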

I found the result as seen in this picture:

[Image: Screen Shot 2021-02-19 at 6 41 31 pm]

I also plotted some of the IDs and compared them by cleaning method:

[Image: Screen Shot 2021-02-19 at 6 42 31 pm]

[Image: Screen Shot 2021-02-19 at 7 05 17 pm]

[Image: Screen Shot 2021-02-19 at 7 28 55 pm]

In addition, the summary of the current clean data is:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0.01    4.50    7.13    9.26   11.53  268.33

And the summary of the data cleaned using the robust linear model is:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0.020   4.629   7.195   9.165  11.441 192.300

Based on this result, should the data cleaned with the robust linear model be used in the final data? Or is there anything I should explore further?

Thank you, Dewi

dicook commented 3 years ago

Hi Dewi, Yes 😺
