Data preprocessing - Githubissues

yim-fan commented 1 year ago

Preprocessing, clean up data, remove outliers, etc.

yim-fan commented 1 year ago

Almost done; I will push another commit by the end of Tuesday.

pagand commented 1 year ago

@yim-fan please post a link to the pushed commit so I can check the results.

yim-fan commented 1 year ago

Prepration/preprocessing.ipynb

yim-fan commented 1 year ago

Done: dealing with outliers

yim-fan commented 1 year ago

Done: separating normal and adversarial situations

yim-fan commented 1 year ago

https://github.com/pagand/model_optimze_vessel/blob/6ce51ab10e43f105c94b3d335219dfcff3776477/Prepration/imputing_adversarial.ipynb

Outlier detection and analysis are done.
TODO: add rows for the lines missing from each trip, and impute missing values using a Kalman filter

pagand commented 1 year ago

@yim-fan is the imputation also done? Also, have you considered clustering the data?

yim-fan commented 1 year ago

The imputation is already done. As I presented in the week 6 meeting, I did not get the Kalman filter working for imputation. Instead, I did a naive imputation that fills in each missing value with the average of the previous and next non-missing values. It should be good enough for now. If we have time later, I will see whether I need to come back to the Kalman filter to improve model performance.

I impute each field as follows:
- DEPTH: impute with the population mode.
- HEADING, WIND_SPEED, WIND_SPEED_TRUE, WIND_ANGLE, WIND_ANGLE_TRUE: impute with the mean value within the corresponding trip.
- Everything else: impute from the previous and next non-missing values. For example, if a speed value is missing, the previous sample is 800, and the next sample is 900, then 850 is used to fill the gap.

The code can be found here: https://github.com/pagand/model_optimze_vessel/blob/371adc1af0885be6e3d5e09267ad77f0b41993b8/Prepration/imputing_and_outlier.ipynb
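
For illustration only (this is not the notebook's code), the per-field rules described above could be sketched in plain Python roughly as follows. The `trip_id` key and the `SPEED` column name are hypothetical, missing values are assumed to be `None`, and rows are assumed to be in time order with the same keys in every row. Whether the neighbor average should be restricted to a single trip is not stated in the thread; this sketch simply runs it over the rows as given.

```python
from statistics import mean, multimode

# Field groups taken from the comment above.
MODE_FIELDS = {"DEPTH"}
TRIP_MEAN_FIELDS = {
    "HEADING", "WIND_SPEED", "WIND_SPEED_TRUE", "WIND_ANGLE", "WIND_ANGLE_TRUE",
}


def neighbor_average(values):
    """Fill each None with the average of the nearest non-missing value
    before it and the nearest non-missing value after it.

    Gaps at the very start or end have only one neighbor and stay None.
    """
    n = len(values)
    prev_seen, last = [], None
    for v in values:  # forward pass: last observed value so far
        if v is not None:
            last = v
        prev_seen.append(last)
    next_seen, nxt = [None] * n, None
    for i in range(n - 1, -1, -1):  # backward pass: next observed value
        if values[i] is not None:
            nxt = values[i]
        next_seen[i] = nxt
    filled = list(values)
    for i, v in enumerate(values):
        if v is None and prev_seen[i] is not None and next_seen[i] is not None:
            filled[i] = (prev_seen[i] + next_seen[i]) / 2
    return filled


def impute(rows, trip_key="trip_id"):
    """Return a copy of rows (list of dicts) with missing values filled.

    trip_key is a hypothetical column identifying the trip of each row.
    """
    out = [dict(r) for r in rows]
    columns = [c for c in rows[0] if c != trip_key]
    for col in columns:
        if col in MODE_FIELDS:
            # Population mode over all observed values (first mode on ties).
            observed = [r[col] for r in rows if r[col] is not None]
            if observed:
                fill = multimode(observed)[0]
                for r in out:
                    if r[col] is None:
                        r[col] = fill
        elif col in TRIP_MEAN_FIELDS:
            # Mean of the observed values within the same trip.
            by_trip = {}
            for r in rows:
                if r[col] is not None:
                    by_trip.setdefault(r[trip_key], []).append(r[col])
            for r in out:
                if r[col] is None and r[trip_key] in by_trip:
                    r[col] = mean(by_trip[r[trip_key]])
        else:
            # Everything else: average of previous and next observed values.
            for r, v in zip(out, neighbor_average([r[col] for r in rows])):
                r[col] = v
    return out


rows = [
    {"trip_id": 1, "DEPTH": 10, "HEADING": 90.0, "SPEED": 800},
    {"trip_id": 1, "DEPTH": None, "HEADING": None, "SPEED": None},
    {"trip_id": 1, "DEPTH": 10, "HEADING": 100.0, "SPEED": 900},
]
filled = impute(rows)
print(filled[1]["SPEED"])  # 850.0, matching the 800/900 example above
```

With this approach, a run of several consecutive missing samples all receive the same value (the average of the two surrounding observations), which is one reason it is only a naive baseline.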

yim-fan commented 1 year ago

For clustering, I tried K-means and DBSCAN. Neither seems to improve model performance, so I excluded the clustering feature from the current model.

yim-fan commented 11 months ago

No need for clustering since the model performs OK so far.

pagand / model_optimze_vessel

Data preprocessing #13