The idea is to train an autoencoder on the entire dataset and use its middle layer (the embedding) as a feature for later training a classifier on the train set. This is really time-consuming, so we'll have to use an out-of-the-box solution for some of the transformations.
Read before starting the issue:
Metadata feature extraction:
- autoencoders are trained to reproduce the input, which means we can use the entire train+test set for training; let's split it 90% train / 10% validation and use cross-validation to compare results
- drop meaningless and null columns from the metadata before training
- after initial experiments on just the metadata, add more features to the training (all of the dask aggregations, for example; a rough sketch follows after this list)
- train a chosen model (xgb/lgb) and check whether it improves CV scores (a minimal CV loop is sketched after this list). Select a few (3-5) of the most promising sets of added features, generate predictions from them and upload them to kaggle to check how the score changes.
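A rough sketch of the metadata cleanup and dask aggregation steps above. The file paths and the light-curve column names (`flux`, `flux_err`, `mjd`) are placeholders; only `object_id` comes from the issue itself, and the exact aggregation list is still to be decided.

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file names -- adjust to the real dataset layout.
META_PATH = "training_set_metadata.csv"
SERIES_PATH = "training_set.csv"

# Metadata cleanup: drop columns that are entirely null or constant.
meta = pd.read_csv(META_PATH)
drop_cols = [c for c in meta.columns
             if meta[c].isna().all() or meta[c].nunique(dropna=True) <= 1]
meta = meta.drop(columns=drop_cols)

# Per-object aggregations of the light curves, computed out of core with dask.
series = dd.read_csv(SERIES_PATH)
aggs = series.groupby("object_id").agg(
    {"flux": ["mean", "std", "min", "max"], "flux_err": ["mean", "std"]}
).compute()
aggs.columns = ["_".join(col) for col in aggs.columns]
aggs = aggs.reset_index()

# Expanded feature table: metadata + aggregations, joined on object_id.
features = meta.merge(aggs, on="object_id", how="left")
```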
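And a minimal CV loop for comparing feature sets with LightGBM, as mentioned in the last item. `meta_features`, `features` and `target` are the hypothetical outputs of the steps above; class labels are assumed to already be integer-encoded (0..num_class-1).

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_score(X, y, n_splits=5, seed=42):
    """Mean multi-class log-loss of a LightGBM model over stratified folds."""
    y = np.asarray(y)
    params = {
        "objective": "multiclass",
        "num_class": len(np.unique(y)),
        "metric": "multi_logloss",
        "learning_rate": 0.05,
        "verbosity": -1,
    }
    scores = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X, y):
        train_set = lgb.Dataset(X.iloc[train_idx], y[train_idx])
        valid_set = lgb.Dataset(X.iloc[valid_idx], y[valid_idx])
        booster = lgb.train(params, train_set, num_boost_round=1000,
                            valid_sets=[valid_set],
                            callbacks=[lgb.early_stopping(50, verbose=False)])
        scores.append(booster.best_score["valid_0"]["multi_logloss"])
    return float(np.mean(scores))

# Usage idea: compare the plain metadata features against each augmented set.
# baseline = cv_score(meta_features, target)
# with_aggs = cv_score(features, target)
```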
Time series feature extraction:
- requires all time series to be of equal length (within each object_id and across object_ids) - this means we need to wait for the clustering results in #50 to divide the dataset into clusters of similar objects. Padding the series to a common length should be done via a random strategy, as suggested in the kaggle discussions (don't remember which one right now); see the sketch after this list.
- train a chosen model (xgb/lgb) on the metadata features with the time series embeddings appended, and check whether it improves CV scores. Select a few (3-5) of the most promising sets of added features, generate predictions from them and upload them to kaggle to check how the score changes.
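One possible shape for the length-equalization step, assuming the light curves sit in a long-format frame with `object_id`, `mjd` and `flux` columns (assumed names) and that the "random strategy" means re-sampling each object's own observed values to pad short series. To be confirmed against the kaggle discussion and adjusted once the #50 clusters are available.

```python
import numpy as np
import pandas as pd

def to_fixed_length(series_df, value_col="flux", time_col="mjd",
                    target_len=128, seed=0):
    """Equal-length matrix of per-object series: truncate long series, pad short
    ones by randomly re-sampling their own observed values (placeholder for the
    'random strategy' from the discussions)."""
    rng = np.random.default_rng(seed)
    rows, ids = [], []
    for object_id, group in series_df.groupby("object_id"):
        values = group.sort_values(time_col)[value_col].to_numpy()
        if len(values) >= target_len:
            values = values[:target_len]
        else:
            pad = rng.choice(values, size=target_len - len(values), replace=True)
            values = np.concatenate([values, pad])
        rows.append(values)
        ids.append(object_id)
    return pd.DataFrame(rows, index=ids)
```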
Architectures:
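Nothing is fixed here yet; as a starting point, a minimal dense autoencoder sketch (Keras chosen only for brevity, layer sizes arbitrary) that exposes the middle layer as the embedding described in the idea above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features, embedding_dim=16):
    """Dense autoencoder; the encoder output is the embedding used as a feature."""
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dense(32, activation="relu")(x)
    embedding = layers.Dense(embedding_dim, activation="relu", name="embedding")(x)
    x = layers.Dense(32, activation="relu")(embedding)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)

    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, embedding)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# autoencoder, encoder = build_autoencoder(n_features=X.shape[1])
# autoencoder.fit(X, X, validation_split=0.1, epochs=50, batch_size=256)  # 90/10 split
# embeddings = encoder.predict(X)   # features for the downstream xgb/lgb classifier
```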