niclaswue / ATOW-Prediction

https://ansperformance.eu/study/data-challenge
GNU General Public License v3.0

TODO collection for future reference #6

Open wurzelDeveloper opened 1 month ago

wurzelDeveloper commented 1 month ago

TODOs:

Add features to improve the prediction. All pairwise products and ratios of the numeric columns should be tried as features.
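A minimal sketch of what "all multiplications and divisions" could look like, using only the standard library. The column names in the example are hypothetical placeholders, not taken from the actual dataset, and the near-zero guard (`eps`) is an assumption about how to handle division by zero.

```python
from itertools import combinations


def add_pairwise_features(row, columns, eps=1e-9):
    """Return a copy of `row` extended with the product and both ratios
    of every pair of columns.

    `row` is a dict of numeric values; `columns` lists the keys to combine.
    A ratio whose denominator is (nearly) zero is stored as None.
    """
    out = dict(row)
    for a, b in combinations(columns, 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]
        out[f"{a}_div_{b}"] = row[a] / row[b] if abs(row[b]) > eps else None
        out[f"{b}_div_{a}"] = row[b] / row[a] if abs(row[a]) > eps else None
    return out


# Hypothetical column names, for illustration only.
example = {"flown_distance": 1000.0, "flight_duration": 120.0, "taxiout_time": 0.0}
features = add_pairwise_features(
    example, ["flown_distance", "flight_duration", "taxiout_time"]
)
```

With n numeric columns this adds 3 * n * (n - 1) / 2 features per row, so the feature-importance step in the plan further down is probably needed to prune them again.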

Feature ideas:

Further improvements:

Use k-fold cross-validation to get a reliable signal without waiting for the leaderboard.
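A rough, standard-library sketch of the k-fold idea. It assumes RMSE as the score (the issue does not name the metric), and `fit_predict` is a hypothetical callback standing in for whatever model gets trained. The toy target values are made up.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def cross_val_rmse(y, fit_predict, k=5, seed=0):
    """Average validation RMSE over k folds.

    `fit_predict(train_idx, val_idx)` must train on the training indices
    and return one prediction per validation index.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k, seed):
        preds = fit_predict(train_idx, val_idx)
        scores.append(rmse([y[i] for i in val_idx], preds))
    return sum(scores) / len(scores)


# Made-up target values and a trivial baseline that predicts the training mean.
y = [60000.0, 62000.0, 58000.0, 75000.0, 71000.0,
     64000.0, 59000.0, 70000.0, 66000.0, 61000.0]


def mean_baseline(train_idx, val_idx):
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return [mean] * len(val_idx)


score = cross_val_rmse(y, mean_baseline, k=5)
```

Fixing the seed keeps the folds identical across experiments, so score differences come from the model or features rather than from a different split.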

Clean the data: remove quasi-duplicates with the same TOW.
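One possible reading of "quasi-duplicate", sketched below: rows that agree on a set of key columns and on the target value, even if other fields such as an ID differ. Both the choice of key columns and the column names are assumptions made for illustration.

```python
def drop_quasi_duplicates(rows, key_columns, target="tow"):
    """Keep the first row for each (key columns..., target) combination.

    `rows` is a list of dicts. Rows that match on every key column and on
    the target value are treated as quasi-duplicates of each other.
    """
    seen = set()
    kept = []
    for row in rows:
        signature = tuple(row[c] for c in key_columns) + (row[target],)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept


# Hypothetical rows: the first two differ only in flight_id.
rows = [
    {"flight_id": 1, "date": "2022-01-01", "adep": "EGLL", "ades": "EDDF",
     "aircraft_type": "A320", "tow": 65000.0},
    {"flight_id": 2, "date": "2022-01-01", "adep": "EGLL", "ades": "EDDF",
     "aircraft_type": "A320", "tow": 65000.0},
    {"flight_id": 3, "date": "2022-01-01", "adep": "EGLL", "ades": "EDDF",
     "aircraft_type": "A320", "tow": 66200.0},
]
cleaned = drop_quasi_duplicates(rows, ["date", "adep", "ades", "aircraft_type"])
```

Whether such rows should be dropped or merely down-weighted is an open question; it may be worth comparing cross-validation scores with and without this step.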

https://www.transtats.bts.gov/AverageFare/

https://www.transtats.bts.gov/Data_Elements.aspx?Data=4

https://data.europa.eu/data/datasets/43c6ugqwp92dx7vlgnzja?locale=en

Diversion Airports: https://www.bts.gov/topics/airlines-and-airports/domestic-flights-tarmac-times-more-3-hours-and-international-flights-9

Taxi out time:
https://www.transtats.bts.gov/ONTIME/OriginDestination.aspx
https://www.transtats.bts.gov/ONTIME/Departures.aspx

https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr

https://www.bts.gov/browse-statistical-products-and-data/bts-publications/data-bank-28ds-t-100-domestic-segment-data

How to split into regional, business, cargo, ...: https://www.eurocontrol.int/sites/default/files/2022-05/eurocontrol-market-segment-update-2022-05.pdf

https://www.easa.europa.eu/eco/eaer/appendix

Fuel flow per engine, etc.: https://www.easa.europa.eu/en/domains/environment/icao-aircraft-engine-emissions-databank

Fuel prices 2022: https://www.kaggle.com/datasets/zusmani/petrolgas-prices-worldwide/data (CC0 license, found via Google)

https://data.transportation.gov/Aviation/Consumer-Airfare-Report-Table-1a-All-U-S-Airport-P/tfrh-tu9e/about_data

https://destinationinsights.withgoogle.com/intl/en_ALL/

Likely not allowed to use, but worth checking: if results improve dramatically, that would warrant manual collection of more data. https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook/data

Idea from Lukas: Add tax dataset?

Idea for a general plan:

  1. We go very broad and extensively leverage open datasets
  2. We train a small model to find out which features are most important
  3. For these features, we do feature engineering or include additional datasets from the domain
  4. We clean the existing data and features to reduce noise
  5. We scale up the model to the biggest possible size
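
Step 2 of the plan could be approached with permutation importance, sketched below with the standard library only. This is one option among several (tree-based importances or SHAP values would be alternatives), not necessarily what the plan intends. `toy_model` is a hypothetical stand-in for the trained small model and the data is synthetic.

```python
import random


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, X, y, seed=0):
    """Increase in RMSE when each feature column is shuffled in turn.

    `predict(row)` maps one feature row (a list) to a prediction.
    A larger increase means the model relies more on that feature.
    """
    rng = random.Random(seed)
    base = rmse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        importances.append(rmse(y, [predict(r) for r in X_perm]) - base)
    return importances


# Hypothetical stand-in for a trained small model: it only uses feature 0.
def toy_model(row):
    return 100.0 * row[0]


# Synthetic data: feature 1 is noise that the model ignores.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

Here the ignored feature gets an importance of exactly zero. On real data the scores should be computed on held-out rows so that they reflect generalization rather than memorization.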