redhairedcelt / Final-Project-Group1

ML2 Final Project for Group 1
MIT License

Final-Project-Group1

Introduction and Project Summary

Our group was interested in exploring Recurrent Neural Networks (RNNs), so we chose a project focused on predicting the next value in a given sequence. Specifically, we analyzed 3.6 million US airline flights over a period of 6 months. Each record in our dataset includes the airline name, a unique identifier for each aircraft, the origin and destination airports, and the time of the flight. We defined a flight as one unique aircraft flying from one of about 350 airports to another. Our research question is “Given a sequence of N airports visited by a unique plane, can we predict the next airport (N+1)?”
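As a rough illustration of how the research question turns into training examples, the sketch below slides a window of length N over one aircraft's airport sequence and pairs each window with the airport that follows it. The function names (make_windows, build_vocab) and the sample itinerary are hypothetical and are not taken from the repo's scripts; the actual preprocessing may differ.

```python
from typing import Dict, List, Sequence, Tuple


def build_vocab(sequences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each airport code to an integer index, in sorted order."""
    codes = sorted({code for seq in sequences for code in seq})
    return {code: idx for idx, code in enumerate(codes)}


def make_windows(airports: Sequence[str], n: int) -> List[Tuple[List[str], str]]:
    """Slide a window of length n over one aircraft's airport sequence.

    Each example pairs n consecutive airports with the airport that follows.
    A sequence with n or fewer airports yields no examples.
    """
    return [
        (list(airports[i:i + n]), airports[i + n])
        for i in range(len(airports) - n)
    ]


# Hypothetical itinerary for a single aircraft (illustrative data only).
itinerary = ["ATL", "JFK", "ATL", "LAX", "SEA", "ATL"]

examples = make_windows(itinerary, n=3)
vocab = build_vocab([itinerary])

# Integer-encode inputs and targets so they can be fed to an embedding layer.
encoded = [([vocab[a] for a in x], vocab[y]) for x, y in examples]
```

With about 350 airports, the target becomes a single class label out of roughly 350, so the task can be treated as multi-class classification over the airport vocabulary.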

We believe this approach applies to many problems beyond predicting the next airport. Our world is awash in devices that record their time and location, leading to an explosion of geospatial data that can inform everything from advertising to pandemic responses. Often, the data is too dense for traditional geospatial analysis methods, and a common approach is to represent a dataset as a network or sequence of known locations visited. Applying a methodology similar to the one proposed here will likely lead to additional insights in multiple fields and business cases.

Code Summary

To facilitate efficient model development, training, and evaluation in an environment with multiple versions of the input datasets, we developed a data ingest and preprocessing pipeline as a series of Python scripts, which are detailed below. This repo is intended to be cloned into the user's home directory, and all directory paths are hard-coded to look for the repo's main folder under "~/".

To manage the different types of models and versions of our data, which can be segmented by airline and divided into sequences of different lengths in the model pipeline, we used “model_name” and “run_name” variables to track activity across different RNN models (LSTM, GRU, seq2seq, etc.) and airline/sequence length “runs” respectively. These variables, which control where data and models are saved and loaded, are found at the beginning of each script.
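A minimal sketch of how such variables might drive file locations is shown below. The example values, the "Models" folder, and the helper names (data_path, model_path) are assumptions made for illustration; only the repo folder name and the Data directory come from this README.

```python
from pathlib import Path

# Assumed layout: the repo is cloned directly into the user's home directory.
REPO_ROOT = Path.home() / "Final-Project-Group1"

# Hypothetical values; the strings used in the actual scripts may differ.
model_name = "LSTM"
run_name = "delta_seq50"


def data_path(root: Path, run: str) -> Path:
    """Folder holding the preprocessed data for one airline/sequence-length run."""
    return root / "Data" / run


def model_path(root: Path, model: str, run: str) -> Path:
    """Location for a saved model, keyed by both model type and run.

    The "Models" folder name is an assumption, not a confirmed repo path.
    """
    return root / "Models" / f"{model}_{run}"


print(data_path(REPO_ROOT, run_name))
print(model_path(REPO_ROOT, model_name, run_name))
```

Keying saved artifacts on both names lets several model types share one preprocessed run without overwriting each other's outputs.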

Running this Code

The repo is configured to run a baseline model: a two-layer LSTM for Delta Airlines with a sequence length of 50. If the repo is cloned into the home directory '/home/ubuntu', a user must first run the EDA_and_cleaning.py script in '/home/ubuntu/Final-Project-Group1/Code/baseline_scripts' to install any needed Python packages and unzip the Data directory. After that, the remaining scripts in '/home/ubuntu/Final-Project-Group1/Code/baseline_scripts' can be run in any order to explore this baseline model. Running other models requires first generating the data, and in some cases the models, that they depend on. Please see the flow chart below for additional details on data processing.

Data Processing Overview

EDA_and_cleaning.py:

ACF and seasonality.py:

preprocessing.py:

modeling.py:

modeling_seq2seq.py:

model_evaluate.py: