tOverney / ADA-Project

Project for the Applied Data Analysis course
Apache License 2.0

Switzerland and Trains

Abstract

Switzerland possesses one of the densest train networks in the world.

More than 15% [^1] of people working in Switzerland commute by train, a proportion that rises above 80% for people commuting between the biggest cities.

We decided to dig into train users' habits and see how our train network is actually used. The analysis has two focuses. First, we will look at train occupancy through time and space. Second, we will investigate the relationship between the number of trains arriving at a station and the size of the city hosting that station.

Data description

The main data set we will use is the one provided directly by SBB (the Swiss railway company covering the whole country) [^2]. We will complement it with Wikipedia data, Swiss Government data [^3], or both, to get information about the cities served by the train stations. For the first part, we will use data about train occupancy and the coordinates of the relevant cities. For the second part, we will need the train timetable to see how many connections each station has. We will also need as much information about the cities as possible in order to build a feature list for our ML pipeline, which will try to predict how many connections a city gets.
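To make the second part more concrete, here is a minimal sketch of how arrival counts could be joined with city attributes to build feature rows. The row layout, the station names and the population values are placeholders we made up for illustration; the actual SBB export and city tables will have a different schema.

```python
from collections import Counter

# Hypothetical timetable rows: (station name, arrival time "HH:MM").
# Field layout and values are assumptions, not the real SBB schema.
arrivals = [
    ("Station A", "07:02"),
    ("Station A", "07:15"),
    ("Station A", "08:40"),
    ("Station B", "07:05"),
]

# Hypothetical city table keyed by station name (placeholder numbers).
cities = {
    "Station A": {"population": 100_000},
    "Station B": {"population": 20_000},
}


def arrivals_per_station(rows):
    """Count how many trains arrive at each station."""
    return Counter(station for station, _ in rows)


def build_features(rows, city_table):
    """Join arrival counts with city attributes, skipping unknown stations."""
    counts = arrivals_per_station(rows)
    return [
        {"station": station, "arrivals": n, **city_table[station]}
        for station, n in sorted(counts.items())
        if station in city_table
    ]


features = build_features(arrivals, cities)
for row in features:
    print(row)
```

Stations without a matching city entry are dropped here; in practice we would probably want to log them, since mismatched names between sources are likely.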

Feasibility and risks

Feasibility

We took a glance at the SBB data set, and the data seem well organized and should not require much cleaning. We do not yet know exactly what we can fetch regarding train use, or how. Moreover, it seems that we can only gather the forecast of how full a train will be (as shown in the mobile app and on the SBB website), not the actual occupancy. The Wikipedia data set is also clean and should allow us to retrieve information about Swiss cities (population, coordinates, etc.) quite easily. Population information could also be extracted from the data provided directly by the Swiss government, and coordinates from the Google Maps API, so we have several sources for city-related information.

Risks

We want to make a dynamic yet readable visualization of train occupancy depending on the time of the week (allowing you to travel through time). But none of us has any experience in building such a custom-tailored visualization.
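Whatever the front end ends up being, the visualization needs occupancy grouped by time of the week. A minimal sketch of that grouping is shown below, assuming each record is a timestamp plus an occupancy level from 1 (low) to 3 (high); both the record format and the scale are our assumptions, not the SBB format.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (ISO timestamp, occupancy forecast level 1-3).
records = [
    ("2017-01-02T07:30:00", 3),  # a Monday
    ("2017-01-02T07:45:00", 2),
    ("2017-01-03T14:00:00", 1),  # a Tuesday
]


def mean_by_week_slot(rows):
    """Average occupancy per (weekday, hour) slot; weekday 0 is Monday."""
    buckets = defaultdict(list)
    for stamp, level in rows:
        t = datetime.fromisoformat(stamp)
        buckets[(t.weekday(), t.hour)].append(level)
    return {slot: sum(levels) / len(levels) for slot, levels in buckets.items()}


slots = mean_by_week_slot(records)
print(slots)
```

A time slider would then only need to look up one `(weekday, hour)` key per frame.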
We also do not know which ML pipeline and data features to use to predict trains per hour from city characteristics. There are many factors to account for, and we fear that, depending on what we include, our results will not be conclusive.
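One way to reduce this risk would be to start from a very simple baseline before trying anything more elaborate. The sketch below fits an ordinary least-squares line of trains per hour against the logarithm of the population. The log transform is our assumption, the numbers are placeholders chosen to make the arithmetic easy to follow, and this is not the pipeline we have settled on.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder values, not real measurements.
populations = [1_000, 10_000, 100_000]
trains_per_hour = [2.0, 4.0, 6.0]

# Assumption: city sizes span orders of magnitude, so we regress on log10.
log_pop = [math.log10(p) for p in populations]
slope, intercept = fit_line(log_pop, trains_per_hour)
print(f"trains/hour = {slope:.2f} * log10(population) + {intercept:.2f}")
```

If a single feature like this already explains most of the variation, adding more city features may not be worth the extra complexity.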

Deliverables

Our main deliverable will be a single-page web app (stack not yet decided) displaying our analysis and visualization in the nicest possible way. We will also provide all our source code (through this repository), including the app itself and all the preprocessing done to obtain the datasets required for our visualization (including the ML pipeline).

Timeplan

This is the part we are least sure about. We know the direction we want to follow, but there are too many unknowns for us to be sure how it will unfold.

That being said, here is roughly how we see things going:

[^1]: Swiss Info: Les Suisses se déplacent toujours plus loin pour aller au travail

Result

In the end, the project turned out to be more of a data wrangling project than a data visualization project. But here is a snippet of the result.

A live version can be found at swisstrains.overney.org.

You can also run the program locally by following the aggregator's README.

Since the presentation, we have re-enabled the Google Maps controls to zoom and move around.

Followed Pipeline

Data sources

Data wrangling

Data processing

Data visualization