A PD2 report template that incorporates all formatting necessary for students in the course, so that they can focus on the actual content instead of worrying about less important details.
Python is a great language that is easy to learn and has an amazing community.
A good community means better documentation and better packages.
scikit-learn and gensim are two widely used tools that provide a lot of well-tested, reusable code.
Industry examples: Yelp and Bloomberg have both used scikit-learn to prototype code.
I will use my project with Dr. Rumi Chunara to go over why we chose Python over other languages such as Java, C, and Scala.
Analysis
Experimenting
IPython notebooks
Preprocessing
Not all data comes in a nice format, so we need to write code that takes in the raw data and prepares it for transformations. This is easily abstracted in the form of Pipelines and Transformers, which allow us to write very modular, reusable pieces of code.
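A minimal sketch of the idea, assuming scikit-learn is the library in use. The TextCleaner class and the sample strings are hypothetical and only illustrate how a custom Transformer can sit in front of a standard one inside a Pipeline; they are not taken from the project.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: lowercases text and collapses whitespace."""

    def fit(self, X, y=None):
        # Stateless step: nothing is learned from the data.
        return self

    def transform(self, X):
        # Normalise each raw document into a single-spaced, lowercase string.
        return [" ".join(str(doc).lower().split()) for doc in X]


# Each step is a (name, transformer) pair; the output of one feeds the next.
preprocess = Pipeline([
    ("clean", TextCleaner()),
    ("counts", CountVectorizer()),
])

# Made-up raw input with inconsistent casing and spacing.
raw_docs = ["  Flu season   is HERE ", "No flu\ttoday"]

# Sparse document-term matrix with one row per document.
features = preprocess.fit_transform(raw_docs)
vocabulary = sorted(preprocess.named_steps["counts"].vocabulary_)
print(features.shape)
print(vocabulary)
```

Because the cleaning logic lives in its own step, it can be reused in other pipelines or swapped out without touching the vectorizer.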
Pipelines and Featurizers
Multitude of Classifiers
GridSearch
Metrics and Model Evaluation
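The items above can be sketched together in one short script, assuming scikit-learn. The synthetic dataset, the two candidate classifiers, and the parameter values are illustrative assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for real project data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Featurizer plus classifier; the "clf" step is replaced during the search.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One grid per candidate classifier, each with its own hyperparameters.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [25, 50]},
]

# Cross-validated search over every classifier and parameter combination.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test split.
predictions = search.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)
print(search.best_params_)
print(classification_report(y_test, predictions))
```

Searching over the whole pipeline rather than the bare classifier means the scaler is refit inside each cross-validation fold, which avoids leaking test-fold statistics into training.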
Cons
Conclusions
Python is good for medium-sized data, and many tools are available for larger data, but once the data gets huge, something else may be needed.
We have access to HPC, which provides substantial computing power.
Python provides end-to-end support for the project.
Recommendations
Prototype in Python; it's highly productive.
That doesn't mean Python isn't suited to production. There are plenty of tools: pandas, dask, scikit-learn, gensim.
Table of Contents
Executive Summary
Introduction
Analysis
Conclusions
Recommendations