jxnl / pd2report

A PD2 report template that incorporates all formatting necessary for students in the course, so that they can focus on the actual content instead of worrying about less important details.

Analysis Topics #3

Closed jxnl closed 9 years ago

jxnl commented 9 years ago

Exploration

popular language with great tools

It is important to note that, with the split between Python 2.7 and 3.4, various parts of the Python ecosystem still sit on 2.7. It is recommended that future work be done in 3.4 and developed in virtual environments in order to maintain compatibility. Luckily, the majority of the packages we use for pipelining have already been ported to 3.4.

jxnl commented 9 years ago

Processing, pipelines, and classifiers

Preprocessing
Pipelining
from sklearn.base import BaseEstimator, TransformerMixin


class ItemGetter(BaseEstimator, TransformerMixin):
    """
    ItemGetter
    ~~~~~~~~~~
    ItemGetter is a Transformer for Pipeline objects.
    Usage:
        Initialize the ItemGetter with a `key` and its
        transform call will select a column out of the
        specified DataFrame.
    """

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, but fit must
        # return self so it can be chained inside a Pipeline.
        return self

    def transform(self, X, y=None):
        # Select the configured column out of the DataFrame.
        return X[self.key]
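To make the column-selection behaviour concrete, here is a dependency-free sketch of the same idea; `SimpleItemGetter` and the sample dict are invented for illustration and stand in for the sklearn-based transformer above:

```python
class SimpleItemGetter:
    """Minimal stand-in for ItemGetter: picks one key out of a mapping."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X, y=None):
        return X[self.key]


data = {"text": ["hello world", "foo bar"], "user": [1, 2]}
getter = SimpleItemGetter("text")
print(getter.fit(data).transform(data))  # ['hello world', 'foo bar']
```

Inside a `Pipeline`, each branch would call `transform` the same way, receiving only its own column of the input.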

By using this API, we can build complex feature structures in very few lines of code:

Pipeline([
     ("features", FeatureUnion([
           ("text", Pipeline([
                            ("get", ItemGetter("text")),
                            ("tfidf", TfidfTransformer()),
                            ("lsi", TruncatedSVD()),
                      ])),
           ("user", Pipeline([
                            ("get", ItemGetter("user")),
                            ("network", NetworkFeatures()),
                       ])),
     ]))
])

With scikit-learn's API, the classifier becomes just one more named step:

Pipeline([
     ("features", FeatureUnion([
           ("text", Pipeline([
                            ("get", ItemGetter("text")),
                            ("tfidf", TfidfTransformer()),
                            ("lsi", TruncatedSVD()),
                      ])),
           ("user", Pipeline([
                            ("get", ItemGetter("user")),
                            ("network", NetworkFeatures()),
                       ])),
     ])),
     ("classifier", LinearRegression())
])

In the context of machine learning, hyperparameter optimization or model selection is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Often cross-validation is used to estimate this generalization performance. Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems, but optimize a loss function on the training set alone. In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization ensures the model does not overfit its data by tuning, e.g., regularization.
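The k-fold splitting that underlies cross-validation can be sketched in a few lines of plain Python; `kfold_indices` is a hypothetical helper written for illustration, not a library function:

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Each of the k folds serves as the held-out test set exactly once,
    so generalization performance can be averaged over the folds.
    """
    # Distribute n samples as evenly as possible over k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


for train, test in kfold_indices(10, 3):
    print(test)  # [0, 1, 2, 3] then [4, 5, 6] then [7, 8, 9]
```

Scikit-learn's grid search utilities perform splits like this internally while scoring each candidate hyperparameter setting.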

  • While other frameworks require custom grid-search code to be developed, and potentially complex methods of injecting parameters into models, by taking advantage of Pythonic metaprogramming we simply need to specify a parameter grid and the pipeline.
params = {
    'clf__C': uniform(0.01, 1000),
    'features__text__tfidf__analyzer':['word', 'char'],
    'features__text__tfidf__lowercase': [False, True],
    'features__text__tfidf__max_features': list(range(10000, 100000, 1000)),
    'features__text__tfidf__ngram_range': list(n_grams(3, 14)),
    'features__text__tfidf__norm': ['l2']
}
clf = RandomizedSearchCV(pipeline, params, n_iter=60, n_jobs=4, verbose=1, scoring="f1")
clf.fit(X_train, y_train)

With this in mind, building a basic text classifier can be extremely simple from start to finish.

X, y = get_data()
pipeline = Pipeline([
  ("tfidf", TfidfTransformer()),
  ("lsi", TruncatedSVD()),
  ("clf", LogisticRegression()),
])
pipeline.fit(X, y)
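As a runnable variant of the snippet above, here is a minimal end-to-end sketch, assuming scikit-learn is installed. The toy documents and labels are invented for illustration, and `TfidfVectorizer` stands in for `TfidfTransformer` so raw strings can be fed in directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented corpus: label 1 = spam-like, label 0 = normal.
X = ["free money now", "win cash prize", "meeting at noon", "lunch tomorrow?"]
y = [1, 1, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # raw text -> tf-idf features
    ("clf", LogisticRegression()),  # linear classifier on top
])
pipeline.fit(X, y)

print(pipeline.predict(["cash prize now"]))  # predicts the spam-like class
```

The same `pipeline` object could be handed to `RandomizedSearchCV` exactly as shown earlier, with parameters addressed as `tfidf__...` and `clf__...`.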
Technical Debt
jxnl commented 9 years ago

Model Evaluation

Performance Metrics
Model Inspection
jxnl commented 9 years ago

Exiting Python

exploration
pipeline design
vowpal wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

  • Due to these efficiency tricks, Vowpal Wabbit is the king of large-data, single-machine problems.
  • However, its API is less than simple, consisting primarily of CLI tools.
  • With respect to actually using Vowpal Wabbit, there does exist a Python API, and the tutorials have this to say:

This tutorial walks you through writing learning code using the VW python interface. Once you've completed this, you can graduate to the C++ version, which will be faster for the computer but more painful for you.

  • However, as a consequence, we lose our ability to easily develop integrated pipelines and grid search.
  • At the cost of developer time, we gain computing performance.
  • Lastly, Vowpal Wabbit simply does not support all the algorithms that we may want to use; the same tricks it uses for efficiency are only available on a small subset of machine learning problems.
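For reference, Vowpal Wabbit consumes a plain-text example format of the shape `label |namespace feature:value feature ...`. A minimal sketch of a converter from feature dicts to that format; `to_vw` is a hypothetical helper written for illustration, not part of VW's API:

```python
def to_vw(label, features, namespace="f"):
    """Render one example in Vowpal Wabbit's text input format:

        <label> |<namespace> name:value name ...

    Features with value 1 can be written as a bare name, since VW
    treats a missing value as 1.
    """
    parts = " ".join(
        name if value == 1 else f"{name}:{value}"
        for name, value in features.items()
    )
    return f"{label} |{namespace} {parts}"


line = to_vw(1, {"word_count": 42, "has_link": 1})
print(line)  # 1 |f word_count:42 has_link
```

Lines like these would be written to a file and passed to the `vw` CLI for training, which is exactly the hand-rolled glue code that scikit-learn's Pipeline API lets us avoid.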
Spark MLLib