jxnl / pd2report

A PD2 report template that incorporates all formatting necessary for students in the course, so that they can focus on the actual content instead of worrying about less important details.

Analysis Topics #3

Closed jxnl closed 9 years ago

jxnl commented 9 years ago

Exploration

popular language with great tools

It is important to note that, with the split between Python 2.7 and 3.4, various parts of the Python ecosystem still sit on 2.7. It is recommended that future work be done in 3.4 and developed in virtual environments in order to maintain compatibility. Luckily, the majority of the packages we use for pipelining have already been ported to 3.4.

jxnl commented 9 years ago

Processing, pipelines, and classifiers

Preprocessing
Pipelining
from sklearn.base import BaseEstimator, TransformerMixin


class ItemGetter(BaseEstimator, TransformerMixin):
    """
    ItemGetter
    ~~~~~~~~~~
    ItemGetter is a Transformer for Pipeline objects.
    Usage:
        Initialize the ItemGetter with a `key` and its
        transform call will select a column out of the
        specified DataFrame.
    """

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, but fit must
        # return self so it can be chained inside a Pipeline.
        return self

    def transform(self, X, y=None):
        # Select the configured column out of the DataFrame.
        return X[self.key]
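To make the column-selection behaviour concrete, here is a dependency-free sketch of the same idea; `SimpleItemGetter` and the sample dict are invented for illustration and stand in for the sklearn-based transformer above:

```python
class SimpleItemGetter:
    """Minimal stand-in for ItemGetter: picks one key out of a mapping."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X, y=None):
        return X[self.key]


data = {"text": ["hello world", "foo bar"], "user": [1, 2]}
getter = SimpleItemGetter("text")
print(getter.fit(data).transform(data))  # ['hello world', 'foo bar']
```

Inside a `Pipeline`, each branch would call `transform` the same way, receiving only its own column of the input.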

By using this API, we can build complex feature structures in very few lines of code:

Pipeline([
     ("features", FeatureUnion([
           ("text", Pipeline([
                            ("get", ItemGetter("text")),
                            ("tfidf", TfidfTransformer()),
                            ("lsi", TruncatedSVD()),
                      ])),
           ("user", Pipeline([
                            ("get", ItemGetter("user")),
                            ("network", NetworkFeatures()),
                       ])),
     ]))
])

With scikit-learn's API, the classifier becomes just one more named step:

Pipeline([
     ("features", FeatureUnion([
           ("text", Pipeline([
                            ("get", ItemGetter("text")),
                            ("tfidf", TfidfTransformer()),
                            ("lsi", TruncatedSVD()),
                      ])),
           ("user", Pipeline([
                            ("get", ItemGetter("user")),
                            ("network", NetworkFeatures()),
                       ])),
     ])),
     ("classifier", LinearRegression())
])

In the context of machine learning, hyperparameter optimization or model selection is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Often cross-validation is used to estimate this generalization performance. Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems, but optimize a loss function on the training set alone. In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization ensures the model does not overfit its data by tuning, e.g., regularization.
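The k-fold splitting that underlies cross-validation can be sketched in a few lines of plain Python; `kfold_indices` is a hypothetical helper written for illustration, not a library function:

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Each of the k folds serves as the held-out test set exactly once,
    so generalization performance can be averaged over the folds.
    """
    # Distribute n samples as evenly as possible over k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


for train, test in kfold_indices(10, 3):
    print(test)  # [0, 1, 2, 3] then [4, 5, 6] then [7, 8, 9]
```

Scikit-learn's grid search utilities perform splits like this internally while scoring each candidate hyperparameter setting.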

  • While other frameworks require custom grid-search code to be developed, and potentially complex methods of injecting parameters into models, by taking advantage of Pythonic metaprogramming we simply need to specify a parameter grid and the pipeline.
params = {
    'clf__C': uniform(0.01, 1000),
    'features__text__tfidf__analyzer':['word', 'char'],
    'features__text__tfidf__lowercase': [False, True],
    'features__text__tfidf__max_features': list(range(10000, 100000, 1000)),
    'features__text__tfidf__ngram_range': list(n_grams(3, 14)),
    'features__text__tfidf__norm': ['l2']
}
clf = RandomizedSearchCV(pipeline, params, n_iter=60, n_jobs=4, verbose=1, scoring="f1")
clf.fit(X_train, y_train)

With this in mind, building a basic text classifier can be extremely simple from start to finish.

X, y = get_data()
pipeline = Pipeline([
  ("tfidf", TfidfTransformer()),
  ("lsi", TruncatedSVD()),
  ("clf", LogisticRegression()),
])
pipeline.fit(X, y)
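As a runnable variant of the snippet above, here is a minimal end-to-end sketch, assuming scikit-learn is installed. The toy documents and labels are invented for illustration, and `TfidfVectorizer` stands in for `TfidfTransformer` so raw strings can be fed in directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented corpus: label 1 = spam-like, label 0 = normal.
X = ["free money now", "win cash prize", "meeting at noon", "lunch tomorrow?"]
y = [1, 1, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # raw text -> tf-idf features
    ("clf", LogisticRegression()),  # linear classifier on top
])
pipeline.fit(X, y)

print(pipeline.predict(["cash prize now"]))  # predicts the spam-like class
```

The same `pipeline` object could be handed to `RandomizedSearchCV` exactly as shown earlier, with parameters addressed as `tfidf__...` and `clf__...`.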
Technical Debt
jxnl commented 9 years ago

Model Evaluation

Performance Metrics
Model Inspection
jxnl commented 9 years ago

Exiting Python

exploration
pipeline design
vowpal wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

  • Due to these efficiency tricks, Vowpal Wabbit is the king of large-data, single-machine problems.
  • However, its API is less than simple, consisting primarily of CLI tools.
  • With respect to actually using Vowpal Wabbit, there does exist a Python API, and the tutorials have this to say:

This tutorial walks you through writing learning code using the VW python interface. Once you've completed this, you can graduate to the C++ version, which will be faster for the computer but more painful for you.

  • However, as a consequence, we lose our ability to easily develop integrated pipelines and grid search.
  • At the cost of developer time, we gain computing performance.
  • Lastly, Vowpal Wabbit simply does not support all the algorithms that we may want to use; the same tricks it uses for efficiency are only available on a small subset of machine learning problems.
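For reference, Vowpal Wabbit consumes a plain-text example format of the shape `label |namespace feature:value feature ...`. A minimal sketch of a converter from feature dicts to that format; `to_vw` is a hypothetical helper written for illustration, not part of VW's API:

```python
def to_vw(label, features, namespace="f"):
    """Render one example in Vowpal Wabbit's text input format:

        <label> |<namespace> name:value name ...

    Features with value 1 can be written as a bare name, since VW
    treats a missing value as 1.
    """
    parts = " ".join(
        name if value == 1 else f"{name}:{value}"
        for name, value in features.items()
    )
    return f"{label} |{namespace} {parts}"


line = to_vw(1, {"word_count": 42, "has_link": 1})
print(line)  # 1 |f word_count:42 has_link
```

Lines like these would be written to a file and passed to the `vw` CLI for training, which is exactly the hand-rolled glue code that scikit-learn's Pipeline API lets us avoid.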
Spark MLLib