Develop baseline model - Githubissues

egproulx commented 5 years ago

Our baseline model will be a multi-class logistic regression model on the concatenation of the article tokens and the claim metadata.

mattesko commented 5 years ago

Can pretty much just do this:

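# #### Logistic Regression
from sklearn import preprocessing
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

K_FOLD = 3

# X_train / X_test are assumed to be raw text strings and
# y_train / y_test the matching class labels, defined elsewhere.
logistic_regression = Pipeline([
    ('vect', text.TfidfVectorizer()),
    # with_mean=False keeps the TF-IDF matrix sparse while scaling.
    ('scale', preprocessing.StandardScaler(with_mean=False)),
    ('norm', preprocessing.Normalizer()),
    # saga supports both penalties searched below and multi-class targets.
    ('clf', LogisticRegression(solver='saga', max_iter=1000))
])

params = {
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.1, 0.5, 1, 5, 10]}

# Grid search over penalty and C with K_FOLD-fold cross-validation.
grid_cv = GridSearchCV(logistic_regression, params, cv=K_FOLD)
grid_cv.fit(X_train, y_train)
print(f'Logistic Regression: \n{classification_report(y_test, grid_cv.predict(X_test))}')
print(f'Best Params: {grid_cv.best_params_}')
log_reg_params = grid_cv.best_params_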

mattesko commented 5 years ago

I have it working in a notebook: 60% F1 score on the test set. The features are a TF-IDF encoding of the raw concatenation of claimant, claim, and article content.
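For reference, a minimal sketch of that feature construction. The field names (`claimant`, `claim`, `article`) and the toy records are hypothetical stand-ins, not the project's actual data schema:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy records; the real dataset's field names may differ.
records = [
    {"claimant": "Jane Doe", "claim": "The moon is made of cheese",
     "article": "Scientists confirm the moon is made of rock."},
    {"claimant": "John Roe", "claim": "Water boils at 100 C at sea level",
     "article": "At sea level, water boils at 100 degrees Celsius."},
]

def concat_fields(record):
    """Join claimant, claim, and article content into one raw string."""
    return " ".join([record["claimant"], record["claim"], record["article"]])

texts = [concat_fields(r) for r in records]

# One TF-IDF row per record, one column per vocabulary term.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)
```

The resulting sparse matrix `X` is the kind of input the `'vect'` step produces inside the pipeline, so in practice the concatenated strings can be passed straight to the pipeline instead.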

violetguos commented 5 years ago

I just tried a fully connected (FC) layer on a subset of 200 data entries, splitting those 200 into train and test sets. I got 50% accuracy.

Process: raw text (preprocessed) -> TF-IDF -> FC -> prediction
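Roughly, that process could look like the sketch below. It uses sklearn's `MLPClassifier` as the FC model; the toy texts, the hidden layer size, and the split ratio are assumptions for illustration, not the settings from the actual experiment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the 200-entry subset.
texts = (["claim is supported by the article"] * 10
         + ["claim is contradicted by the article"] * 10)
labels = [1] * 10 + [0] * 10

# Split the small subset into train and test portions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# raw text -> TF-IDF -> one fully connected hidden layer -> prediction
fc_model = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MLPClassifier(hidden_layer_sizes=(100,), max_iter=500,
                          random_state=0)),
])
fc_model.fit(X_train, y_train)
accuracy = fc_model.score(X_test, y_test)
print(f'Test accuracy: {accuracy:.2f}')
```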

As I mentioned in the report (copied below):

This was a workshop paper submitted to a similar challenge. Instead of predicting whether a claim is true, their challenge was to identify a specific kind of fake news in which the headline differs from the body text. Given a headline and a body text, they predict one of four labels ('agree', 'disagree', 'discuss', or 'unrelated').

The model uses three input features: TF and TF-IDF features of a title and its body, and a cosine similarity score between the two. These are fed to a fully connected neural network layer with ReLU activations. They managed to achieve 81% accuracy.
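A rough sketch of how such features might be assembled. The exact layout used in the paper is not spelled out here, so this is an assumption: raw term counts for the headline and for the body, plus the cosine similarity of their TF-IDF vectors. The headlines and bodies are made-up examples:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up headline/body pairs for illustration.
headlines = ["Moon made of cheese, says report", "Water found on Mars"]
bodies = ["A new report claims the moon is made of cheese.",
          "Stock markets fell sharply Monday."]

# Fit shared vocabularies on headlines and bodies together.
corpus = headlines + bodies
tf = CountVectorizer().fit(corpus)
tfidf = TfidfVectorizer().fit(corpus)

# Term-count (TF) vectors for each headline and each body.
tf_head = tf.transform(headlines).toarray()
tf_body = tf.transform(bodies).toarray()

# Cosine similarity between each headline and its own body (TF-IDF space).
tfidf_head = tfidf.transform(headlines)
tfidf_body = tfidf.transform(bodies)
cos = np.array([
    cosine_similarity(tfidf_head[i], tfidf_body[i])[0, 0]
    for i in range(len(headlines))
]).reshape(-1, 1)

# Final feature row: [TF headline | TF body | cosine similarity]
features = np.hstack([tf_head, tf_body, cos])
print(features.shape)
```

The first pair shares several words and gets a positive similarity, while the second pair shares none, which is the kind of signal that would separate related from unrelated headline/body pairs.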

Possible reasons for the lower accuracy:

- The training set is too small.
- I just used sklearn's neural network class, which is very simple and does not let us try things like gradient clipping or optimizers beyond its built-in solvers.
- TF-IDF by itself is not sufficient.

The solution would be to use PyTorch, where we can customize things (read: we need to build stuff from scratch). Possibly we should also try more powerful features, such as probabilistic word embeddings. Example here
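To give an idea of what the PyTorch route involves, here is a minimal sketch of an FC network with ReLU, trained with Adam and gradient clipping. The layer sizes, learning rate, clipping threshold, and random stand-in data are all assumptions for illustration, not tuned or project-specific values:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes: 50 TF-IDF features, 3 output classes.
n_features, n_classes = 50, 3
X = torch.rand(32, n_features)          # stand-in for a dense TF-IDF batch
y = torch.randint(0, n_classes, (32,))  # stand-in labels

# One hidden fully connected layer with ReLU, then a linear output layer.
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_classes),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    losses.append(loss.item())

print(f'first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}')
```

A real version would feed batches of TF-IDF (or embedding) features from the dataset and evaluate on a held-out split.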

mattesko / COMP550-Project

Develop baseline model #1