mattesko / COMP550-Project

Fake news text classification project for the McGill COMP 550 Natural Language Processing course.

Develop baseline model #1

Open egproulx opened 4 years ago

egproulx commented 4 years ago

Our baseline model will be a multi-class logistic regression model on the concatenation of the article's tokens and the claim's metadata.

mattesko commented 4 years ago

Can pretty much just do this:

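# #### Logistic Regression

from sklearn import preprocessing
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

K_FOLD = 3

# Assumes X_train / X_test are lists of raw text strings and
# y_train / y_test the matching labels, defined earlier in the notebook.
logistic_regression = Pipeline([
        ('vect', text.TfidfVectorizer()),
        # with_mean=False keeps the sparse TF-IDF matrix sparse
        ('scale', preprocessing.StandardScaler(with_mean=False)),
        ('norm', preprocessing.Normalizer()),
        # saga supports both the l1 and l2 penalties searched below
        ('clf', LogisticRegression(solver='saga', max_iter=1000))
    ])

params = {
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.1, 0.5, 1, 5, 10]}

# Grid search over penalty and C with K_FOLD-fold cross-validation
grid_cv = GridSearchCV(logistic_regression, params, cv=K_FOLD)
grid_cv.fit(X_train, y_train)
print(f'Logistic Regression: \n{classification_report(y_test, grid_cv.predict(X_test))}')
print(f'Best Params: {grid_cv.best_params_}')
log_reg_params = grid_cv.best_params_

mattesko commented 4 years ago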

Have it working in a notebook: 60% test F1 score. Features are the TF-IDF encoding of the raw concatenation of claimant, claim, and article content.
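
For reference, the concatenation step could look roughly like the sketch below. The field names `claimant`, `claim`, and `article` and the two sample records are made up for illustration and are not the dataset's actual schema; the joined strings would then go to the `TfidfVectorizer` step of the pipeline above.

```python
def concat_fields(record, fields=("claimant", "claim", "article")):
    """Join the raw text fields of one record into a single string.

    Missing or None fields are skipped. The field names are hypothetical.
    """
    parts = [str(record.get(name) or "").strip() for name in fields]
    return " ".join(p for p in parts if p)


# Made-up records, only to show the shape of the input
records = [
    {"claimant": "Jane Doe",
     "claim": "The moon is made of cheese.",
     "article": "Scientists disagree."},
    {"claimant": None,
     "claim": "Taxes went up.",
     "article": "A report shows taxes rose."},
]

# One raw string per record, ready to be vectorized
X_raw = [concat_fields(r) for r in records]
```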

violetguos commented 4 years ago

I just tried a fully connected (FC) layer on a subset of 200 data entries, split into train and test sets, and got 50% accuracy.

Process: raw text (preprocessed) -> TFIDF -> FC -> prediction
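
To make the last two steps concrete, here is a minimal sketch of a single fully connected layer followed by softmax and argmax. The input vector, weights, and bias are toy values chosen by hand; nothing is trained here, and this is not the code that was actually run. In practice the input would be a TF-IDF row and the weights would be learned.

```python
import math


def fc_forward(x, weights, bias):
    """One fully connected layer: logits[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values, assumed for illustration only
x = [0.0, 0.6, 0.8]            # stand-in TF-IDF vector over a 3-term vocabulary
weights = [[0.5, -1.0, 0.2],   # row for class 0
           [-0.3, 1.2, 0.4]]   # row for class 1
bias = [0.1, -0.1]

probs = softmax(fc_forward(x, weights, bias))
# Prediction is the index of the most probable class
predicted = max(range(len(probs)), key=probs.__getitem__)
```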

As I mentioned in the report (copied below):

  • This was a workshop paper submitted to a similar challenge. Instead of predicting whether a claim is true, their challenge was to identify a specific kind of fake news where the headline differs from the body text. Given a headline and a body text, they predict one of 4 labels ('agree', 'disagree', 'discuss', or 'unrelated').

    • The model uses three input features: TF and TF-IDF features of a title and its body, and a cosine similarity score between the two, fed into a fully connected neural network layer with ReLU activations. They managed to achieve 81% accuracy.
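
As a rough illustration of the similarity feature described above, the sketch below computes cosine similarity between term-frequency vectors of a headline and a body. It uses plain whitespace tokenization and raw counts, which is an assumption; the paper's exact tokenization and weighting may differ, and the example strings are made up.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Raw term counts using lowercase whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty text as having no similarity
    return dot / (norm_a * norm_b)


# Made-up headline/body pair
headline = "moon made of cheese"
body = "the moon is not made of cheese"

score = cosine_similarity(term_frequencies(headline), term_frequencies(body))
```

A score near 1 means the headline and body share most of their vocabulary, and a score of 0 means they share none, which is why it is a useful signal for the 'unrelated' label.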

Possible reasons: