Open egproulx opened 5 years ago
Can pretty much just do this:
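# #### Logistic Regression
from sklearn import preprocessing
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

K_FOLD = 3
logistic_regression = Pipeline([
    ('vect', text.TfidfVectorizer()),
    # preprocessing.Scale does not exist; StandardScaler is assumed here.
    # with_mean=False is required because the TF-IDF matrix is sparse.
    ('scale', preprocessing.StandardScaler(with_mean=False)),
    ('norm', preprocessing.Normalizer()),
    # saga supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='saga', max_iter=1000))
])
params = {
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.1, 0.5, 1, 5, 10]}
# X_train, y_train, X_test, y_test are assumed to be defined earlier in the notebook
grid_cv = GridSearchCV(logistic_regression, params, cv=K_FOLD)
grid_cv.fit(X_train, y_train)
print(f'Logistic Regression: \n{classification_report(y_test, grid_cv.predict(X_test))}')
print(f'Best Params: {grid_cv.best_params_}')
log_reg_params = grid_cv.best_params_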
I have it working in a notebook, with a 60% test F1 score. The features are the TF-IDF encoding of the raw concatenation of claimant, claim, and article content.
I just tried a fully connected (FC) layer on a subset of 200 data entries, split into train and test sets within those 200. I got 50% accuracy.
Process: raw text (preprocessed) -> TF-IDF -> FC -> prediction
Like I mentioned in the report (copied below):
This was a workshop paper submitted to a similar challenge. Instead of predicting whether a claim is true, their challenge was to identify a specific kind of fake news where the headline differs from the body text. Given a headline and a body text, they predict one of 4 labels ('agree', 'disagree', 'discuss', or 'unrelated').
- The model uses three input features: TF and TF-IDF features of a title and its body, and a cosine similarity score between the two. These feed a neural network with a fully connected layer and ReLU activations. They managed to achieve 81% accuracy.
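For reference, here is a minimal sketch of the cosine similarity feature between a headline and a body, using only the standard library. The tokenizer, the smoothed IDF formula, and the example texts are my own assumptions for illustration, not taken from the paper.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace tokenization, no stop-word removal
    return s.lower().split()


def term_frequencies(tokens):
    # Relative frequency of each token within one document
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()} if total else {}


def inverse_document_frequencies(docs_tokens):
    # Smoothed IDF: log((1 + n) / (1 + df)) + 1 (an assumed variant)
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}


def tfidf(tf, idf):
    # Weight each term frequency by its IDF
    return {t: w * idf.get(t, 0.0) for t, w in tf.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example texts
headline = "stocks rally after rate cut"
body = "markets saw stocks rally sharply after the central bank announced a rate cut"
unrelated = "the local team won the football match on saturday"

docs = [tokenize(d) for d in (headline, body, unrelated)]
idf = inverse_document_frequencies(docs)
vecs = [tfidf(term_frequencies(d), idf) for d in docs]

sim_related = cosine(vecs[0], vecs[1])    # headline vs. matching body
sim_unrelated = cosine(vecs[0], vecs[2])  # headline vs. unrelated body
print(sim_related, sim_unrelated)
```

A body that shares no tokens with the headline scores zero, which is what makes this a cheap signal for the 'unrelated' label.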
Possible reasons:
Our baseline model will be a multi-class logistic regression model on the concatenation of the articles' tokens and the claim's metadata.