rdspring1 / MISSION

MISSION: Ultra Large-Scale Feature Selection using Count-Sketches
Apache License 2.0

The loss doesn't decrease on the KDD12 dataset #3

Closed. YeWenting closed this issue 5 years ago.

YeWenting commented 5 years ago

Hi, I have run the code on KDD12, but the loss didn't decrease even after reading through all the data. I also tried increasing D to (1 << 26) - 1 (the paper mentions "The size of the MISSION and FH models were set to the nearest power of 2 greater than the number of features in the dataset"), but it didn't help.

Any suggestions would be greatly appreciated!

YeWenting commented 5 years ago

By the way, I noticed that the code only trains the model for a single epoch. Is that okay?

rdspring1 commented 5 years ago

Did you measure its AUC performance after 1 training epoch?

YeWenting commented 5 years ago

I cannot find the AUC score. All I can see in the output is many lines of "0.5". Should I somehow store the weights and call another script?

rdspring1 commented 5 years ago

Does the 0.5 refer to the loss function or the model's output? Also, did you tune the learning rate? You need to feed the model's output into the sklearn.metrics.roc_auc_score function.
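For reference, a minimal sketch of that call on toy data (the arrays here are illustrative values, not KDD12 output):

```python
import numpy as np
from sklearn import metrics

# Toy true labels and model scores (illustrative values only)
targets = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_auc_score takes true labels and predicted scores directly;
# metrics.auc instead expects curve points such as (fpr, tpr).
print(metrics.roc_auc_score(targets, scores))  # 0.75
```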

YeWenting commented 5 years ago

Could you provide the hyperparameters (for example, D, learning rate, TOPK) for KDD12 so that we can reproduce the results?

rdspring1 commented 5 years ago

File: mission_logistic.cpp
Hyperparameters:

Ln 22 - const size_t TOPK = (1 << 20) - 1;
Ln 28 - const size_t D = (1 << 26) - 1;
Ln 34 - const float LR = 5e-1;
Ln 46 - const size_t THREADS = 16;
Ln 166 - mp_queue<x_t> q(1000000);

Key Change: Ln 94. The KDD2012 dataset uses 0/1 labels, while other logistic regression datasets can use -1/1 labels. You need to change the label format to train correctly.

float label = (x.first + 1.0f) / 2.0f; // -1/1 labels - Old
float label = x.first; // 0/1 labels - New
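A small check of why this matters (the function names here are hypothetical, not from the MISSION source): the old conversion assumes -1/1 input labels, so feeding it KDD2012's 0/1 labels squashes every negative example to 0.5 instead of 0.

```python
def convert_neg1_pos1(y):
    # Old conversion: maps labels in {-1, 1} onto {0, 1}
    return (y + 1.0) / 2.0

def convert_01(y):
    # New conversion: 0/1 labels pass through unchanged
    return y

kdd12_labels = [0.0, 1.0]  # KDD2012 already uses 0/1
print([convert_neg1_pos1(y) for y in kdd12_labels])  # [0.5, 1.0] - wrong
print([convert_01(y) for y in kdd12_labels])         # [0.0, 1.0] - correct
```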

Results:
Training Time: 11m 46s
AUC Score: 0.750939

rdspring1 commented 5 years ago

AUC Score Script

import sys

import numpy as np
from sklearn import metrics

def auc(filename):
    target_list = list()
    pred_list = list()
    with open(filename) as f:
        for line in f:
            left, right = line.split()
            target_list.append(float(left))
            pred_list.append(float(right))

    targets = np.asarray(target_list)
    pred = np.asarray(pred_list)
    auc_score = metrics.roc_auc_score(targets, pred)
    print("AUC: {:.6f}".format(auc_score))

if __name__ == "__main__":
    filename = sys.argv[1]
    print('=' * 89)
    print("Testing", filename)
    auc(filename)
    print('=' * 89)

YeWenting commented 5 years ago

Thank you very much! It's super helpful!

haoransh commented 5 years ago

Hi, I'm wondering about the specific experiment settings. In the KDD2012 dataset, the original feature dimension is 54,686,452 (shown in Table 2 of your paper), yet you set const size_t D = (1 << 26) - 1 = 67,108,863. It seems that MISSION doesn't reduce the working memory, since the number of memory slots is even larger than the number of original features. Also, in the experimental settings paragraph of your Section 6.1, you mention that "The size of the MISSION and FH models were set to the nearest power of 2 greater than the number of features in the dataset." I'm confused about why the size of the CM structure can be set greater than the number of features. Why not just load the complete feature vector into memory in that case?

I think there may be something I'm missing. Could you help clarify this point? Thanks!

rdspring1 commented 5 years ago

I would consider the KDD2012 dataset a toy problem. D = 67,108,863 is roughly 268 MB of memory. A modern machine has gigabytes or terabytes of memory. It is trivial to allocate memory for all the features.
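A quick back-of-the-envelope check of these numbers (assuming one 4-byte float per memory slot, which is my assumption, not stated in the thread):

```python
# D from mission_logistic.cpp: (1 << 26) - 1 memory slots
D = (1 << 26) - 1
print(D)                  # 67108863 slots

bytes_needed = D * 4      # one 4-byte float per slot
print(bytes_needed / 1e6) # ~268.4 MB, matching the estimate above

# Sanity check of the paper's sizing rule: the nearest power of 2
# greater than KDD2012's 54,686,452 features is 2**26.
features = 54_686_452
print(2**25 < features < 2**26)  # True
```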

Now, consider the main problem of the paper - DNA Metagenomic classification.

Note: the 193-genome metagenomics task comes from a small reference database. The large reference database contains 774 genomes with even more unique strings.