jeffheaton / encog-java-core

http://www.heatonresearch.com/encog

Analyst sometimes fails to detect min/max #178

Open ekerazha opened 9 years ago

ekerazha commented 9 years ago

I'm looking at this Kaggle competition: http://www.kaggle.com/c/titanic-gettingStarted/data

This is the training file: http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv (CSV)

I tried to use this simple code (mostly taken from the NormalizeFile example):

import java.io.File;

import org.encog.Encog;
import org.encog.app.analyst.AnalystFileFormat;
import org.encog.app.analyst.EncogAnalyst;
import org.encog.app.analyst.csv.normalize.AnalystNormalizeCSV;
import org.encog.app.analyst.script.normalize.AnalystField;
import org.encog.app.analyst.wizard.AnalystWizard;
import org.encog.util.csv.CSVFormat;

public class Titanic {

    static final String TRAIN_FILE = "train.csv";
    static final String TRAIN_FILE_NORM = "train_norm.csv";

    // Print each field the analyst found, with its action and detected range.
    public static void dumpFieldInfo(EncogAnalyst analyst) {
        System.out.println("Fields found in file:");
        for (AnalystField field : analyst.getScript().getNormalize().getNormalizedFields()) {

            StringBuilder line = new StringBuilder();
            line.append(field.getName());
            line.append(",action=");
            line.append(field.getAction());
            line.append(",min=");
            line.append(field.getActualLow());
            line.append(",max=");
            line.append(field.getActualHigh());
            System.out.println(line.toString());
        }
    }

    public static void main(String[] args) {
        File trainFile = new File(TRAIN_FILE);
        File trainFileNorm = new File(TRAIN_FILE_NORM);

        // Let the wizard analyze the CSV (with headers) and pick an action per field.
        EncogAnalyst analyst = new EncogAnalyst();
        AnalystWizard wizard = new AnalystWizard(analyst);
        wizard.wizard(trainFile, true, AnalystFileFormat.DECPNT_COMMA);

        dumpFieldInfo(analyst);

        // Normalize the training file using the analyst's settings.
        final AnalystNormalizeCSV norm = new AnalystNormalizeCSV();
        norm.analyze(trainFile, true, CSVFormat.ENGLISH, analyst);
        norm.setProduceOutputHeaders(true);
        norm.normalize(trainFileNorm);

        Encog.getInstance().shutdown();
    }

}

Output is:

Fields found in file:
passengerid,action=Normalize,min=1.0,max=891.0
survived,action=OneOf,min=0.0,max=0.0
pclass,action=Equilateral,min=0.0,max=0.0
name,action=Ignore,min=0.0,max=0.0
sex,action=OneOf,min=0.0,max=0.0
age,action=Normalize,min=0.42,max=80.0
sibsp,action=Equilateral,min=0.0,max=0.0
parch,action=Equilateral,min=0.0,max=0.0
ticket,action=Ignore,min=0.0,max=0.0
fare,action=Normalize,min=0.0,max=512.3292
cabin,action=Ignore,min=0.0,max=0.0
embarked,action=Equilateral,min=0.0,max=0.0

Look at the "parch" column:

  1. It wants to normalize it as "Equilateral", but it's the "Number of Parents/Children Aboard", so I think "Normalize" would be a better choice (no problem, I can change this).
  2. It says that "min=0.0,max=0.0", but max should be about 5.

If you look at this line of the training file:

886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39,0,5,382652,29.125,,Q

the "parch" value is definitely 5.
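To cross-check the expected range without Encog, the column can be scanned directly. The following is a minimal sketch, not part of the original report: the class and method names are hypothetical, the header line is assumed to match the Kaggle file, and the splitter only handles the simple quoting seen in the row above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnRange {

    // Splits one CSV line on commas, treating commas inside double quotes as data.
    // Quote characters themselves are dropped from the returned cells.
    static List<String> splitCsvLine(String line) {
        List<String> cells = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                cells.add(cell.toString());
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        cells.add(cell.toString());
        return cells;
    }

    // Returns {min, max} of the named column (case-insensitive), skipping empty cells.
    // The first line is treated as the header.
    static double[] range(List<String> lines, String column) {
        List<String> header = splitCsvLine(lines.get(0));
        int index = -1;
        for (int i = 0; i < header.size(); i++) {
            if (header.get(i).trim().equalsIgnoreCase(column)) {
                index = i;
            }
        }
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (String line : lines.subList(1, lines.size())) {
            String value = splitCsvLine(line).get(index).trim();
            if (value.isEmpty()) {
                continue;
            }
            double v = Double.parseDouble(value);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[] {min, max};
    }

    public static void main(String[] args) {
        // Assumed header, followed by the single row quoted in this report.
        List<String> lines = Arrays.asList(
            "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
            "886,0,3,\"Rice, Mrs. William (Margaret Norton)\",female,39,0,5,382652,29.125,,Q");
        double[] r = range(lines, "parch");
        System.out.println("parch min=" + r[0] + ", max=" + r[1]);
    }
}
```

Passing the lines of the full train.csv to `range` in place of the inline sample gives the range that the analyst would be expected to report for the column.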

If I change the normalization method for the "parch" column from Equilateral to Normalize

analyst.getScript().getNormalize().getNormalizedFields().get(7).setAction(NormalizationAction.Normalize);

it still fails to detect the max value.

I also tried

analyst.getScript().getNormalize().setMissingValues(new MeanAndModeMissing());

because I thought the missing values might be preventing detection of the max value, but it makes no difference: I always get "min=0.0,max=0.0".

P.S. It also wants to normalize "survived" and "sex" as "OneOf", but we only have 2 values (0/1, male/female), so I think that "SingleField" normalization could also be a good choice (I can change this, however it uses 0/1 instead of the full -1/1 range for the SingleField values... I don't know if it works this way by design...).

VelkyTlustoch commented 7 years ago

I stumbled upon this issue while working on the same Kaggle competition (but using C#, so the issue is likely present there as well).

After some tinkering, I found out that it's caused by columns subject to normalization that contain solely integer values. When I tweaked the CSV accordingly (added .0 to the integer values, but I suspect you only need to change a single value like that for each of the offending columns for it to work), it worked like a charm again.
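The CSV tweak described above could be automated with something like the following. This is only a sketch of the workaround under stated assumptions, not a fix for the analyst itself: the class and method names are hypothetical, the column index (7 for "parch") is taken from the field order shown in the original report, and the splitter only copes with simple double-quoted cells.

```java
import java.util.ArrayList;
import java.util.List;

final class ForceDecimalColumn {

    // Splits a CSV line on commas outside double quotes, keeping each cell's raw
    // text (quotes included) so that joining the cells with commas reproduces the line.
    static List<String> splitRaw(String line) {
        List<String> cells = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
                cell.append(c);
            } else if (c == ',' && !inQuotes) {
                cells.add(cell.toString());
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        cells.add(cell.toString());
        return cells;
    }

    // Appends ".0" to a plain integer; any other cell (header, text, empty,
    // already-decimal value) is returned unchanged.
    static String forceDecimal(String cell) {
        return cell.matches("-?\\d+") ? cell + ".0" : cell;
    }

    // Rewrites a single zero-based column of one CSV line.
    static String rewriteLine(String line, int column) {
        List<String> cells = splitRaw(line);
        if (column >= 0 && column < cells.size()) {
            cells.set(column, forceDecimal(cells.get(column)));
        }
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        // The row quoted in the original report; column 7 is "parch".
        String row = "886,0,3,\"Rice, Mrs. William (Margaret Norton)\",female,39,0,5,382652,29.125,,Q";
        System.out.println(rewriteLine(row, 7));
    }
}
```

Applying `rewriteLine` to every line of train.csv for each affected column leaves the header and non-integer cells untouched, so the output can be fed to the same wizard call as before.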

I'm a bit busy at the moment (no time to trawl through the code and find the cause of this) and quite new to this Git stuff (I registered just to report this), so I probably won't be able to fix this, though.