janhohenheim / Hippocrates

No longer maintained, actually usable implementation of NEAT
GNU Affero General Public License v3.0

Add some tutorial for unsupervised learning #76

Open ddovod opened 7 years ago

ddovod commented 7 years ago

Hi! I'm having some trouble understanding the unsupervised learning API (the IBody class). Could you please provide some information about it? A tutorial section or documentation for this class would be nice! Thanks a lot!

jeremystucki commented 7 years ago

We are currently working on better documentation. For now, I have added a quick draft that should help you.

Feel free to ask any questions.

We will keep this issue open until we have a better documentation.
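
Until that documentation exists, the sketch below shows the general shape such a task wrapper takes in NEAT libraries. All names in it (the IBody methods, the toy task) are hypothetical illustrations rather than Hippocrates' actual API; the quick draft mentioned above is the authoritative reference.

#include <cmath>
#include <vector>

// Hypothetical illustration only; these names are NOT Hippocrates' actual API.
// In NEAT libraries, unsupervised (fitness-based) training evaluates each
// candidate network against a user-supplied "body" that wraps the task.
class IBody {
public:
    virtual ~IBody() = default;
    virtual void Reset() = 0;                                          // restart the task for a fresh evaluation
    virtual std::vector<float> ProvideNetworkWithInputs() const = 0;   // current sensor values
    virtual void Update(const std::vector<float>& networkOutputs) = 0; // apply the network's actions
    virtual bool HasFinishedTask() const = 0;                          // is this evaluation episode over?
    virtual double GetFitness() const = 0;                             // score used for NEAT selection
};

// A toy task: reward networks whose single output stays close to 0.5.
class ToyBody : public IBody {
public:
    void Reset() override { steps = 0; fitness = 0.0; }
    std::vector<float> ProvideNetworkWithInputs() const override { return {1.0f}; }
    void Update(const std::vector<float>& outputs) override {
        fitness += 1.0 - std::abs(outputs.at(0) - 0.5);
        ++steps;
    }
    bool HasFinishedTask() const override { return steps >= 100; }
    double GetFitness() const override { return fitness; }
private:
    int steps = 0;
    double fitness = 0.0;
};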

ddovod commented 7 years ago

Yes, this is what I'm looking for! I'm a complete beginner in these things, so any info is useful. And thank you guys for this project; it seems very clean and useful!

ddovod commented 7 years ago

Another question, about performance: is it normal for supervised learning to take a long time on the simple iris dataset on a Core i5 6600 CPU? I didn't wait for it to finish, but it ran for more than 15 minutes.

janhohenheim commented 7 years ago

Which dataset?
Can you send us your categorization code?

It definitely shouldn't take that long.

Mafii commented 7 years ago

@ddovod If you're using Visual Studio, did you compile in Debug or Release mode? Running Hippocrates in Debug mode will reduce its performance by a big margin. Just re-compile in Release mode and try it again.

ddovod commented 7 years ago

I'm using CMake, and after set(CMAKE_BUILD_TYPE Release) it runs faster on simple tasks, but on the iris dataset it still takes a long time. I'm using the classic dataset from here https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/datasets/iris.csv, just replacing the last column with numeric class labels. My code is:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hippocrates headers omitted here; IrisResult is an enum of the three
// iris classes, defined elsewhere in my project.
void loadIris(Training::Data<IrisResult>& trData, Training::Data<IrisResult>& testData)
{
    // Parse the comma-separated values of iris.csv into rows of floats.
    std::vector<std::vector<float>> data;
    std::ifstream file("iris.csv");
    std::string buf;
    while (std::getline(file, buf)) {
        data.push_back({});
        std::stringstream ss(buf);
        float val;
        while (ss >> val) {
            data.back().push_back(val);
            if (ss.peek() == ',')
                ss.ignore();
        }
    }

    // std::random_shuffle is deprecated in C++14 and removed in C++17,
    // so shuffle with an explicit engine instead.
    std::mt19937 rng{std::random_device{}()};
    std::shuffle(data.begin(), data.end(), rng);

    // 80/20 train/test split: all columns but the last are inputs,
    // the last column is the class label.
    for (std::size_t i = 0; i < data.size() * 0.8; i++) {
        Training::Data<IrisResult>::Set set;
        set.input = std::vector<float>(data[i].begin(), data[i].end() - 1);
        set.classification = static_cast<IrisResult>(std::round(data[i].back()));
        trData.AddSet(set);
    }
    for (std::size_t i = data.size() * 0.8; i < data.size(); i++) {
        Training::Data<IrisResult>::Set set;
        set.input = std::vector<float>(data[i].begin(), data[i].end() - 1);
        set.classification = static_cast<IrisResult>(std::round(data[i].back()));
        testData.AddSet(set);
    }
}

int main()
{
    Training::Data<IrisResult> trData;
    Training::Data<IrisResult> testData;
    loadIris(trData, testData);

    Training::NeuralNetworkTrainer trainer;
    auto champ = trainer.TrainSupervised(trData, 150);
    std::cout << "Finished training in " << trainer.GetGenerationsPassed() << " generations\n";
    std::cout << "Result: " << Tests::TestingUtilities::TestNetwork(champ, testData) << std::endl;
    return 0;
}

It has been running for 20 minutes or so and hasn't finished yet.

I've tried this code on a reduced version of this dataset (20 random objects for training, 10 for testing), and this is my output:

ddovod@/build: time ./neat
Finished training in 4693 generations
Result: 1

real    1m2.466s
user    0m42.032s
sys 0m3.176s

Maybe I'm doing something wrong? Thanks a lot!

janhohenheim commented 7 years ago

I'm gonna look at this in more detail today, but at first glance it seems that your inputs are not between -1.0 and 1.0, which is assumed by our library. The intended usage would be to divide by the theoretically highest value, as done here.
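
For illustration, that division could look like the sketch below (the bound of 10.0 is an assumption that happens to cover all of the iris feature values; it is not a library constant):

#include <vector>

// A minimal sketch of scaling raw inputs into the [-1.0, 1.0] range the
// library expects, by dividing by the theoretically highest value.
// The bound of 10.0 is an assumption chosen to cover all iris features.
std::vector<float> NormalizeInputs(const std::vector<float>& raw, float theoreticalMax = 10.0f)
{
    std::vector<float> scaled;
    scaled.reserve(raw.size());
    for (float value : raw)
        scaled.push_back(value / theoreticalMax);
    return scaled;
}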

But upon thinking about this, I've decided that it is not a clean solution and your code should be able to work as-is. I'm gonna change the lib accordingly in the next few hours (#77). I would appreciate it if you could wait a moment and not change your code, so you can beta test the new feature.

ddovod commented 7 years ago

Yes, of course, I can check it this evening.

ddovod commented 7 years ago

I divided all the values by 10.0, except the class labels, and it still takes a lot of time. For the reduced dataset (20 training / 10 testing) I get the following output:

ddovod@/build: time ./neat
Finished training in 3417 generations
Result: 1

real    0m48.722s
user    0m48.696s
sys 0m0.012s

Is it normal to need 3417 generations for this dataset? Maybe you have some reference numbers for classic problems, e.g. "for binary classification with 100 training objects, 500 generations should be enough"? That would be very helpful.

janhohenheim commented 7 years ago

Thank you for your feedback, know that it means a lot to us!

I have just updated the development branch so your original code without division should work.
Would you mind sharing your new results with us? Let's hope they're a bit faster this time :)

If there are no visible improvements I'm gonna implement #81 and then compare the results.

ddovod commented 7 years ago

Ok, it works much faster now! Reduced dataset (faster and without errors):

ddovod@/build: time ./neat 
Finished training in 285 generations
Result: 0

real    0m0.587s
user    0m0.584s
sys 0m0.000s

But with the full dataset (120/30) there are a lot of wrong answers (maybe it's overfitting; AFAIK ANNs overfit easily on linear classification tasks), though it runs significantly faster ("Result" here is the number of wrong predictions on the test data):

ddovod@/build: time ./neat 
Finished training in 1183 generations
Result: 28

real    0m20.655s
user    0m20.652s
sys 0m0.000s

There's one more issue, with the NeuralNetwork class. My compiler is g++-6.2, and I get a compilation error:

[ 68%] Building CXX object CMakeFiles/neat.dir/src/main.cpp.o
In file included from /home/ddovod/_private/_ml/practice/neat/src/main.cpp:3:
In file included from /home/ddovod/_private/_ml/practice/neat/Hippocrates/Tests/TestingUtilities/Sources/Headers/testing_utilities.hpp:7:
In file included from /home/ddovod/_private/_ml/practice/neat/Hippocrates/Core/Sources/Headers/training/neural_network_trainer.hpp:6:
/home/ddovod/_private/_ml/practice/neat/Hippocrates/Core/Sources/Headers/trained/classifier.hpp:11:17: error: call to implicitly-deleted default constructor of 'Hippocrates::Trained::NeuralNetwork'
        Classifier() : NeuralNetwork(){ };
                       ^
/home/ddovod/_private/_ml/practice/neat/Hippocrates/Core/Sources/Headers/trained/neural_network.hpp:6:23: note: default constructor of 'NeuralNetwork' is implicitly deleted because base class 'Phenotype::NeuralNetwork' has no default constructor
class NeuralNetwork : public Phenotype::NeuralNetwork {
                      ^
1 error generated.

It can be fixed by adding a default constructor, and then it works fine.
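
For readers who hit the same error, the standalone sketch below reproduces the C++ rule involved and the one-line fix; the class names mirror the error message, but the bodies are illustrative rather than the library's actual code.

#include <utility>
#include <vector>

namespace Phenotype {
class NeuralNetwork {
public:
    explicit NeuralNetwork(std::vector<float> genome) : genome(std::move(genome)) {}
    // The fix: without this line the class has no default constructor,
    // which implicitly deletes the default constructors of everything
    // derived from it.
    NeuralNetwork() = default;
private:
    std::vector<float> genome;
};
}

namespace Trained {
class NeuralNetwork : public Phenotype::NeuralNetwork {
    // Inherits default constructibility from the base class.
};

class Classifier : public NeuralNetwork {
public:
    Classifier() : NeuralNetwork() {}  // compiles only if the base chain is default-constructible
};
}

int main()
{
    // Fails with "call to implicitly-deleted default constructor"
    // if the defaulted base constructor above is removed.
    Trained::Classifier classifier;
    (void)classifier;
    return 0;
}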

ddovod commented 7 years ago

Sorry, my bad; here is the correct full dataset result:

ddovod@/build: time ./neat 
Finished training in 439 generations
Result: 0

real    0m0.961s
user    0m0.936s
sys 0m0.000s

Looks like it works fine! Thanks a lot! I will experiment with it further and maybe ask some dumb questions here, is that okay? :)

janhohenheim commented 7 years ago

Oh wow, these results really make me proud :)

I will look into adding your dataset as an integration test.
Is it fine if I use some of the code that you provided in your snippet?

We are all more than happy for you to experiment around and ask silly questions. We still haven't invited beta testers, so we need a lot of beginner feedback.
If you have any questions about the usability, or find parts of the library confusing, please ask.

jeremystucki commented 7 years ago

Thank you for helping us.

This test looks ideal for the project. I would like to use your code in our tests, if you are ok with that.

You could also open a pull request if you want to add the test yourself.

ddovod commented 7 years ago

Yes, sure, thanks a lot :) I'm seeing some strange things and will be investigating them, so I'll return with results a bit later this week. Your project is very interesting to me; it's almost the only maintained NEAT-related project on GitHub, so I'll be glad if I can be useful to it. I can open a pull request with the iris dataset and a related classification test tomorrow.

ddovod commented 7 years ago

OK guys, another question. Why did you restrict this library to C++1z only? There aren't many places where it's really needed; maybe C++14 would be enough, and it has good support in GCC and Clang (and libc++).

ddovod commented 7 years ago

Here are just a few things I cannot use to work with Hippocrates:

ddovod commented 7 years ago

And maybe it was my mistake, but it still spends a lot of time on my initial problem. I don't know why, but today I cleaned and compiled the test again, and it uses a lot of time and memory: https://travis-ci.org/ddovod/Hippocrates/jobs/179871731. I have no idea about the reason.

janhohenheim commented 7 years ago

As you can see here, C++1z is just as well supported.
libc++ does indeed support experimental; you just have to build it yourself and provide a certain build parameter, which is a pain. Perhaps we will change logging and JSON parsing to use well-tested libraries that do not use the experimental TS, which would eliminate those dependencies.

CLion's syntax highlighting has been out of date for 3 years, and supporting it is not planned, as that would bottleneck our coding style heavily.

ddovod commented 7 years ago

Okay, it's not a big problem, I think. I just compiled and ran all the tests with AddressSanitizer (clang-3.9 + libstdc++) and they seem OK. I want to compare the results of the iris problem solution with this library. I'll post the results here soon.
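
(For anyone reproducing this setup: AddressSanitizer is enabled through Clang's standard -fsanitize=address flag. The exact invocation below is an assumption, not the project's documented build line.)

# Assumed build commands; adjust paths and binary names to your checkout.
cmake -DCMAKE_CXX_COMPILER=clang++-3.9 \
      -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" ..
make && ./neat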