Optimizing as-moses: Reports

Eman22S commented 4 years ago

Overview

This report covers the comparisons made between moses and asmoses, and the results found, following issue #69.

Benchmark

For comparison we used the demo problems, datasets found in the unit tests, and two external datasets.

Results of the comparison on the demo problems (multiplexer problem):

[Screenshot: part4]

Results of the comparison on the unit-test datasets (iris.data and IrisSetosa.data):

[Screenshot: irse-screen-shot]

While running these datasets I noticed two issues.

First, Moses fails to run asmoses -i datasets/IrisSetosa.data -m10000 -uCLASS if the columns in IrisSetosa.data are rearranged. If the columns in iris.data are rearranged, however, asmoses -i datasets/iris.data -m10000 -uclass runs with no error. Note that the target class in IrisSetosa is boolean, whereas in iris.data it is an enum. Although our task is simply to compare the performance of moses with and without the --atomspace-port=1 flag whenever the command works, it may be worth reporting inconvenient failure cases like this one.

Second, neither dataset ran when the --atomspace-port=1 flag was added. The error is returned by the combo_atomese converter, which does not support the greater-than-zero operator yet. For the time being we are binarizing all columns in the datasets to boolean, to trick moses into not generating that operator.
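
The thread does not say how the columns were binarized, so the following is only a minimal sketch of one possible scheme, not the script that was actually used. It assumes numeric columns are thresholded at their median and all other columns are one-hot encoded, while the target column is copied unchanged. The function names and the generated column names (`_gt_median`, `_is_<value>`) are hypothetical.

```python
import csv
import io
from statistics import median


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def binarize_rows(rows, target):
    """Turn every non-target column into 0/1 columns.

    Assumed scheme (not taken from the thread):
    - numeric column: 1 if the value is above the column median, else 0
    - any other column: one 0/1 column per distinct value (one-hot)
    - target column: copied unchanged
    """
    if not rows:
        return []
    out = [dict() for _ in rows]
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        if col == target:
            for o, v in zip(out, values):
                o[col] = v
        elif all(_is_number(v) for v in values):
            cut = median(float(v) for v in values)
            for o, v in zip(out, values):
                o[f"{col}_gt_median"] = 1 if float(v) > cut else 0
        else:
            for level in sorted(set(values)):
                for o, v in zip(out, values):
                    o[f"{col}_is_{level}"] = 1 if v == level else 0
    return out


def binarize_csv(text, target):
    """Parse CSV text with a header row and binarize its columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return binarize_rows(rows, target)
```

With every feature reduced to 0/1, no feature column is continuous, which is the property the workaround described above relies on.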

Results of the comparison on the external datasets (Cowles.data and Melanoma.csv):

[Screenshot: external-dataset]

Again, we binarized these two datasets into all-boolean columns, since moses cannot interpret their original form, with or without --atomspace-port=1. The commands run with no error, but the programs generated are entirely different between moses with and without --atomspace-port. We assume there is a logic error or misunderstanding in the code introduced when porting Moses to asmoses. We are looking into it.

The entire log file can be found here: https://github.com/Eman22S/asmoses/blob/population_branch/scripts/benchmark/asmoses-bench.log

The log file was generated by https://github.com/Eman22S/asmoses/blob/population_branch/scripts/benchmark/mb-example.sh and https://github.com/Eman22S/asmoses/blob/population_branch/scripts/benchmark/asmoses-bm.sh

ngeiswei commented 4 years ago

Thanks @Eman22S that's a very useful report.

Could you explain how you obtained the Cowles.data and Melanoma.csv data sets, and how to access them, if possible?

I believe --store-atomspace=1 is the default, which is consistent with the fact that there's no substantial difference in your benchmark. You should replace it by --store-atomspace=0. Oh, actually it's not even enabled in the C++ code! See

https://github.com/singnet/asmoses/blob/a4d84fec66262cefdc9a2399e575b8280fd1556b/opencog/moses/representation/instance_scorer.cc#L80-L82

I have forgotten why it was disabled; I believe it was failing on CircleCI for some unknown reason. Anyway, at this stage it should be re-enabled.

Eman22S commented 4 years ago

@ngeiswei Cowles and Melanoma are datasets from the Rdatasets collection: https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/MASS/Melanoma.csv and https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/carData/Cowles.csv. Of course, we did not use the original files, because Moses simply can't run them, so we binarized them. The binarized datasets can be found here: https://github.com/Eman22S/asmoses/blob/population_branch/scripts/benchmark/datasets/

As for the commented-out code in instance_scorer.cc, as you said, it fails on several computers and works on others (I believe it worked on your computer). We were just discussing with bitseat how to reproduce your environment so that we can better understand what's going on.

Eman22S commented 4 years ago

Update

Moses port running on the Fi_Miller_et_al14_upd dataset: https://data.giss.nasa.gov/modelforce/Miller_et_2014/Fi_Miller_et_al14_upd.txt

[Screenshot: fimiller]

Please note that the dataset used is the binarized form of Fi_Miller_et_al14_upd, renamed fimiller.csv.

ngeiswei commented 4 years ago

Thanks @Eman22S. It would be good to understand why there is such a slowdown, especially on such a small dataset. I suppose it would be good to profile these two runs, maybe with valgrind if it doesn't blow up the RAM.
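
As a rough sketch of what profiling the two runs could look like, the helper below builds one valgrind command line per variant, with and without --atomspace-port=1. The choice of valgrind's callgrind tool, the output file names, and the function name are assumptions, not something stated in this thread; the asmoses flags are the ones already quoted above. It needs Python 3.8 or later for `shlex.join`.

```python
import shlex


def profiled_commands(dataset, target, max_evals=10000):
    """Build callgrind command lines for both asmoses variants.

    dataset and target are supplied by the caller; the callgrind output
    file names are hypothetical and only chosen to keep the runs apart.
    """
    base = ["asmoses", "-i", dataset, f"-m{max_evals}", f"-u{target}"]
    variants = {
        "plain": base,
        "atomspace-port": base + ["--atomspace-port=1"],
    }
    return {
        name: [
            "valgrind",
            "--tool=callgrind",
            f"--callgrind-out-file=callgrind.{name}.out",
        ] + cmd
        for name, cmd in variants.items()
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for name, cmd in profiled_commands("datasets/IrisSetosa.data", "CLASS").items():
        print(f"{name}: {shlex.join(cmd)}")
```

Each resulting profile can then be summarized with callgrind_annotate and the two summaries compared side by side. Runs under callgrind are much slower than native runs, so a smaller evaluation budget than 10000 may be needed.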

Also, I would suggest that you create a commit for each experiment, containing the command line and the dataset used, and push these commits to a feature branch on singnet/asmoses, called something like atomspace-port-experiments (I think you should have the rights, let me know otherwise). Then here, on the GitHub issue, alongside the results, you include the commit hash of the experiment. This makes it possible to reproduce the experiments and faithfully compare them if needed.
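
One possible way to make each experiment commit self-describing is to store a small manifest next to the dataset. The sketch below is only an illustration of that idea: the record layout, field names, and function names are hypothetical and not an existing convention in singnet/asmoses.

```python
import hashlib
import json


def experiment_record(command, dataset_bytes, result_summary):
    """Bundle what is needed to re-run and check one experiment.

    command: the exact command line that was run, as a string
    dataset_bytes: the raw contents of the dataset file
    result_summary: free-form text describing the observed result
    """
    return {
        "command": command,
        # The hash lets a reader confirm they are using the same dataset file.
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "result": result_summary,
    }


def to_manifest(record):
    """Serialize the record as stable, diff-friendly JSON."""
    return json.dumps(record, indent=2, sort_keys=True) + "\n"
```

Committing the manifest together with the dataset means the commit hash quoted in the issue points at everything needed to repeat the run.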

Eman22S commented 4 years ago

Thanks for your comment @ngeiswei. Besides the obvious poor performance on these datasets, as I pointed out earlier, the candidate programs generated by moses and asmoses are not comparable. asmoses seems to produce plain float numbers, as opposed to the programs with operators and nested operators that moses produces for a given dataset. That is undoubtedly a non-trivial problem that should be looked into. My suggestion would be to look into the code that produces this erroneous output, as it is likely contributing to the slowdown as well.

I think we can do that while also using valgrind to examine which code uses the most resources. We can rewrite that code if necessary while optimizing it with valgrind's help.
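
To make the difference between the two sets of candidates concrete, a small script can diff the outputs and flag candidates that are bare numbers. This is a sketch under one assumption that should be checked against the real logs: that each result line has the form "<score> <program>". The function names are hypothetical, and the sample lines in the usage note are made up for illustration, not taken from asmoses-bench.log.

```python
def parse_candidates(output):
    """Parse lines assumed to look like '<score> <program>' into pairs."""
    pairs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        score, _, program = line.partition(" ")
        try:
            pairs.append((float(score), program.strip()))
        except ValueError:
            continue  # skip lines that do not start with a numeric score
    return pairs


def is_bare_constant(program):
    """True if the whole candidate is just a number, with no operator."""
    try:
        float(program)
        return True
    except ValueError:
        return False


def compare_runs(first_out, second_out):
    """Summarize how the candidate programs of two runs differ."""
    first = {p for _, p in parse_candidates(first_out)}
    second = {p for _, p in parse_candidates(second_out)}
    return {
        "shared": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
        "constants_in_second": sorted(p for p in second if is_bare_constant(p)),
    }
```

For example, calling `compare_runs("0 and($1 $2)\n-1 or($1 $3)\n", "-3 0.5\n-1 or($1 $3)\n")` lists `or($1 $3)` as shared and `0.5` as a bare constant in the second run. If the two runs print candidates in different notations, the programs would need to be normalized before a textual comparison like this is meaningful.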

ngeiswei commented 4 years ago

Oh, indeed, at this point of the port the behaviors should be identical, so yes, it would be good to understand why the candidates are different first.
