matthewjwoodruff / pareto.py

Nondominated sorting for multi-objective problems
GNU Lesser General Public License v3.0
113 stars 39 forks source link

Are there numpy version issues? #5

Closed matthewjwoodruff closed 11 years ago

matthewjwoodruff commented 11 years ago

Probably not, since we don't use advanced features. But check.

jdherman commented 11 years ago

I was thinking ... do we even need numpy? We aren't really doing any matrix operations. Numpy is only being used for reading in the files and storing the values in a matrix.

An option might be to read the input files line by line into pure lists. (How inefficient are lists of floats compared to Numpy arrays?). This would give two advantages: (1) columns could contain non-floats, we'd just ignore them ... and (2) the input file doesn't have to fit in memory (although if this is a problem, there are probably going to be other problems too...). Thoughts?

matthewjwoodruff commented 11 years ago

I think we should try it and find out :) I don't know for sure, but I'm guessing numpy has some super-duper optimized C IO routines. On the other hand, we're paying for that with the (non-zero) time it takes to import numpy, plus the inflexibility you mentioned. Certainly with super-large sets, loading the whole table is a liability -- no matter how efficiently numpy uses memory, too much data will trigger page faults, and that's the end of performance. And not being able to handle text columns is a major bummer. If we can have text columns come along for the ride, we can do neat stuff like getting contribution for free (just have a column saying what MOEA produced the solution.) I think the Archive class should work with text columns as written.

matthewjwoodruff commented 11 years ago

So my testing wasn't scientific, but I stopped using numpy and things got maybe 2x faster. (Commit 69288777b0dccd52b8a48a7afd1806de5979616c) I used generator functions to avoid pre-loading the tables, although we do get file handles for all of the inputs right at the start. A big advantage is that we only convert the objectives. Also, while we use objectives and boxes as a proxy for sorting, we retain the original text representation of the rows, so when we write back, we write exactly the same numbers that were in the input. This means that

sort myoutput.txt > a
sort three.ref > b
diff a b

will actually tell you that the two sets are the same!

I'm going to open a new issue -- because we don't validate input until we try to sort it, we could get well into the sort and then crash. We should save partial results.