iter_confusion_matrices a bottleneck

ntamas / yard

Yet another ROC curve drawer

MIT License


iter_confusion_matrices a bottleneck #4

Open yasirs opened 13 years ago

yasirs commented 13 years ago

I am experimenting with very large datasets (~1e6 to 1e7 points). It seems that storing the data as (threshold, label) tuples and then computing the measures and confusion matrices in pure Python is much, much slower than keeping the data in NumPy arrays (where available) and doing vectorized operations on the arrays. I don't know if there is interest in something like this.

I might attempt to implement something like that to be able to handle the large datasets.
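
For illustration, here is a minimal sketch of what the vectorized approach could look like. It is not code from yard: the function name `confusion_counts` and its return layout are hypothetical, it assumes NumPy is installed and the input is non-empty, and it assumes a point is predicted positive when its score is at or above the threshold. Instead of looping over tuples once per threshold, it sorts once and uses cumulative sums to get the counts for every distinct threshold in one pass.

```python
import numpy as np


def confusion_counts(scores, labels):
    """Return (thresholds, tp, fp, fn, tn) for every distinct score.

    Hypothetical helper, not part of yard. Assumes non-empty input and
    that a point is predicted positive when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Sort by descending score (stable sort keeps ties in input order).
    order = np.argsort(-scores, kind="mergesort")
    scores = scores[order]
    labels = labels[order]

    # Running totals of positives and negatives seen so far.
    tp_cum = np.cumsum(labels)
    fp_cum = np.cumsum(~labels)

    # Keep only the last index of each run of tied scores, so that
    # every point with score >= threshold is included in the counts.
    last = np.r_[np.nonzero(np.diff(scores))[0], scores.size - 1]

    thresholds = scores[last]
    tp = tp_cum[last]
    fp = fp_cum[last]
    fn = tp_cum[-1] - tp  # positives that fall below the threshold
    tn = fp_cum[-1] - fp  # negatives that fall below the threshold
    return thresholds, tp, fp, fn, tn


thresholds, tp, fp, fn, tn = confusion_counts(
    [0.9, 0.8, 0.8, 0.3], [1, 0, 1, 0]
)
```

The cost is dominated by one O(n log n) sort, after which all confusion matrices come from array arithmetic rather than a Python-level loop per threshold.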

ntamas commented 13 years ago

I would definitely be interested in a NumPy-based solution. I'd suggest you start working on it in a separate branch in your fork and then file pull requests when there's something to be merged.

Also, I'd be glad if you could keep the NumPy-based version API-compatible with the original one as much as possible. In the end, I would like to have a version which works with NumPy if that is installed, but which can also live without NumPy.
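
One common way to get "works with NumPy if installed, but can live without it" is an optional import with a pure-Python fallback behind the same function signature. The sketch below only illustrates that pattern; `confusion_matrix_at` is a hypothetical name, not an existing yard function, and it assumes a score at or above the threshold counts as a positive prediction.

```python
try:
    import numpy as np  # optional dependency
except ImportError:
    np = None


def confusion_matrix_at(scores, labels, threshold):
    """Return (tp, fp, fn, tn) for one threshold.

    Hypothetical helper: the caller sees the same signature and result
    whether or not NumPy is available.
    """
    if np is not None:
        # Fast path: boolean masks over whole arrays.
        s = np.asarray(scores, dtype=float)
        y = np.asarray(labels, dtype=bool)
        pred = s >= threshold
        tp = int(np.count_nonzero(pred & y))
        fp = int(np.count_nonzero(pred & ~y))
        fn = int(np.count_nonzero(~pred & y))
        tn = int(np.count_nonzero(~pred & ~y))
        return tp, fp, fn, tn

    # Fallback path: plain Python loop, no third-party imports.
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        if score >= threshold:
            if label:
                tp += 1
            else:
                fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn
```

Both paths return plain Python integers, so code that consumes the result does not need to know which one ran.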

yasirs commented 13 years ago

Yeah, that's what I am trying to do. I have a branch 'fastnumpy' where I am writing this, and it will definitely run without NumPy. The user-facing API, like the data and curve inits, will stay compatible. However, I am trying to move away from the tuple-based storage to independent rows, which will change some public method signatures, though mostly ones that most users don't call.
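
As a rough illustration of the storage change, assuming "independent rows" means keeping the scores and the labels in two parallel sequences instead of one list of tuples, the conversion could look like the sketch below. `split_columns` is a hypothetical name and is not taken from the 'fastnumpy' branch.

```python
def split_columns(pairs):
    """Turn [(score, label), ...] into two parallel lists.

    Hypothetical helper showing tuple-based storage converted to
    column-wise storage; the resulting lists can be handed to
    NumPy (if available) without touching each tuple again.
    """
    pairs = list(pairs)
    if not pairs:
        return [], []
    scores, labels = zip(*pairs)
    return list(scores), list(labels)


scores, labels = split_columns([(0.9, 1), (0.8, 0), (0.3, 0)])
```

A constructor that still accepts tuples could do this split once on input, which would keep the user-facing call unchanged while the internals work on columns.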

ntamas commented 13 years ago

Great! Keep me posted.