sethjuarez / numl

Machine Learning for .NET
http://numl.net
MIT License
430 stars 104 forks

Using StringFeature Kills Application #16

Closed: ghost closed 9 years ago

ghost commented 9 years ago

Doing anything with the library that uses StringFeature causes the application to hang, recursively allocating memory until it has consumed all system memory (if you allow it to), regardless of the data set length, string size, trainer parameters, etc.

ghost commented 9 years ago

I should elaborate. I'm attempting document classification with this library, and it seems that even learning from a small sample set consumes huge amounts of memory. I'll do some profiling to try and pin down where, but I think I might also attempt to implement some sort of on-disk caching so the operation isn't done entirely in memory.

sethjuarez commented 9 years ago

Ruh roh - might need to do some optimization. Do you have sample data + code I could have a go at?

ghost commented 9 years ago

@sethjuarez Hey, sorry, I somehow didn't get the notification that you replied. Umm, unfortunately my training data is several gigs. lol. But yeah, I wasn't stuffing all of those 8+ gigs into the library when I was getting this issue; it was actually a relatively small sample set. Anyway, I ended up switching to Java to actually build out my model using openNLP.

sethjuarez commented 9 years ago

Awesome. As long as you got your stuff to work that is perfect.

normanhh3 commented 9 years ago

Was just looking through the library. I see HammingDistance is implemented; did I miss Levenshtein distance somewhere, or is that one you haven't implemented yet? Have you not found it as useful a measure? Just curious.

sethjuarez commented 9 years ago

Absolutely useful when dealing with strings. In the case of the lib, I pre-convert everything to numbers. Any ideas where we could add this?

normanhh3 commented 9 years ago

After looking at the library in more detail, how does adding a LevenshteinDistance method to the numl.Utils.StringHelpers class sound?

It could be used to create a [Feature] holding the edit distance, which could then be used for training?
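
For readers unfamiliar with the measure, here is a minimal sketch of the edit distance being proposed. It is written in Python for illustration and is not numl code; the name `levenshtein_distance` is hypothetical and does not exist in the library. Unlike Hamming distance, it handles strings of different lengths.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string as `b` so each row uses O(min(len)) memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current

    return previous[-1]
```

Only two rows of the dynamic-programming table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.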

sethjuarez commented 9 years ago

Conversion works in a row-by-row (or class-by-class) manner. In other words, I would need to compare the [Feature] in question with something else in order to get a resulting number. Maybe we can open a new issue?
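
The constraint above is that a per-row conversion sees only one string, while a distance needs two operands. One way to picture a workaround is to fix a reference string up front and measure every row against it. The sketch below is in Python and only illustrates that idea; it is not how numl converts features, and `distance_feature`, `hamming_distance`, and the sample data are all hypothetical.

```python
from typing import Callable, Iterable, List


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def distance_feature(
    values: Iterable[str],
    reference: str,
    distance: Callable[[str, str], int],
) -> List[float]:
    """Convert each row's string to a number: its distance to `reference`."""
    return [float(distance(value, reference)) for value in values]


# Hypothetical rows, each carrying one string-valued field.
codes = ["ABCD", "ABCE", "XBCE"]
features = distance_feature(codes, "ABCD", hamming_distance)
print(features)  # [0.0, 1.0, 2.0]
```

Any two-argument string distance could be passed in place of `hamming_distance`; the open design question from the thread is where the reference string would come from.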