Closed by ghost 9 years ago
I should elaborate. I'm attempting document classification with this library, and it seems that even learning from a small sample set consumes huge amounts of memory. I'll do some profiling to try to pin down where, but I think I might also attempt to implement some sort of on-disk caching so the operation isn't done entirely in memory.
Ruh roh - might need to do some optimization. Do you have sample data + code I could have a go at?
@sethjuarez Hey, sorry, somehow I didn't get the notification that you replied. Umm, unfortunately my training data is several gigs. lol. But yeah, I wasn't stuffing all 8+ gigs into the library when I was getting this issue; it was actually a relatively small sample set. Anyway, I ended up switching to Java to actually build out my model using openNLP.
Awesome. As long as you got your stuff to work, that is perfect.
I was just looking through the library and I see HammingDistance is implemented. Did I miss Levenshtein distance somewhere, or is that one you haven't implemented yet? Have you not found it as useful a measure? Just curious.
Absolutely useful when dealing with strings. In the case of the lib, I pre-convert everything to numbers. Any ideas where we could add this?
After looking at the library in more detail, how does adding a LevenshteinDistance method to the numl.Utils.StringHelpers class sound?
It could be used to create a [Feature] whose value is the edit distance, which could then be trained on?
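Roughly what I have in mind is the classic two-row dynamic programming version. This is just a sketch; the placement in numl.Utils.StringHelpers and the exact signature are only a suggestion:

```csharp
using System;

// Sketch only: the class name / placement (numl.Utils.StringHelpers) follows the suggestion above.
public static class StringHelpers
{
    // Classic two-row dynamic programming Levenshtein (edit) distance.
    public static int LevenshteinDistance(string s, string t)
    {
        if (string.IsNullOrEmpty(s)) return string.IsNullOrEmpty(t) ? 0 : t.Length;
        if (string.IsNullOrEmpty(t)) return s.Length;

        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];

        for (int j = 0; j <= t.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution
            }
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[t.Length];
    }
}
```

So `StringHelpers.LevenshteinDistance("kitten", "sitting")` would come out to 3, and that number is what I'd picture feeding into a [Feature].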
Conversion works in a row-by-row (or class-by-class) manner. In other words, I would need to compare the [Feature] in question with something else in order to get a resulting number. Maybe we can open a new issue?
Doing anything with the library that uses StringFeature causes the application to sit there recursively allocating memory until it has consumed all system memory (if you allow it to), regardless of data set size, string size, trainer parameters, etc.
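To give a concrete picture, here is a minimal sketch of the kind of setup being described. The Document class, the sample data, and the choice of DecisionTreeGenerator are placeholders, and the attribute/namespace names are assumed from the numl samples, so they may differ from the exact version in use:

```csharp
using System.Collections.Generic;
using numl.Model;                     // assumed namespace for Descriptor and the attributes
using numl.Supervised.DecisionTree;   // assumed namespace for DecisionTreeGenerator

// Placeholder model: the only relevant part is the StringFeature-decorated property.
public class Document
{
    [StringFeature]
    public string Text { get; set; }

    [Label]
    public bool Relevant { get; set; }
}

class Repro
{
    static void Main()
    {
        // Tiny fixed sample set; data set size does not appear to matter.
        var docs = new List<Document>
        {
            new Document { Text = "the quick brown fox", Relevant = true  },
            new Document { Text = "lorem ipsum dolor",   Relevant = false }
        };

        var descriptor = Descriptor.Create<Document>();
        var generator = new DecisionTreeGenerator();

        // Memory use reportedly climbs here until all system memory is consumed.
        var model = generator.Generate(descriptor, docs);
    }
}
```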