Open sr258 opened 4 years ago
Now that we are close to the 1.0 release, I've done some work on this. validForms is very likely the worst case we will have to handle, since it involves a lot of database I/O, and this is not a database system. It's probably a good idea to add an additional cache to the index system at the very least -- this could make I/O significantly faster. On a modern system we could put virtually the whole of WordNet in RAM anyway. But it'd be nice to get the worst case for each word lookup down to 2 or so disk requests, rather than the 10-15 we probably have right now.
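To make the index-cache idea concrete, here is a minimal sketch of a bounded LRU cache placed in front of an index read. It is not the library's actual code: `LruCache`, `withIndexCache`, and `readIndexEntry` are hypothetical names, and the capacity of 10000 is an arbitrary assumption.

```javascript
// Minimal LRU cache. A Map iterates keys in insertion order, so the first
// key is always the least recently used one.
class LruCache {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.entries = new Map();
  }

  has(key) {
    return this.entries.has(key);
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}

// Wraps a hypothetical readIndexEntry(word) function (whatever actually
// hits the disk) so that repeated words are served from memory.
function withIndexCache(readIndexEntry, capacity = 10000) {
  const cache = new LruCache(capacity);
  return function cachedRead(word) {
    if (cache.has(word)) return cache.get(word);
    const entry = readIndexEntry(word);
    cache.set(word, entry);
    return entry;
  };
}
```

If `readIndexEntry` returns a promise, the promise itself is what gets cached, so concurrent requests for the same word would share one disk read. A rejected promise would stay cached too, so a real implementation would probably want to evict on failure.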
A simple lookup on each word isn't too bad, especially with the query cache system in place. We're under a millisecond per word when reading a fairly large block of text with our current naive cache. But I accept that we can do much better with some modest improvements.
So I'm going to break this into a few separate issues for future tracking.
I want to use the library to get base forms of tokens in long texts. I call validFormsAsync and the results seem ok. While the library works in general, access is relatively slow. I seem to get a throughput of about 40 lookups per second. Is this normal or am I doing something wrong?
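For the long-text use case, one caller-side option is to look up each distinct token only once and reuse the result for repeats. The sketch below assumes a `lookup` function that takes a word and returns a promise of its forms (for example, a thin wrapper around validFormsAsync). `baseFormsForTokens` is a hypothetical helper, and lowercasing tokens before lookup is an assumption that may not suit every text.

```javascript
// Resolve base forms for a list of tokens, calling `lookup` once per
// distinct (lowercased) token instead of once per occurrence.
// `lookup` is supplied by the caller: (word) => Promise<result>.
async function baseFormsForTokens(tokens, lookup) {
  const pending = new Map(); // normalized token -> promise of its forms
  for (const token of tokens) {
    const key = token.toLowerCase();
    if (!pending.has(key)) pending.set(key, lookup(key));
  }

  // Promise.all attaches handlers to every promise up front and keeps the
  // results in the same order as the Map's keys.
  const keys = [...pending.keys()];
  const values = await Promise.all(pending.values());
  const resolved = new Map(keys.map((key, i) => [key, values[i]]));

  // One result per input token, in the original order.
  return tokens.map((token) => resolved.get(token.toLowerCase()));
}
```

Note that this starts every distinct lookup at once. For very large texts it may be worth processing the tokens in chunks so the number of simultaneous file reads stays bounded.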