ropensci / hunspell

High-Performance Stemmer, Tokenizer, and Spell Checker for R
https://docs.ropensci.org/hunspell
Other
109 stars 44 forks source link

Calling functions is very slow - add dictionary caching and mapping #11

Closed ireyoner closed 8 years ago

ireyoner commented 8 years ago

When calling functions even with one word it is very slow - in C++ dictonary class is created with every function call, that is very time consuming.

If you create dictionary on first call only once for each language and then pass this class to functions it will be faster.

I have multiple ~1000 short texts and calling hunspell_stem on each takes ~1.4 sec. After modifying your function code to take list of vectors as argument:

List R_hunspell_stem(Rcpp::String affix, CharacterVector dict, List texts){

  //init with affix and at least one dict
  hunspell_dict mydict(affix, dict);
  List out;
  for(int i = 0; i < texts.length(); i++){
    StringVector words = texts[i];
    List out2;
    for(int j = 0; j < words.length(); j++)
      out2.push_back(mydict.stem(words[j]));
    out.push_back(out2);
  }
  return out;
}

Execution time takes ~10 sec for all texts instead of ~24 min.

But then I need to call hunspell_suggest or hunspell_analyze depending on text and it again takes ~1.4 sec for each call.

jeroen commented 8 years ago

Right that makes sense. Can you send a PR?

ireyoner commented 8 years ago

By PR do you mean Pull request? If so, then it is not possible for me for now, I will do it next week, but although I know what to fix, I have no idea how it should be implemeted, must look for solution in Iternet.

I was also looking through other issues and came into conclusion that issue #7 "hunspell_suggest is slow" has the same problem as this reported by me.

ireyoner commented 8 years ago

I think I fixed it. New version is working for me when I call document(), but I get the same error as in issue #12. Sended a Pull request with my changes.

jeroen commented 8 years ago

Thanks a lot for your thoughts and suggestions. Your PR had some nice ideas, but I prefer a lighter and more simple solution.

I added a solution which makes dictionaries a separate object and automatically caches them via memoise. Let me know if this solution works for you!

ireyoner commented 8 years ago

I think it is still not working - after installing from cran via install.packages("hunspell") and testing using microbenchmark(hunspell_info()) I got:

Unit: seconds
            expr      min       lq     mean   median       uq      max neval
 hunspell_info() 13.40211 13.84603 13.99256 13.98134 14.16155 14.57229   100

but I see that those changes are not on cran yet.

Then after installing from gihub via install_github("ropensci/hunspell") and trying to call hunspell_info() i got first error:

Error in .Call("hunspell_R_hunspell_dict", PACKAGE = "hunspell", affix,  : 
  "hunspell_R_hunspell_dict" not available for .Call() for package "hunspell"

and then another error:

Error: cannot allocate vector of size 3.4 Gb
In addition: Warning messages:
1: In fetch(key) :
  Reached total allocation of 4086Mb: see help(memory.size)

and then the rsession crashed. I didn't do anything more than described above in this session. All function calls were for your default dictionaries.

jeroen commented 8 years ago

Something must have gone wrong with your installation, it passes on all platforms. Can you try to reinstall and make sure hunspell is not loaded why you reinstall?

devtools::install_github('ropensci/hunspell")
ireyoner commented 8 years ago

Ok, now it is working fine. I don't know what was the problem earlier. Thanks for good work. Do you close issue or I should?

jeroen commented 8 years ago

If this all works now for you and you have no futher concerns, you can close the issue.

jeroen commented 8 years ago

OK thanks again for your suggestions!

ireyoner commented 8 years ago

You are welcome, the only issue that old users might find problematic, is that for other languages the parameter dict has changed from text to object. Maybe it would be better to call this "dictionary("en_US")" function inside each "basic" function?

jeroen commented 8 years ago

It should be backward compatible. If you pass a string, it will get passed on to dictionary so you get the old behaviour.

jeroen commented 8 years ago

FYI this is on CRAN now.