sigpwned / berkeleylm

Automatically exported from code.google.com/p/berkeleylm

Get raw ngram count in addition to logProb #3

Status: Closed (GoogleCodeExporter closed this issue 8 years ago)

GoogleCodeExporter commented 8 years ago
Feature request: provide access to the raw count of an n-gram, in addition to 
its log probability, when Google n-gram data is used as the back end.

Original issue reported on code.google.com by torsten....@gmail.com on 14 Jul 2011 at 7:14

GoogleCodeExporter commented 8 years ago
Do you need this access to be fast? There is some existing functionality that 
you can access by calling:
 new NgramMapWrapper<W, LongRef>(lm.getNgramMap(), lm.getWordIndexer());

on a StupidBackoffLm. This gives a Map from List<W> to LongRef values. However, 
this interface is slow because of all the boxing and unboxing.
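
To make the trade-off concrete, here is a minimal, self-contained sketch of a map-style count lookup. It does not use berkeleylm: `CountLookupSketch`, `CountRef`, and the counts are hypothetical stand-ins chosen for illustration, and the real `NgramMapWrapper` may behave differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountLookupSketch {
    /** Hypothetical mutable holder for a count (a stand-in, not berkeleylm's LongRef). */
    static final class CountRef {
        long value;
        CountRef(long value) { this.value = value; }
    }

    // Toy data: made-up counts keyed by the words of each n-gram.
    static final Map<List<String>, CountRef> COUNTS = new HashMap<>();
    static {
        COUNTS.put(Arrays.asList("the", "cat"), new CountRef(42L));
        COUNTS.put(Arrays.asList("the", "dog"), new CountRef(17L));
    }

    /** Returns the raw count of an n-gram, or 0 if it was never seen. */
    static long rawCount(List<String> ngram) {
        // Each lookup hashes every word in the list and returns a heap object,
        // which is the kind of overhead a primitive int[] interface avoids.
        CountRef ref = COUNTS.get(ngram);
        return ref == null ? 0L : ref.value;
    }

    public static void main(String[] args) {
        System.out.println(rawCount(Arrays.asList("the", "cat")));
        System.out.println(rawCount(Arrays.asList("a", "fish")));
    }
}
```

A map keyed by word lists is convenient for occasional lookups; for bulk queries, the per-lookup hashing and object allocation add up.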

Original comment by adpa...@gmail.com on 14 Jul 2011 at 5:39

GoogleCodeExporter commented 8 years ago
Of course, fast is always better :)

However, it seems I have not fully understood the way the library works.
Two questions:
1) As the JavaDocs say that getLogProb() is slow, what is a fast way to get 
this information given a phrase?

2) How is this probability computed given the raw counts in the Google web1t 
corpus? It seems to me there should be an easy way to just invert the process.

thanks for your help,
Torsten

Original comment by torsten....@gmail.com on 15 Jul 2011 at 7:52

GoogleCodeExporter commented 8 years ago
1) NgramLanguageModel.getLogProb(List<W>) is described as "slow" because it 
first has to convert the List<W> into an int[]. It is not slow in absolute 
terms, only relative to the efficient accessors 
ArrayEncodedNgramLanguageModel.getLogProb(int[]) and 
ContextEncodedNgramLanguageModel.getLogProb. I have added comments that point 
to those calls so that others are not confused by this.

2) The probability is computed using Stupid Backoff. I have added a call to 
StupidBackoffLm that grabs the count, and will be releasing a new version of 
the code with this fix shortly. 
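
For readers unfamiliar with Stupid Backoff, the following is a minimal sketch of the scoring rule as described by Brants et al. (2007): use the relative frequency count(n-gram) / count(context) when the n-gram was seen, otherwise drop the first word and multiply by a fixed factor (0.4 in that paper). It is not berkeleylm's implementation; the class `StupidBackoffSketch`, its methods, and the toy counts are assumptions made for illustration, and it returns a plain score rather than a log value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StupidBackoffSketch {
    // Backoff factor suggested by Brants et al. (2007); assumed here.
    static final double ALPHA = 0.4;

    private final Map<List<String>, Long> counts = new HashMap<>();
    private long totalTokens; // sum of all unigram counts

    void addCount(List<String> ngram, long count) {
        counts.put(ngram, count);
        if (ngram.size() == 1) {
            totalTokens += count;
        }
    }

    /** Raw count of an n-gram, or 0 if unseen. */
    long rawCount(List<String> ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    /** Stupid Backoff score of the last word given the preceding words. */
    double score(List<String> ngram) {
        double penalty = 1.0;
        for (int start = 0; start < ngram.size(); start++) {
            List<String> current = ngram.subList(start, ngram.size());
            long count = rawCount(current);
            // The denominator is the context count, or the corpus size for unigrams.
            long denominator = current.size() == 1
                    ? totalTokens
                    : rawCount(current.subList(0, current.size() - 1));
            if (count > 0 && denominator > 0) {
                return penalty * count / denominator;
            }
            penalty *= ALPHA; // back off to a shorter context
        }
        return 0.0; // the final word itself was never seen
    }

    public static void main(String[] args) {
        StupidBackoffSketch lm = new StupidBackoffSketch();
        lm.addCount(Arrays.asList("the"), 5);
        lm.addCount(Arrays.asList("cat"), 2);
        lm.addCount(Arrays.asList("sat"), 3);
        lm.addCount(Arrays.asList("the", "cat"), 2);
        System.out.println(lm.score(Arrays.asList("the", "cat"))); // seen bigram
        System.out.println(lm.score(Arrays.asList("sat", "cat"))); // backs off to unigram
    }
}
```

Because the score is a ratio of two counts, possibly scaled by the backoff factor, the raw count cannot be recovered from the score alone; that is why a direct count accessor is useful.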

Original comment by adpa...@gmail.com on 15 Jul 2011 at 6:19