Closed (misner5 closed this issue 1 year ago)
Just reflecting on this, I guess you kind of have this information in the "arpa" file output in Tongrams. But it would be useful to have that kind of information with counts.
Hello @misner5, and thank you for your kind words. Happy to know you found Tongrams useful.
I may not have understood what you mean in point 1 above. The n-gram length (i.e., the "n") is an input to the tool, so if you specify n=2, you count n-grams of 1 and 2 words, as you say. By default, I'm counting all n-grams of length 1..n.
Feature 2 could be nice to add. Right now I'm counting everything without pruning, as this seems to be what other tools like KenLM do.
Best, -Giulio
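To illustrate the "all n-grams of length 1..n" behaviour described above, here is a minimal in-memory sketch. It is not Tongrams' actual implementation (which works with temporary files and a RAM budget); the names tokenize and count_ngrams are hypothetical, and whitespace tokenization with no sentence markers is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Tongrams): split a line on whitespace.
std::vector<std::string> tokenize(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    return tokens;
}

// Hypothetical sketch: count every n-gram of length 1..max_order.
// N-grams never cross line boundaries; words are joined by a single space.
std::map<std::string, std::uint64_t> count_ngrams(
    const std::vector<std::string>& lines, std::size_t max_order) {
    assert(max_order >= 1);
    std::map<std::string, std::uint64_t> counts;
    for (const auto& line : lines) {
        const std::vector<std::string> tokens = tokenize(line);
        for (std::size_t begin = 0; begin < tokens.size(); ++begin) {
            std::string gram;
            // Extend the n-gram starting at `begin` one word at a time,
            // counting each prefix of length 1, 2, ..., max_order.
            for (std::size_t len = 1;
                 len <= max_order && begin + len <= tokens.size(); ++len) {
                if (len > 1) gram += ' ';
                gram += tokens[begin + len - 1];
                ++counts[gram];
            }
        }
    }
    return counts;
}
```

With max_order = 2, calling count_ngrams on the two lines "a b a b" and "a b c" yields counts for the unigrams a, b, c and the bigrams "a b", "b a", "b c" in a single pass.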
Ok... yeah, I was able to get point 1 to work with a bit of tinkering in count.cpp. The two command lines I was running are:
./count ../test_data/1Billion.1M 1 --tmp tmp_dir --ram 0.25 --out 1-grams
./count ../test_data/1Billion.1M 2 --tmp tmp_dir --ram 0.25 --out 2-grams
I just modified the code that was blocking that on lines 53-56:
if (config.max_order < 1 or config.max_order > global::max_order) {
    std::cerr << "invalid language model order" << std::endl;
    return 1;
}
After I did that it all worked ok... for some reason it was blocking n-grams <= 2
If you're interested I can show you the project we're working on using this tool... you might find it interesting... I'll contact you directly for that.
Cheers and thanks again for all your hard work and open source contributions in this area, Michael
Hi @misner5, good to know you got what you wanted. Thank you for your message! I just replied to you. Best, -Giulio
Hello, this is a great tool and having looked through the code it seems to be really well made.
I was trying to figure out how to modify this code base to do a few more things.
1) Count n-grams of 1 and 2 words. I realize you've likely disabled this because it makes huge files. But if it could optionally block counts lower than 1 or N it could be extremely useful, and have manageable file sizes.
2) Filter out low counts when counting n-grams. Basically, this would make the tool better at extracting word patterns while ignoring single or low counts that don't amount to a pattern. Ideally the threshold would be an input on the command line.
Thanks for making this great open source tool!
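For what it's worth, feature 2 could be sketched roughly as a post-processing step over an in-memory count table. This is only an illustration under assumptions: the ngram_counts alias and prune_low_counts function are hypothetical names, not anything that exists in Tongrams, and a real version would presumably need to filter while streaming from disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical type: n-gram text mapped to its occurrence count.
using ngram_counts = std::unordered_map<std::string, std::uint64_t>;

// Hypothetical sketch: erase every n-gram seen fewer than min_count times.
// Returns how many entries were removed.
std::size_t prune_low_counts(ngram_counts& counts, std::uint64_t min_count) {
    assert(min_count >= 1);
    std::size_t removed = 0;
    for (auto it = counts.begin(); it != counts.end();) {
        if (it->second < min_count) {
            it = counts.erase(it);  // erase returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

The threshold would then just be one more number read from the command line and passed in as min_count; a value of 1 keeps everything, which matches the current no-pruning behaviour.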