jermp / tongrams_estimation

A C++ library implementing fast language-model estimation using the 1-Sort algorithm.
MIT License

Count Feature Requests #3

Closed misner5 closed 1 year ago

misner5 commented 1 year ago

Hello, this is a great tool, and having looked through the code it seems to be really well made.

I was trying to figure out how to modify this code base to do a few more things.

1) Count n-grams of 1 and 2 words. I realize you've likely disabled this because it makes huge files. But if it could optionally block counts lower than 1 or N, it could be extremely useful and would have manageable file sizes.

2) Filter out low counts when counting n-grams. Basically, this would make the tool better at extracting word patterns while ignoring single or low counts that don't amount to a pattern. Ideally it would be an input on the command line.

Thanks for making this great open source tool!

misner5 commented 1 year ago

Just reflecting on this, I guess you kind of have this information in the "arpa" file output in Tongrams. But it would be useful to have that kind of information together with counts.

jermp commented 1 year ago

Hello @misner5, and thank you for your kind words. Happy to know you found Tongrams useful.

I may not have understood what you mean in point 1 above. The n-gram length (i.e., the "n") is an input to the tool. So if you specify n=2, then you count n-grams of 1 and 2 words, as you say. By default, I count all n-grams of length 1..n.

Feature 2 could be nice to add. Right now I'm counting everything without pruning, as that seems to be what other tools like KenLM do.

Best, -Giulio

misner5 commented 1 year ago

Ok... yeah, I was able to get point 1 to work with a bit of tinkering in count.cpp. The two command lines I was running are:

    ./count ../test_data/1Billion.1M 1 --tmp tmp_dir --ram 0.25 --out 1-grams
    ./count ../test_data/1Billion.1M 2 --tmp tmp_dir --ram 0.25 --out 2-grams

I just modified the code that was blocking that on lines 53-56:

    if (config.max_order < 1 or config.max_order > global::max_order) {
        std::cerr << "invalid language model order" << std::endl;
        return 1;
    }

After I did that, it all worked OK... for some reason it was blocking n-grams <= 2.

If you're interested I can show you the project we're working on using this tool... you might find it interesting... I'll contact you directly for that.

Cheers, and thanks again for all your hard work and open source contributions in this area, Michael

jermp commented 1 year ago

Hi @misner5, good to know you got what you wanted. Thank you for your message! I just replied to you. Best, -Giulio