jermp / tongrams

A C++ library providing fast language model queries in compressed space.
MIT License

Trying build_trie with arpa file. #20

Closed: abdullah-saal closed this issue 2 years ago

abdullah-saal commented 2 years ago

Given the following command:

 ./build_trie ef_trie 3 prob_backoff --remapping 2 --u -20.0 --p 8 --b 8 --arpa lmclean.arpa   --out ef_trie.prob_backoff.bin

Getting the error:

arpa file contains wrong data:
        'السعدي' should have been found within previous order grams

Where I am sure it exists:

grep "السعدي" lmclean.arpa
-5.097526       السعدي  -0.3184454

Does it not work with non-ASCII characters?

jermp commented 2 years ago

Hi, are you sure your ARPA file is sorted correctly? It must be sorted in suffix order. If it is not, you can sort it with the utility src/sort_arpa.cpp: https://github.com/jermp/tongrams/blob/master/src/sort_arpa.cpp
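
To make "suffix order" concrete, here is a minimal Python sketch. It assumes that suffix order means comparing n-grams token by token from the last word backwards, with each token ranked by its position in the vocabulary list rather than alphabetically. That reading is an assumption, not something confirmed in this thread, and the names suffix_order_key and sort_suffix_order are hypothetical helpers, not part of tongrams.

```python
def suffix_order_key(ngram, rank):
    """Sort key: vocabulary ranks of the tokens, read from last token to first."""
    return tuple(rank[token] for token in reversed(ngram.split()))


def sort_suffix_order(ngrams, vocab):
    """Sort n-grams in (assumed) suffix order relative to the vocabulary order."""
    # Rank of each token = its position in the vocabulary list (assumption).
    rank = {token: i for i, token in enumerate(vocab)}
    return sorted(ngrams, key=lambda g: suffix_order_key(g, rank))


if __name__ == "__main__":
    vocab = ["a", "b", "c"]  # toy vocabulary
    bigrams = ["a b", "b a", "c a", "a c", "b b"]
    # Bigrams ending in "a" come first, then those ending in "b", then "c".
    print(sort_suffix_order(bigrams, vocab))
```

The actual ordering used by sort_arpa should be checked against the tongrams source; the sketch only shows why a file sorted in plain lexicographic order could be rejected.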

Let me know.

abdullah-saal commented 2 years ago

Hi, thanks for the quick response. I have tried sort_arpa, but its output isn't in ARPA format, so I couldn't use it with build_trie. I am not sure whether I am misunderstanding something.

jermp commented 2 years ago

Hi. For example, the command ./sort_arpa 2 ../test_data/arpa ../test_data/1-grams.sorted.gz arpa_sorted_2grams sorts the 2-grams of the test ARPA file test_data/arpa in suffix order. The output is a file (arpa_sorted_2grams in this example) containing all the sorted 2-grams. You must also supply the vocabulary as a list of uni-grams (test_data/1-grams.sorted.gz in this example).

So you sort each order separately and then concatenate the sorted files, together with the ARPA header.
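
The reassembly step could look roughly like the Python sketch below. It assumes the standard ARPA layout (a \data\ header with one "ngram N=count" line per order, one \N-grams: section per order, and a closing \end\ marker) and assumes that each sorted file holds only entry lines, with no section markers. Neither assumption is confirmed in this thread, and assemble_arpa is a hypothetical helper, not part of tongrams.

```python
def assemble_arpa(sections):
    """Build ARPA text from {order: [already sorted entry lines]}.

    Assumes each list holds bare entry lines (no section markers).
    """
    orders = sorted(sections)
    lines = ["\\data\\"]
    # Header: one count line per order.
    lines += [f"ngram {n}={len(sections[n])}" for n in orders]
    lines.append("")
    # Body: one section per order, in increasing order.
    for n in orders:
        lines.append(f"\\{n}-grams:")
        lines.extend(sections[n])
        lines.append("")
    lines.append("\\end\\")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy entries with made-up values, only to show the layout.
    toy = {
        1: ["-1.0\ta\t-0.5", "-1.2\tb\t-0.3"],
        2: ["-0.7\tb a"],
    }
    print(assemble_arpa(toy), end="")
```

In practice the header can simply be copied from the original ARPA file, since sorting does not change the number of n-grams per order.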

Let me know if everything is clear now.

PS. There was a minor bug, which I have now fixed, so pull the latest version of the code before trying again.

abdullah-saal commented 2 years ago

Thanks, it's working now.