Closed abdullah-saal closed 2 years ago
Hi,
are you sure your arpa file is sorted correctly?
It should be sorted in suffix order.
If it is not sorted, you can use the sorting utility src/sort_arpa.cpp, here: https://github.com/jermp/tongrams/blob/master/src/sort_arpa.cpp
Let me know.
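To illustrate what "suffix order" means here, a minimal Python sketch follows. It is not the tongrams implementation. It assumes n-grams are compared starting from their last word and moving backwards, and that words are ranked by their position in the vocabulary list (an assumption, based on sort_arpa taking the uni-gram file as input). The names `suffix_order_key` and `sort_suffix_order` are hypothetical.

```python
def suffix_order_key(ngram, word_id):
    # Hypothetical helper: compare the last word first, then the one
    # before it, and so on (i.e. the reversed sequence of word ids).
    return tuple(word_id[w] for w in reversed(ngram.split()))


def sort_suffix_order(ngrams, vocab):
    # Assumption: a word's rank is its index in the vocabulary list.
    word_id = {w: i for i, w in enumerate(vocab)}
    return sorted(ngrams, key=lambda g: suffix_order_key(g, word_id))


vocab = ["a", "b", "c"]
bigrams = ["b a", "a b", "c a", "a a"]
print(sort_suffix_order(bigrams, vocab))
# prints ['a a', 'b a', 'c a', 'a b']
```

All 2-grams ending in "a" come first, ordered among themselves by their first word, followed by those ending in "b".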
Hi, thanks for the quick response. I have tried sort_arpa, but the output of that command isn't in ARPA format, so I couldn't use it with build_trie. Not sure if I am misunderstanding something.
Hi,
for example, with the command
./sort_arpa 2 ../test_data/arpa ../test_data/1-grams.sorted.gz arpa_sorted_2grams
you sort in suffix order the 2-grams of the test ARPA file test_data/arpa.
The output is a file (called arpa_sorted_2grams in the example above) with all the sorted 2-grams.
You must supply the vocabulary as a list of uni-grams (test_data/1-grams.sorted.gz in the example).
So you sort all orders and you concatenate them together (plus the ARPA header too).
Let me know if everything is clear now.
PS. There was a minor bug that I fixed now. So pull the new version of the code before trying again.
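The concatenation step described above (sorted n-grams of every order, plus the ARPA header) might be sketched as follows. This is only an illustration, not part of tongrams. It assumes each per-order list holds bare n-gram lines with no section markers; whether the files written by sort_arpa already contain such markers is not stated in this thread, so check before using it. The function name `assemble_arpa` is hypothetical.

```python
def assemble_arpa(sections):
    """Build ARPA text from {order: [sorted n-gram lines]}.

    Assumption: the lines contain only n-gram entries, without
    "\\N-grams:" markers or blank lines.
    """
    orders = sorted(sections)
    out = ["\\data\\"]
    # Header: one count line per order.
    for n in orders:
        out.append(f"ngram {n}={len(sections[n])}")
    out.append("")
    # Body: one section per order, each followed by a blank line.
    for n in orders:
        out.append(f"\\{n}-grams:")
        out.extend(sections[n])
        out.append("")
    out.append("\\end\\")
    return "\n".join(out) + "\n"


sections = {
    1: ["-1.0\ta\t-0.5", "-1.2\tb\t-0.3"],
    2: ["-0.7\ta b"],
}
print(assemble_arpa(sections), end="")
```

In practice the lists would be read from the per-order output files (for example arpa_sorted_2grams), and the result written to a new file to pass to build_trie.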
Thanks, it's working now.
Given the following command:
Getting the error:
Where I am sure it exists:
Does it not work with non-ascii characters?