Closed dolphingarlic closed 5 years ago
“This piece of code uses the segmenter to segment a corpus file and output the segmented sentences into a file. In kazSentenceTokenizer.rb, change the 2-letters code of the source language to the language desired. Here "kk" is code for Kazakh.”
Where do we put this file? You install it in your home directory.
Does this format work for other languages? This segmenter is set up for Kazakh; check the segmenter's site to see which other languages it supports.
When do we ever use this file? This program segments the source-language sentences. Either use this program or write your own code to segment the sentences.
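Since the answer allows using your own code for segmentation, here is a minimal sketch of what such a script could look like. It is not equivalent to kazSentenceTokenizer.rb: it naively splits after `.`, `!` or `?` followed by whitespace and will mishandle abbreviations. The names `segment` and `segment_file` are hypothetical and do not come from the repository.

```python
import re

# Naive boundary: whitespace that follows sentence-final punctuation.
# Assumption: this is far simpler than the real segmenter's rules.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def segment(text):
    """Split a string into sentences, dropping empty pieces."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


def segment_file(src_path, dst_path):
    """Read a corpus file and write one sentence per line to dst_path."""
    with open(src_path, encoding="utf-8") as fin, \
            open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            for sentence in segment(line):
                fout.write(sentence + "\n")
```

Calling `segment_file("corpus.txt", "sentences.txt")` (both file names are placeholders) would produce the kind of one-sentence-per-line file that the later steps expect.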
“For training, you should run these steps:”
Where did text.arpa come from? Where do we run these commands? To understand all of these steps, you need to read the KenLM documentation at https://kheafield.com/code/kenlm/
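For orientation, an ARPA file such as text.arpa is normally produced by KenLM's `lmplz` tool, and `build_binary` then converts it to a binary model. The sketch below only builds those two command lines as strings and does not run anything. The model order of 5, the `bin` directory, and the names `corpus.txt` and `text.binary` are assumptions for illustration; check the wiki and the KenLM page for the exact values used by apertium-ambiguous.

```python
import shlex


def kenlm_training_commands(corpus, arpa="text.arpa", binary="text.binary",
                            order=5, kenlm_bin="bin"):
    """Return the shell command lines that estimate and binarize a KenLM model.

    Assumptions: KenLM was built so that lmplz and build_binary live in
    `kenlm_bin`; the order and the file names are illustrative defaults.
    """
    # lmplz reads the training text on stdin and writes the ARPA model to stdout.
    estimate = "{}/lmplz -o {} < {} > {}".format(
        kenlm_bin, order, shlex.quote(corpus), shlex.quote(arpa))
    # build_binary converts the ARPA file into KenLM's binary format.
    binarize = "{}/build_binary {} {}".format(
        kenlm_bin, shlex.quote(arpa), shlex.quote(binary))
    return [estimate, binarize]


if __name__ == "__main__":
    for command in kenlm_training_commands("corpus.txt"):
        print(command)
```

Printing the commands first makes it easy to confirm the paths before running them from the directory where KenLM was compiled.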
What is the “subdirectory script”? It is the scripts subdirectory inside the apertium-ambiguous repository. You have to put the binary file you obtained into that scripts directory.
“Python scripts (exampleken1, kenlm.pyx, genalltra.py) used to score sentences can be found living here https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts. These scripts automatically do their functions.” Do we need to download these scripts? These scripts are already inside the apertium-ambiguous/scripts directory, so you don't need to download them.
How do they do their functions? It means that you don't need to download or write them. Their function is to score the target sentences and normalize them.
Where do they go? Their paths are set in the file CLExec.cpp.
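To make "scoring and normalizing" more concrete, here is a small sketch of one common way to normalize language-model scores. It assumes the scores are log10 probabilities (which is what KenLM reports) and that normalizing means rescaling them so the candidate translations of one source sentence sum to 1. The repository's scripts may do this differently; `normalize_scores` is a hypothetical name.

```python
def normalize_scores(log10_scores):
    """Turn log10 scores for competing sentences into weights that sum to 1.

    Assumption: each score is a log10 probability from the language model.
    """
    if not log10_scores:
        return []
    # Subtract the best score first so that 10 ** x cannot underflow to 0
    # for every candidate when the scores are very negative.
    best = max(log10_scores)
    weights = [10.0 ** (score - best) for score in log10_scores]
    total = sum(weights)
    return [weight / total for weight in weights]
```

With this scheme, two candidates with equal scores each get a weight of 0.5, and a candidate whose score is one log10 unit lower gets one tenth of the weight of the better one.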
“The next step is downloading and compiling yasmet by doing the following:” Which directory do we download it into? You have to download it into your home directory. Do we just copy the code into a new file, or do we need additional files? Just copy the file and then compile it.
“Change the language pair file name to the pair desired in the paths of apertium tools (biltrans, lextor, interchunk, postchunk, transfer) in the file CLExec.cpp”
Where in the code do we change this? In the file CLExec.cpp, change these paths to the paths of your own language pair.
What is the language pair file? It is the file for your language pair, such as apertium-kaz-tur or apertium-eng-kaz.
There are a number of unclear sections in the documentation of apertium-ambiguous on the wiki. Here is a list of the ones that I have found so far: