Closed: mlforcada closed this issue 5 years ago
Typically, the pipeline is to start with a non-tokenized text file and put it through the tokenizer. E.g. if I have big.txt that I would like to preprocess, I'll first do:
sacremoses tokenize -j 4 < big.txt > big.txt.tok
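For illustration, the kind of splitting the tokenizer performs can be sketched in plain Python. This is a toy approximation, not sacremoses' actual rules, which also handle contractions, non-breaking prefixes, URLs, etc.:

```python
import re

def toy_tokenize(text):
    """Toy Moses-style tokenization sketch: split punctuation off
    surrounding words. NOT the real sacremoses rules (e.g. Moses
    would split "It's" into "It 's")."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print(toy_tokenize("Hello, world! It's me."))
# → ['Hello', ',', 'world', '!', 'It', "'", 's', 'me', '.']
```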
Then afterwards, train the truecasing model using the tokenized file, i.e.
sacremoses truecase -m big.model -j 4 < big.txt.tok > big.txt.tok.true
The above command first trains the truecaser model and saves it to big.model, then takes the tokenized file big.txt.tok and applies the truecase model to it, producing big.txt.tok.true.
On paper, the truecaser should consider more than unigrams, but in Moses the truecaser is generally unigram-based, cf. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
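The unigram idea can be sketched in a few lines of plain Python: count the surface forms of each lowercased token over tokenized training lines, then replace every token with its most frequent casing. (This is a simplified sketch, not sacremoses' implementation; the real Moses script also weights sentence-initial tokens specially rather than skipping them outright.)

```python
from collections import Counter, defaultdict

def train_truecaser(tokenized_lines):
    """Count surface forms per lowercased token (unigram model)."""
    counts = defaultdict(Counter)
    for line in tokenized_lines:
        tokens = line.split()
        # Skip the sentence-initial token: its capitalization is
        # imposed by position, so it is weak evidence of true case.
        for tok in tokens[1:]:
            counts[tok.lower()][tok] += 1
    # Keep only the most frequent surface form for each token.
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(line, model):
    """Map each token to its most frequent casing; unknowns pass through."""
    return " ".join(model.get(tok.lower(), tok) for tok in line.split())

model = train_truecaser([
    "The city of Madrid is in Spain .",
    "He moved to Madrid in 2010 .",
])
print(truecase("madrid is the capital of spain .", model))
# → "Madrid is the capital of Spain ."
```

Note that "madrid" gets its capital back because "Madrid" dominates in the training data, while "the" stays lowercase: the model restores the statistically typical casing, not sentence-initial capitals.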
The actual implementation of the truecaser with higher-order n-grams is called the "recaser", from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-recaser.perl
It would require some sort of language model (LM) to be built first, and currently neither the recaser nor an n-gram language model is implemented in sacremoses (yet).
@mlforcada are you having issues with the detruecaser not undoing the truecasing of the initial capital?
Apologies, @alvations. I just realised I misunderstood the role of the truecaser. I will check what you kindly wrote above. Apologies for that.
Mikel
Do sentences have to be delimited in some way? I have trained the truecaser with a 3032679-word tokenized text in Spanish (1 sentence per line). It generates a model which has 102972 entries (is it a unigram-based truecaser?). Then I use it to truecase a similar text in Spanish which has been entirely lowercased. The case of many proper nouns, etc., is correctly recovered, but nothing changes at the beginning of sentences. Am I doing something wrong? Thanks a million!
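The behaviour described above is what a truecaser is expected to do: a sentence-initial common word whose dominant casing in the training data is lowercase stays lowercase. Restoring sentence-initial capitals is the detruecaser's job (the detruecase step mentioned earlier), which can be sketched as follows. This is a simplified sketch, not the real Moses script, which also handles quotes and brackets before the first word:

```python
def detruecase(line):
    """Sketch of detruecasing: uppercase the first character of the
    first token of the sentence."""
    tokens = line.split()
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)

# A truecaser leaves a sentence-initial common word in its most
# frequent (lowercase) form; detruecasing then restores the capital.
print(detruecase("el gato está en madrid ."))
# → "El gato está en madrid ."
```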