hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Initial capitals never regenerated (truecaser) #26

Closed: mlforcada closed this issue 5 years ago

mlforcada commented 5 years ago

Do sentences have to be delimited in some way? I have trained the truecaser on a 3032679-word tokenized text in Spanish (1 sentence per line). It generates a model which has 102972 entries (is it a unigram-based truecaser?). Then I use it to truecase a similar Spanish text which has been fully lowercased. The case of many proper nouns, etc., is correctly recovered, but nothing is capitalized at the beginning of sentences. Am I doing something wrong? Thanks a million!

alvations commented 5 years ago

Typically, the pipeline is to first have a text file that's non-tokenized, then put it through the tokenizer.

E.g. if I have the big.txt that I would like to preprocess, I'll first do:

sacremoses tokenize -j 4 < big.txt > big.txt.tok

Then train the truecasing model on the tokenized file, i.e.

sacremoses truecase -m big.model -j 4 < big.txt.tok > big.txt.tok.true

The above command first trains the truecaser model and saves it to big.model, then takes the tokenized file big.txt.tok and applies the truecase model to it to produce big.txt.tok.true.

On paper, a truecaser should consider more than unigrams, but in Moses the truecaser is generally unigram-based, cf. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
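To illustrate what "unigram-based" means here, below is a minimal sketch in plain Python. It is not the sacremoses or Moses implementation; the function names `train_unigram_truecaser` and `truecase` are hypothetical, and skipping only the first token of each line is a simplifying assumption. The idea is that the model maps each lowercased token to its most frequent surface form, so a sentence-initial common word comes out lowercase.

```python
from collections import Counter, defaultdict


def train_unigram_truecaser(tokenized_lines):
    """Map each lowercased token to its most frequent surface form.

    Hypothetical helper for illustration only. The first token of each
    line is skipped, because its capital reflects sentence position
    rather than the word's usual casing.
    """
    counts = defaultdict(Counter)
    for line in tokenized_lines:
        for token in line.split()[1:]:
            counts[token.lower()][token] += 1
    # Keep only the single most frequent casing per token (a unigram model).
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(line, model):
    """Replace each token by its most frequent casing; unknown tokens are kept."""
    return " ".join(model.get(tok.lower(), tok) for tok in line.split())


corpus = [
    "El presidente visitó Madrid ayer .",
    "Ayer el tren llegó a Madrid .",
    "La casa de María está en Madrid .",
]
model = train_unigram_truecaser(corpus)
print(truecase("el tren llegó a madrid .", model))
```

In this sketch the proper noun is restored, while the sentence-initial "el" stays lowercase, because the model has no notion of sentence position.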

The actual implementation of truecasing with higher-order n-grams is called the "recaser", see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-recaser.perl

It would require some sort of language model (LM) to be built first, and currently the recaser and an n-gram language model are not implemented in sacremoses (yet).

alvations commented 5 years ago

@mlforcada is the problem that the detruecaser doesn't restore the initial capital that truecasing removed?
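For context, here is a minimal sketch of what a detruecasing step does: it capitalizes the first word of each sentence, which truecasing deliberately leaves in its most frequent form. This is plain Python written for illustration, not the sacremoses detruecaser; the function name `detruecase_sketch` and the set of sentence-final tokens are assumptions.

```python
# Assumed set of sentence-final tokens, chosen for illustration.
SENTENCE_END = {".", "?", "!"}


def detruecase_sketch(line):
    """Capitalize the first word of the line and the first word after
    sentence-final punctuation. Hypothetical helper, illustration only."""
    out = []
    sentence_start = True
    for token in line.split():
        if sentence_start and token[:1].islower():
            token = token[0].upper() + token[1:]
        out.append(token)
        if token in SENTENCE_END:
            sentence_start = True
        elif any(ch.isalnum() for ch in token):
            # A real word ends the "sentence start" state; bare punctuation
            # such as an opening "¿" leaves it unchanged.
            sentence_start = False
    return " ".join(out)


print(detruecase_sketch("el tren llegó a Madrid . ¿ dónde está María ?"))
```

Under these assumptions, initial capitals come from this post-processing step rather than from the truecaser model itself.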

mlforcada commented 5 years ago

Apologies, @alvations. I just realised I misunderstood the role of the truecaser. I will check what you kindly wrote above. Apologies for that.

Mikel