mahfuzibnalam / terminology_evaluation

MIT License

Tokenizer for French #6

Closed (thomasZen closed this issue 2 years ago)

thomasZen commented 2 years ago

Hi,

I tried to use this repository to score my own hypothesis, but couldn't reproduce the tokenization you used. Specifically, the util_preprocessor.py called here does not seem to be present in this repository.

When using Moses for tokenization, I get a tokenization that differs from the reference in the test data. Here's an example:

Input: ... d’action ...
Moses: ... d ’ action ... (apostrophe as its own token)
Reference in test data: ... d' action ... (apostrophe attached to d)

Could you share the code you used for tokenization, or tell me which toolkit you used?

Thanks for this repository, Thomas

thomasZen commented 2 years ago

This command seems to do the trick, at least for the example I cared about:

echo "d'action" | ~/GitRepos/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr -no-escape

In the comment above I used a different character (the typographic apostrophe ’ instead of the ASCII apostrophe '), which confused me initially.
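
For anyone hitting the same mismatch: the two apostrophes are different code points (U+2019 vs. U+0027), so it may help to normalize the input before tokenizing. Below is a minimal sketch using only the Python standard library. The helper names `normalize_apostrophes` and `describe` and the list of variants are assumptions made for illustration; they are not part of this repository or of Moses, and the authors' actual preprocessing may differ.

```python
import unicodedata

# Hypothetical mapping (an assumption, not taken from this repository):
# common apostrophe look-alikes mapped to the ASCII apostrophe U+0027.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (typographic apostrophe)
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe look-alikes with the ASCII apostrophe."""
    return text.translate(_TABLE)


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    raw = "d\u2019action"
    print(describe(raw[1]))            # shows which apostrophe the input uses
    print(normalize_apostrophes(raw))  # same text with an ASCII apostrophe
```

The normalized text can then be piped to the tokenizer.perl command shown above. Whether this fully reproduces the reference tokenization for other inputs is untested here.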