hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Tokenizer -x option is confusing #98

Open ZJaume opened 4 years ago

ZJaume commented 4 years ago

The usage text for the -x option says:

-x, --xml-escape               Escape special characters for XML.

But it does the same as the -no-escape option in Moses, i.e. passing it turns escaping off.

alvations commented 4 years ago

Hmmm, true. The point was to keep the interface pythonic, but I agree it's confusing. Let me think of a better wording for the feature =)

ZJaume commented 4 years ago

What about something like

-x, --no-xml-escape      Don't escape special characters for XML.

or just removing the short form -x and leaving only --no-xml-escape? If --no-xml-escape is too long, why not simply --no-escape like Moses?

I think it should at least have the "negation" in the help message, because the current wording is very confusing.
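
To make the proposed naming concrete, here is a minimal sketch of a negated flag whose name and help text match what it actually does. It uses `argparse` from the standard library purely for illustration and is not sacremoses' actual CLI code; the `build_parser` helper, the `tokenize-sketch` program name and the `xml_escape` destination are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser illustrating the proposed flag naming."""
    parser = argparse.ArgumentParser(prog="tokenize-sketch")
    parser.add_argument(
        "--no-xml-escape",
        dest="xml_escape",      # stored as a positive setting
        action="store_false",   # escaping stays on unless the flag is passed
        help="Don't escape special characters for XML.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("XML escaping enabled:", args.xml_escape)
```

With this layout, escaping is on by default, and passing `--no-xml-escape` is the only thing that turns it off, so the flag name, the help text and the behaviour all agree.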

bricksdont commented 4 years ago

Agreed, the option name and help text definitely do not make sense.

But then, does the default behaviour need to be that special XML characters are escaped (legacy behaviour from SMT/Moses)? I totally understand if the argument is that sacremoses should behave exactly like the original Moses tokenizer.
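
For context on what the default does to the text, here is a rough sketch of Moses-style escaping. The replacement table is written from memory of the original Moses `tokenizer.perl`, so treat it as an assumption rather than a copy of sacremoses' code; `MOSES_ESCAPES` and `escape_xml` are hypothetical names.

```python
# Sketch of Moses-style escaping of special characters (assumed table,
# not sacremoses' actual implementation).
MOSES_ESCAPES = [
    ("&", "&amp;"),    # must come first so later entities are not re-escaped
    ("|", "&#124;"),   # factor separator in Moses
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),    # syntax non-terminal brackets in Moses
    ("]", "&#93;"),
]


def escape_xml(text: str) -> str:
    """Apply each replacement in order and return the escaped text."""
    for raw, escaped in MOSES_ESCAPES:
        text = text.replace(raw, escaped)
    return text


if __name__ == "__main__":
    print(escape_xml("AT&T [x] <b>"))
```

Output like this is what a downstream pipeline has to undo again if it does not expect escaped text, which is why the default matters.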