We need to be able to choose the input format and the output format independently. At the moment, the -wxml option changes the output format so that it contains the original string enclosed in a <word> element, but it also makes the tool expect the input to be some kind of XML, which means that any XML-like content in the text is ignored and not analysed. Without the -wxml option, the input is treated as plain text (which is what we want), but the output no longer contains the original string (only a lower-cased version of it). Could we somehow get both?
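To make the request more concrete, here is a rough sketch of the behaviour we have in mind. It is not the tool's actual code or output format: the function name `analyse_plain_text`, the whitespace tokenisation, and the tab-separated layout are all assumptions made up for illustration. The only point is that the input is read as plain text, so XML-like content is kept rather than skipped, while the original string still appears inside a `<word>` element next to the lower-cased form.

```python
from xml.sax.saxutils import escape


def analyse_plain_text(text):
    """Hypothetical stand-in for the analyser: plain text in, <word> elements out."""
    results = []
    # Naive whitespace tokenisation, for illustration only.
    for token in text.split():
        # Escape markup characters so XML-like input is treated as ordinary
        # text and still ends up in the output.
        original = escape(token)
        # Keep the original string in <word>, followed by the lower-cased form.
        results.append(f"<word>{original}</word>\t{token.lower()}")
    return results


if __name__ == "__main__":
    for line in analyse_plain_text("Hello <b>World</b>"):
        print(line)
```

In other words, something like an input-format switch that is independent of -wxml would cover our use case, whatever form it takes in practice.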