The text field sometimes contains HTML entities, such as "<u>". The allTokens fields contain the same entities, encoded with HTML escapes, e.g. "&lt;u&gt;".
1) Should they be there, or will they interfere with CRF processing?
2) Do we want to encode them as HTML entities or as escaped sequences in the text field?
3) Do we want to encode them as HTML entities or as escaped sequences in the allTokens fields?
We need to ensure that CRF training and CRF testing use the same convention.
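Whichever convention is chosen, one way to keep training and testing in step is to pass both fields through the same normalization function before they reach the CRF. The following is a minimal sketch, not the project's actual code: it assumes the raw (unescaped) form is the chosen convention, that a record is a dict with a "text" string and an "allTokens" list of strings, and the names `to_raw` and `normalize_record` are hypothetical. It uses only Python's standard `html` module.

```python
import html


def to_raw(value: str) -> str:
    """Decode HTML escapes such as &lt; and &gt; (assumed convention: raw form)."""
    return html.unescape(value)


def normalize_record(record: dict) -> dict:
    """Apply the same convention to both fields of a (hypothetical) record."""
    return {
        "text": to_raw(record["text"]),
        "allTokens": [to_raw(tok) for tok in record["allTokens"]],
    }


# Illustrative record; the field layout is an assumption, not the real schema.
record = {
    "text": "some <u>underlined</u> words",
    "allTokens": ["some", "&lt;u&gt;", "underlined", "&lt;/u&gt;", "words"],
}
normalized = normalize_record(record)
```

If the escaped form were preferred instead, `html.escape` could be substituted in `to_raw`, but it would have to be applied only to values that are not already escaped, otherwise "&lt;" would become "&amp;lt;". Calling the same function from both the training and the testing pipeline is what keeps the two consistent.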