usc-isi-i2 / dig-crf

CRF++ extraction for DIG
Apache License 2.0

HTML Entity Encoding #2

Open CraigMiloRogers opened 8 years ago

CraigMiloRogers commented 8 years ago

The text field sometimes contains raw HTML markup, such as "<u>". The allTokens fields contain the same markup, encoded with HTML escapes, e.g. "&lt;u&gt;".

1) Should these sequences be there at all, or will they interfere with CRF processing?
2) In the text field, do we want to store them as raw HTML or as escaped sequences?
3) In the allTokens fields, do we want to store them as raw HTML or as escaped sequences?

We need to ensure that CRF training and CRF testing use the same convention.
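
One way to keep training and testing on the same convention is to run every record through a single normalization step before it reaches the CRF. The sketch below is only an illustration: the record layout (a `text` string plus an `allTokens` list of strings) and the function names `normalize_field` and `normalize_record` are assumptions, not dig-crf's actual schema or API. It relies on Python's standard `html.unescape` and `html.escape`.

```python
import html


def normalize_field(value: str, escaped: bool = False) -> str:
    """Return value in one convention: raw characters, or HTML-escaped if escaped=True."""
    # Decode first so that already-escaped and raw inputs converge on the same form.
    raw = html.unescape(value)
    # quote=False leaves quote characters alone and escapes only &, <, and >.
    return html.escape(raw, quote=False) if escaped else raw


def normalize_record(record: dict, escaped: bool = False) -> dict:
    """Apply the same convention to the (assumed) text and allTokens fields."""
    out = dict(record)  # shallow copy; the input record is not modified
    out["text"] = normalize_field(record["text"], escaped)
    out["allTokens"] = [normalize_field(tok, escaped) for tok in record["allTokens"]]
    return out


if __name__ == "__main__":
    # Hypothetical record mixing raw markup in text with escaped markup in tokens.
    sample = {"text": "<u>name</u>", "allTokens": ["&lt;u&gt;", "name", "&lt;/u&gt;"]}
    print(normalize_record(sample))
    print(normalize_record(sample, escaped=True))
```

Calling the same function with the same `escaped` setting in both the training and the testing pipelines makes the choice a single switch. Note that this decodes only one level of escaping, so double-encoded input such as "&amp;lt;" would need a second pass, and text that legitimately contains a literal escape sequence would be altered.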

szeke commented 8 years ago

I don't know. This is a question for David.