The mapping of offsets from raw text back to HTML needs to be better than the current approach, which is a bad heuristic. The low MAP score (~11% at document level) is most likely caused by this.
For every word token, we need to search from both the left and the right to find its possible boundaries in the HTML.
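One way the token-level boundary search could look is sketched below. This is only an illustration, not the existing code: the names `text_regions`, `align_tokens`, `map_span` and the `window` parameter are hypothetical, and it assumes the raw text was produced by stripping tags so that word tokens appear in the same order in both. Words split by inline tags, HTML entities, and `<script>`/`<style>` content are not handled; unmatched tokens are skipped rather than guessed.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"\w+")


def text_regions(html):
    """Yield (start, end) HTML offsets of the stretches between tags."""
    pos = 0
    for tag in TAG_RE.finditer(html):
        if tag.start() > pos:
            yield pos, tag.start()
        pos = tag.end()
    if pos < len(html):
        yield pos, len(html)


def align_tokens(raw_text, html, window=50):
    """Pair each word token of raw_text with its offsets in html.

    Returns a list of (raw_start, raw_end, html_start, html_end).
    `window` (an assumed tuning knob) limits how far ahead we look
    for a match, so one missing token cannot derail the alignment.
    """
    # Word tokens of the HTML that lie outside tags, in document order.
    html_tokens = [
        (m.group(), m.start(), m.end())
        for lo, hi in text_regions(html)
        for m in WORD_RE.finditer(html, lo, hi)
    ]
    aligned = []
    cursor = 0
    for m in WORD_RE.finditer(raw_text):
        stop = min(cursor + window, len(html_tokens))
        for j in range(cursor, stop):
            word, h_start, h_end = html_tokens[j]
            if word == m.group():
                aligned.append((m.start(), m.end(), h_start, h_end))
                cursor = j + 1
                break
    return aligned


def map_span(aligned, raw_start, raw_end):
    """Map a raw-text span to HTML offsets via the tokens it overlaps.

    The left boundary comes from the first overlapping token and the
    right boundary from the last one. Returns None if no token overlaps.
    """
    hits = [a for a in aligned if a[0] < raw_end and a[1] > raw_start]
    if not hits:
        return None
    return hits[0][2], hits[-1][3]


html = "<p>Hello <b>big</b> world</p>"
raw = "Hello big world"
aligned = align_tokens(raw, html)
span = map_span(aligned, 6, 15)  # raw span covering "big world"
```

Here `html[span[0]:span[1]]` covers the HTML from the start of "big" to the end of "world", including the closing `</b>` tag in between. Whether tags at the edges of a span should be included is a separate decision that this sketch does not make.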