jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Character encoding problems in the NCI #4

Closed fosterjen closed 4 years ago

fosterjen commented 4 years ago

Some characters in the NCI are not properly encoded. The problem affects individual characters in otherwise well-formed sentences as well as whole blocks of text.

jowagner commented 4 years ago

Findings so far:

jowagner commented 4 years ago

The documents with doc id="itgm0022", doc id="icgm1042" and doc id="iwx00055" have an unescaped & in attribute values, which the XML parser rejects. Implemented a workaround in commit a5a27e2, line 52.
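
For readers hitting the same parser error, here is a minimal sketch of one way to handle bare ampersands before parsing. It is not the workaround from commit a5a27e2, whose details are not shown in this thread. The function name `escape_bare_ampersands`, the regular expression, and the sample `doc` element (including its id and `title` attribute) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper, not taken from the repository.
# Matches an '&' that does not start something shaped like an entity
# reference (named such as &amp;, decimal &#38;, or hexadecimal &#x26;).
# Note: names the parser does not know (e.g. &nbsp; without a DTD) are
# left untouched here and would still need separate handling.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace bare '&' with '&amp;' so an XML parser accepts the text."""
    return BARE_AMP.sub("&amp;", text)


# Made-up sample resembling a corpus header with an unescaped '&'
# inside an attribute value.
raw = '<doc id="example0001" title="Arts & Culture">text &amp; more</doc>'

fixed = escape_bare_ampersands(raw)
doc = ET.fromstring(fixed)  # parsing `raw` directly raises ET.ParseError
```

The negative lookahead leaves references that are already escaped alone, so running the function twice does not turn `&amp;` into `&amp;amp;`.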

jowagner commented 4 years ago

Update (with help from Teresa and Lauren):

jowagner commented 4 years ago

Created separate issues for all issues mentioned above.