phonlab-tcd / An-Scealai

An Scéalaí is an open-source online learning platform for teachers and students of the Irish language.
GNU General Public License v3.0

Digital Reader - Words with punctuation being split by the POS tagger #539

Open DavidMockler opened 3 months ago

DavidMockler commented 3 months ago

Word units that contain punctuation (e.g., "b'iontach", "D'imigh") are split by the POS tagger and subsequently treated as separate words. They should probably be re-merged at some point in the pipeline so that each one is treated as a single unit in the final dr-story-viewer. As things stand, the split causes indexing issues when either of the resulting units is clicked in the story viewer. This is not the most pressing issue.
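
One possible shape for the re-merge step, as a rough sketch rather than a patch against the actual pipeline: it assumes each tagged token carries character offsets into the original text, and that a "word unit" is a run of letters joined by internal apostrophes. The `Token` class, the `remerge` function, and the tag names are all hypothetical; the real tagger output may look different, and keeping the tag of the last fragment is just one arbitrary choice.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: a word unit is letters joined by one or more internal
# apostrophes (straight or curly), e.g. "b'iontach" or "D'imigh".
WORD_UNIT = re.compile(r"\w+(?:['’]\w+)+")


@dataclass
class Token:
    """Hypothetical tagger output: text, character offsets, and a POS tag."""
    text: str
    start: int  # inclusive offset into the source text
    end: int    # exclusive offset into the source text
    tag: str


def remerge(source: str, tokens: List[Token]) -> List[Token]:
    """Merge consecutive tokens that fall inside the same word unit."""
    spans = [m.span() for m in WORD_UNIT.finditer(source)]
    merged: List[Token] = []
    last_span: Optional[Tuple[int, int]] = None
    for tok in tokens:
        # Find the word unit (if any) that fully contains this token.
        span = next(
            ((s, e) for s, e in spans if s <= tok.start and tok.end <= e),
            None,
        )
        if span is not None and span == last_span:
            # Same unit as the previous token: extend it instead of appending.
            prev = merged[-1]
            # Assumption: keep the tag of the last fragment (the content word).
            merged[-1] = Token(source[prev.start:tok.end], prev.start, tok.end, tok.tag)
        else:
            merged.append(Token(tok.text, tok.start, tok.end, tok.tag))
        last_span = span
    return merged


source = "D'imigh sé."
split_tokens = [
    Token("D'", 0, 2, "Part"),
    Token("imigh", 2, 7, "Verb"),
    Token("sé", 8, 10, "Pron"),
    Token(".", 10, 11, "Punct"),
]
merged_tokens = remerge(source, split_tokens)
print([t.text for t in merged_tokens])
```

Because the merge is driven by spans found in the original text rather than by the token strings, it should not matter whether the tagger emits "D'" + "imigh" or "D" + "'" + "imigh". Quotation marks around a phrase are left alone, since they are not flanked by letters on both sides.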