readsoftware / ReadIssues

This is an issue repository for READ. Intended for issues and feature change requests that arise during testing and development.

Crossline Token split doesn't update token location for next line token #193

Open stevewh opened 3 years ago

stevewh commented 3 years ago

When a token extends across physical lines, splitting it at the line break correctly creates separate tokens, but the location of the token on the second line is not recalculated.

Running the service refreshEditionWordLocations.php?db=yourDBName&ednID=### is the current workaround.
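
For anyone scripting the workaround, here is a minimal sketch of building that request URL. The function name and the `services_base` argument are hypothetical; the thread does not say where the service is hosted, so the base URL has to come from your own installation. Only the script name and the `db` and `ednID` parameters are taken from the comment above.

```python
from urllib.parse import urlencode


def refresh_locations_url(services_base, db_name, edition_id):
    """Build the URL for the refreshEditionWordLocations.php workaround.

    services_base is an assumption: the URL of the directory that serves
    the READ service scripts on your own installation.
    """
    # db and ednID are the two query parameters named in the issue.
    query = urlencode({"db": db_name, "ednID": edition_id})
    return f"{services_base.rstrip('/')}/refreshEditionWordLocations.php?{query}"
```

Calling `refresh_locations_url(base, "yourDBName", 12)` returns a URL that can be opened in a browser or passed to any HTTP client.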

xadxura commented 3 years ago

Dictionary picks up location from tokenization. Token sequence isn't getting updated with refreshEditionWordLocations.php

stevewh commented 3 years ago

We need to have a discussion about Tokenization containment versus Physical Lines. It became clear that tokens can wrap across one or more physical lines, at which point the code could no longer maintain alignment with physical lines. We cannot assume that the token "TextDivision" sequence containers are labeled the same as physical lines, and we cannot assume that they align. The physical line number or range is calculated from the first and last grapheme of the token, following each through its syllablecluster to the physical line. The line number is accurate, while the token's position within the line is approximate.
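
The lookup described above can be sketched roughly as follows. This is an illustration only, not READ's actual code: the two dictionaries (grapheme to syllable cluster, syllable cluster to physical line) and all function names are assumptions standing in for whatever the real data model provides.

```python
from itertools import groupby


def token_line_range(grapheme_ids, grapheme_to_syllable, syllable_to_line):
    """Return (first_line, last_line) for a token.

    Follows the first and last grapheme through their syllable cluster
    to the physical line, as described in the comment above.
    """
    if not grapheme_ids:
        raise ValueError("a token needs at least one grapheme")
    first = syllable_to_line[grapheme_to_syllable[grapheme_ids[0]]]
    last = syllable_to_line[grapheme_to_syllable[grapheme_ids[-1]]]
    return (first, last)


def format_location(line_range):
    """Render a single line number, or a range when the token wraps."""
    first, last = line_range
    return str(first) if first == last else f"{first}-{last}"


def split_by_line(grapheme_ids, grapheme_to_syllable, syllable_to_line):
    """Split a token's graphemes into one piece per physical line.

    Each piece carries its own line, so the piece on the second line
    does not keep the location of the original token.
    """
    def line_of(grapheme_id):
        return syllable_to_line[grapheme_to_syllable[grapheme_id]]

    # groupby only merges consecutive graphemes that share a line.
    return [(line, list(group)) for line, group in groupby(grapheme_ids, key=line_of)]
```

In this sketch a token whose graphemes sit on lines 5 and 6 reports "5-6" before the split, and after the split each piece reports only its own line, which is the recalculation the issue says is currently missing.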

On Sat, Jan 9, 2021 at 7:15 PM Andrew Glass notifications@github.com wrote:

Dictionary picks up location from tokenization. Token sequence isn't getting updated with refreshEditionWordLocations.php
