johnwdubois / rezonator

Rezonator: Dynamics of human engagement

Ghost words create alignment problems when "w" key is pressed #190

Open johnwdubois opened 5 years ago

johnwdubois commented 5 years ago

The problem
When the user chooses to display only "word" rather than "text" (with "w"), there are some tokens that simply disappear (because they contain no alphabetic characters). However, a little white area remains behind, and they continue to occupy a cell in the visible display, displacing the true words.

These can be considered "ghost words": tokens in the Word table that are not real words (e.g. they're non-alphabetic, etc.); but they may occupy space on the screen, when they shouldn't.

To reproduce

  1. Open any file (say, SBC002).
  2. Make a QuickStack of 15-20 lines or so (for the color it provides).
  3. Press "w" to toggle word-form to "word" (not "transcription").
  4. See the "ghost words" (non-words, containing no alphabetic characters), visible as plain white dots against the colored background.
  5. Toggle back and forth with "w" to see them appear and disappear.

Screenshot
In this screenshot, the ghost words (invisible, but they occupy a cell) are highlighted in green (SBC002, lines 1-20):

[Image: ghost words, SBC002 lines 1-20]

What is needed

  1. Don't show "ghost words", whether as a white space or anything else. There shouldn't be any empty cells in the middle of a line of text. Instead, each cell should be occupied by a real word, not a ghost.
  2. To do this, the adjacent real word should move one cell to the left, occupying the space that was vacated when you press "w" and the ghost word disappears.

How to implement

  1. Add a Boolean field called isWord to the Word grid (or vizWord grid). This would allow Rezonator to know whether a Token is a true Word (isWord = 1, the default) or not (isWord = 0).
  2. Based on the algorithm Rezonator uses for removing non-alphabetic characters from the display string, determine whether the token contains alphabetic characters. Use this to assign the value for isWord. The "words" (tokens) that have isWord = 0 will be things like pauses etc.
  3. (There's a better way to assign isWord, described below, that uses the Kind value from the new SBC import, but this will have to wait.)
  4. Using a strategy similar to that for "Dead" words, make Rezonator treat ghost words (isWord = 0) as if they were Dead. That is:
    • don't display them
    • for alignment purposes, act as if they don't exist
  5. Although Ghost words are treated kind of like Dead words, they should not be marked as Dead; let's keep these functions/values clearly separate.
  6. Exception: the EndNote. This is a non-word, but we do want to (always) display it, along with the true words (see #185).
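
The steps above can be sketched roughly as follows. This is an illustration in Python, not Rezonator's own code: the `Token` class, its field names, and `visible_tokens` are hypothetical, and the sketch assumes that Dead words are simply skipped when drawing. Ghost words are filtered on a separate condition, so they are never marked as Dead.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str                  # hypothetical field: the token's display string
    is_end_note: bool = False  # hypothetical flag for the EndNote exception (step 6)
    dead: bool = False         # existing Dead status, kept separate from is_word (step 5)

    @property
    def is_word(self) -> bool:
        # Step 2: a token counts as a word if it has any alphabetic character.
        return any(ch.isalpha() for ch in self.text)


def visible_tokens(line: list[Token], word_mode: bool) -> list[Token]:
    """Return the tokens to draw, in order; the list index is the display cell."""
    shown = []
    for tok in line:
        if tok.dead:
            continue  # assumed: Dead words are never drawn
        if word_mode and not (tok.is_word or tok.is_end_note):
            continue  # ghost word: skipped, so later tokens shift left
        shown.append(tok)
    return shown


line = [Token("..."), Token("so"), Token("(0.8)"), Token("yeah"), Token(".", is_end_note=True)]
print([t.text for t in visible_tokens(line, word_mode=True)])   # ['so', 'yeah', '.']
print([t.text for t in visible_tokens(line, word_mode=False)])  # all five tokens
```

Because the display cell is derived from the position in the filtered list, no empty cell is left behind when a ghost word is hidden.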

Future development: How to identify Ghost words using the "Kind" value

  1. In the Word grid, check the value in the Kind field (column). Real words (as opposed to pauses, breathing, etc.) should have Kind = "word".
  2. Kind = "word" is relevant because not all items in the Word table are true words.
  3. The values for the Kind field in the Word table should come from importing data in the new corpus format.
  4. If there is no Kind field in the Word table, as a temporary measure, do the following:
    • create the Kind field (in the Word table)
    • for tokens that are true words, set Kind = "word"
    • for tokens that are not true words, set Kind = "other"
  5. Note: This is NOT the same as determining whether the token contains some alphabetic characters, so a regex test will not be completely accurate for deciding whether a string is a word or not. For example, speaker vocalisms (coughing, etc.) and comments by transcribers are written using alphabetic characters, but neither is a true speaker-uttered word. For such tokens, Kind is NOT equal to "word". This is why it is better to take the values for Kind from the Kind value already specified in the imported corpus data.
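
A minimal sketch of this lookup, with the temporary fallback from step 4, might look like the following. It is written in Python for illustration only. The row layout (a dict with a hypothetical `text` key), the function names, and the case-insensitive comparison of Kind are assumptions, as is the use of the alphabetic test to fill in a missing Kind.

```python
def has_alpha(text: str) -> bool:
    """Stopgap test: does the string contain any alphabetic character?"""
    return any(ch.isalpha() for ch in text)


def resolve_kind(row: dict) -> str:
    """Prefer the imported Kind; fall back to the alphabetic test if it is missing."""
    kind = row.get("Kind")
    if kind:
        return kind.lower()  # assumed: Kind is compared case-insensitively
    return "word" if has_alpha(row["text"]) else "other"


def is_true_word(row: dict) -> bool:
    return resolve_kind(row) == "word"


imported = {"text": "(COUGH)", "Kind": "other"}  # Kind supplied by the corpus import
legacy = {"text": "(COUGH)"}                     # no Kind field: fallback applies
print(is_true_word(imported))  # False
print(is_true_word(legacy))    # True (the misclassification described in the note)
```

The second call shows why the fallback is only a temporary measure: a vocalism written with letters passes the alphabetic test even though it is not a true word.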

Alternatives you have considered

  1. Another approach would be to use the Place (or TokenPlace) and PlaceWord fields in the new SBCorpus csv files. Toggling "w" between "Text" and "Word" would switch between showing the tokens that have a non-zero integer value for Place and showing only those that (also) have a value for PlaceWord. This has the added benefit that it will directly determine which display column (vizCol) a word should be drawn in.
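
As a rough sketch of this alternative (in Python, for illustration only), the display column can be read straight from Place or PlaceWord depending on the toggle. The row layout, the hypothetical `text` key, and the function names are assumptions. The sketch also assumes that a missing, empty, or zero value means the token has no place in that mode.

```python
from typing import Optional


def display_column(row: dict, word_mode: bool) -> Optional[int]:
    """Return the column to draw in, or None if the token is hidden in this mode."""
    key = "PlaceWord" if word_mode else "Place"
    value = int(row.get(key) or 0)  # assumed: empty or missing means no place
    return value if value > 0 else None


def layout(rows: list[dict], word_mode: bool) -> dict[int, str]:
    """Map each occupied display column to the token drawn there."""
    cells = {}
    for row in rows:
        col = display_column(row, word_mode)
        if col is not None:
            cells[col] = row["text"]
    return cells


rows = [
    {"text": "...", "Place": 1, "PlaceWord": 0},
    {"text": "so", "Place": 2, "PlaceWord": 1},
    {"text": "(0.8)", "Place": 3, "PlaceWord": ""},
    {"text": "yeah", "Place": 4, "PlaceWord": 2},
]
print(layout(rows, word_mode=True))   # {1: 'so', 2: 'yeah'}
print(layout(rows, word_mode=False))  # {1: '...', 2: 'so', 3: '(0.8)', 4: 'yeah'}
```

With this approach the column numbering in "Word" mode is already contiguous in the data, so no cells need to be shifted at display time.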

Additional context
See also #185.

terrydubois commented 5 years ago

Words now hide themselves properly; however, there is still the issue of alignment. This will be a pretty complex alignment issue to tackle because of the number of annoying edge cases that exist (e.g. adding punctuation to a Rez-chain and then pressing W, or building a chain where each word is in front of punctuation and then pressing W). Rezonator currently only aligns chains and analyzes stretches at specific moments to reduce Race-to-Infinities, so we may need to include W presses as a time to refresh alignment. While these alignment bugs persist with ghost words, there are no known crashes involved.

johnwdubois commented 4 years ago

With the current SB Corpus data, the best way to address this is to treat a token as a ghost word when "Kind" is NOT equal to "Word" or "EndNote". (See #558.)

johnwdubois commented 4 years ago

Terry's detailed comment above shows that ghost words create issues for alignment, which still need to be addressed.