Chunks 1: Mark a Span - Githubissues

johnwdubois commented 6 years ago

Is your feature request related to a problem?
When a user is marking up Track chains or Rez chains, sometimes they want to include a multi-word sequence as a link, not just a Word. A multi-word sequence can be considered a Chunk, Span, or Phrase. Rezonator should make it easy for users to mark an arbitrary sequence of words as a Chunk, in order to include it in a Track chain or Rez chain.

Definitions

A Span is a continuous sequence of words. A Span is defined by an arbitrary startSpanWord and an arbitrary endSpanWord.
(Similarly, a unitSpan of units or lines is defined by an arbitrary startSpanUnit and endSpanUnit.)
A Chunk is a simplified version of a Phrase, as implemented for Rezonator purposes. A Chunk is like a Phrase (see "Additional Context"), but simpler, flatter, and more practical. (The term Chunk is borrowed from spaCy, which defines noun chunks as "base noun phrases", i.e. "flat phrases that have a noun as their head".)
A Plate is a visual representation of a Chunk (or a Phrase). The Plate is displayed as a rectangular box that surrounds the Chunk (or Phrase). See: Plate notation. Plates can Nest.
Nesting is when a Chunk appears within another Chunk (or a Phrase within a Phrase). It can be visualized as a Plate within a Plate. Nesting is recursive.

Example
Here is an example of several Spans/Chunks created while marking a Track chain (from SBC006). Spans appear in lines 10, 11, 13, and 121. In this example, the SideLink notation (borrowed from Rez chains) is intended to stand in for markup of a Span/Chunk. But in the actual Span/Chunk notation, no SideLinks would appear. Instead, the Span (sequence of words on the same line) would be surrounded by a single bounding rectangle (Plate).

[image: track chain with spans-chunks - example]

Here is another example. In this case the user would like to mark some Chunks in the midst of creating a Track chain (see the multi-word references to "granola woman" etc. in SBC004, lines 1289, 1291, and 1293):

[image: Track chain with Spans-Chunks - example 02]

Describe the solution you'd like
The goal is to allow the user to mark an arbitrary set of words as a Span (or Chunk).

To mark a Span or Chunk (using the SnapToGrid drag #222):
- click on the first word of the Span
- drag the cursor over one or more additional words on the same line
- release the mouse-click, to mark the last word of the Span/Chunk.
As a gesture, this is just dragging a finger over the Span of words.
The Span can be defined by an arbitrary startSpanWord and an arbitrary endSpanWord.
If the user dragged "backwards" (right to left), reverse the startSpanWord and endSpanWord, so they will appear in the standard sort order (left to right).
The process should be fast and seamless, allowing the user who is in the middle of marking a Track chain to:
- markup a Chunk with a single stroke; and then
- continue marking up the same Track chain (adding more Words or Chunks to it), without any interruption or special action required.
The drag that marks a Chunk should follow the SnapToGrid system #222.
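
The drag-to-Span rule described above could be sketched roughly as follows. This is an illustration in Python, not Rezonator code: the `Span` class and the `span_from_drag` function are hypothetical names, and words are assumed to be identified by their integer position within a single line, counted left to right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A continuous run of words on one line (hypothetical structure)."""
    start_span_word: int  # position of the leftmost word in the line
    end_span_word: int    # position of the rightmost word in the line

    def word_positions(self) -> list[int]:
        # Every word from start to end, inclusive, belongs to the Span.
        return list(range(self.start_span_word, self.end_span_word + 1))


def span_from_drag(press_word: int, release_word: int) -> Span:
    """Build a Span from the word under mouse-press and the word under release.

    A "backwards" (right-to-left) drag is normalized so that the start
    always precedes the end in left-to-right order.
    """
    start, end = sorted((press_word, release_word))
    return Span(start, end)
```

For instance, `span_from_drag(5, 2)` and `span_from_drag(2, 5)` yield the same Span, so later code never needs to know which way the user dragged.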

Data structure issues

A Chunk allows users to drop words from a Span, so it must be specified differently from a Span: as a list of Words (via their wordID's).
To represent Spans and Chunks internally in Rezonator, it may be informative to think about the strategy used for the SideLink (#82), which accommodates multiple words on the same line. The SideLink code may provide some inspiration, especially for how to align Chunks with Words. (But significant differences will likely be required for the Span code.)
By default, the Span will become a Chunk. Because a Chunk is basically a list of words that acts like a single word (for some purposes, such as sorting the links in a Track chain), we will need to either:
- put Chunks into the Word or VizWord grid (and mark them as Chunks), or
- create a Chunk grid that is similar or identical to the Word or vizWord (a.k.a. DynaWord) grid.
Whenever a new Chunk is created, it needs a unique ID value. Two alternatives:
- a new wordID is generated for the Chunk, or...
- a new chunkID is generated, which is intended to serve basically the same functions as a wordID
The Chunk grid should record a list of wordID's for each of the words that are included in the Chunk. (Using a list will allow for selective skipping or deleting of words from the Chunk, unlike a Span.)
It may be necessary to create a new field in the Word grid, or the vizWord grid (a.k.a. DynaWord), to specify that a "word" is actually a Chunk (Chunk = TRUE).
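
As a rough illustration of the second alternative (a separate chunkID and a separate Chunk grid), here is a minimal Python sketch. None of these names come from the Rezonator codebase: `Chunk`, `ChunkGrid`, `add_chunk`, `drop_word`, and `sort_key` are hypothetical, and the sketch assumes that wordIDs increase in text order so that the smallest wordID is the Chunk's first word.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Chunk:
    chunk_id: int
    word_ids: list[int]    # explicit list, so words can be dropped later
    is_chunk: bool = True  # mirrors the proposed "Chunk = TRUE" field


@dataclass
class ChunkGrid:
    """Hypothetical stand-in for a Chunk grid keyed by chunkID."""
    rows: dict[int, Chunk] = field(default_factory=dict)
    _next_id: count = field(default_factory=lambda: count(1))

    def add_chunk(self, word_ids: list[int]) -> Chunk:
        # Each new Chunk gets a fresh, unique chunkID.
        chunk = Chunk(chunk_id=next(self._next_id), word_ids=list(word_ids))
        self.rows[chunk.chunk_id] = chunk
        return chunk

    def drop_word(self, chunk_id: int, word_id: int) -> None:
        # Unlike a Span, a Chunk may skip a word inside its range.
        self.rows[chunk_id].word_ids.remove(word_id)

    def sort_key(self, chunk_id: int) -> int:
        # Assumption: a Chunk sorts among Words by its first wordID.
        return min(self.rows[chunk_id].word_ids)
```

Storing a list rather than a start/end pair costs a little space, but it is what makes selective deletion of words possible.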

Additional context

By default, when a user is marking Track chains, Justification should be set to Left-justified.
Nesting: Because Chunks can be Nested one inside the other, the Plate rectangles should also allow for nesting multiple Plates, one inside the other. For example, "this former car radio thief" (SBC006, line 10) has:
- [this [former [[car radio] thief]]]
- the brackets represent nested Plates (see #134)
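
One way to picture the recursion is to model each Plate as a nested list and render it back into the bracket notation. The sketch below is only an illustration in Python; the functions `render_plates` and `plate_depth` and the nested-list encoding are assumptions made for this example, not part of Rezonator.

```python
def render_plates(plate: list) -> str:
    """Render nested Plates as bracket notation.

    Each item of `plate` is either a word (str) or another nested list.
    """
    parts = [
        item if isinstance(item, str) else render_plates(item)
        for item in plate
    ]
    return "[" + " ".join(parts) + "]"


def plate_depth(plate: list) -> int:
    """Number of Plate levels, counting the outermost Plate as 1."""
    nested = [plate_depth(item) for item in plate if isinstance(item, list)]
    return 1 + max(nested, default=0)


# "this former car radio thief" (SBC006, line 10)
example = ["this", ["former", [["car", "radio"], "thief"]]]
print(render_plates(example))  # [this [former [[car radio] thief]]]
print(plate_depth(example))    # 4
```

The depth value is the number of rectangles that would have to be drawn around the innermost words.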
The use of Chunks (instead of Phrases) as the main constituent unit for Rezonator and spaCy fits well with the philosophy of a "shallow parse".
DEFINITION: A Phrase is a set of words that belong together (for syntactic or semantic reasons). A typical example is a noun phrase (such as "the new version"). But in linguistic theory, Phrases can get quite complicated (and long), with multiple levels of embedding (or nesting). Typically, the words in a Phrase (or Chunk) all appear as a continuous sequence within the same line (e.g. in the same intonation unit). That is, the words constitute a Span. But sometimes, a Phrase (or Chunk) can extend over two or more lines, and can skip over some words (i.e. be discontinuous). Because a Phrase can be discontinuous, it has to allow for the possibility that some words within the Span will be excluded from the Phrase.

Alternative user input: Shift-Click

Shift-click method. Alternatively, to mark a Span, first click on a word. The word is now highlighted (in focus). This will (potentially) become the first Word of a new Span (startSpanWord).
Shift-click on another word. (Usually this word will be to the right of the first word, within the same line, but it could be on a different line.) This word is now the end of the Span (endSpanWord).

johnwdubois commented 5 years ago

I changed the previous terminology, replacing the word "span" with "phrase". Spans are used in NLP for a different meaning (e.g. a continuous string), so "phrase" is more consistent with standard linguistic terminology.

Georgio-Klironomos commented 5 years ago

[image: chunkV1]
Here's how the newWord chunks look. There are a couple of problems: when the chunk is too large, it bleeds into the next display column, and, just like the newWords, they cannot be aligned correctly.

rezlinguist commented 5 years ago

It might be better to let the chunk be drawn according to the words that compose it, and let it be sorted and aligned by its first word. Jack

Georgio-Klironomos commented 5 years ago

So Jack, would that be the way that spans used to work? Not inserting a new "word", but drawing the box around the selected words and changing the properties of the first word to act as the chunk's anchor?

rezlinguist commented 5 years ago

Yes, kind of. In a way we need a hybrid solution, because for some purposes we do need to have a new "word", as a node that we can attach features to, and include in links to normal words. But this chunk-word can be somewhat abstract, so for drawing purposes we don't want to use the "insert user word" approach, I think. Just attach the box to certain normal words, which will be drawn according to the normal rules as much as possible.

johnwdubois commented 5 years ago

The Chunk/Box function looks promising. Here are some tweaks:

Make sure that all of the same words that are being highlighted for the user during the drag are included in the final Box wordList. Currently it looks like the last word of the Box is sometimes lost.
As much as possible, the Chunk/Box should play nicely with other functions, such as marking Rez chains, Track chains, QuickLinks, etc., letting you switch seamlessly between them. Rather than depending on a brush to specify the user intent as Chunk vs Track, try to recognize the user intent from the direction of the drag. For example, if the user is in the middle of making Track chains, and then does a horizontal mouse drag, and then releases it, this should make a Chunk, and then allow the user to continue making Track chains. Ideally it would actually add the new Chunk to the currently focused Track chain, and then allow the user to immediately go back to adding more Track links to the current Track.
To implement user intent efficiently, it will be important to test for DragDirection and DragAngle, as outlined in #222.
- DragDirection = {down, up, right, left, southeast, southwest, northeast, northwest}
- DragAngle = {straight, diagonal}
- If DragDirection = {down, up, right, left}, then DragAngle = straight
- If DragDirection = {right, left}, then DragHorizontal = 1
- If DragDirection = {southeast, southwest, northeast, northwest}, then DragAngle = diagonal
Depending on the drag angle, you can recognize the user intent as making a Chunk, or QuickLinks, or whatever. This will allow you to let them mark one Chunk (even without specifying the Box brush), and then go back to what they were doing before.
We should assume that users who are making Rez chains or Track chains will often want to make a single Chunk, include it in the current focused Rez or Track chain, and then go back to making the Rez or Track chain they were making before.
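
The DragDirection and DragAngle tests could look something like the Python sketch below. It is a guess at one possible implementation, not the scheme from #222: the function names are hypothetical, the drag is assumed to be measured in whole grid cells (change in column and change in row, as a SnapToGrid drag would give), and rows are assumed to be numbered top to bottom.

```python
from typing import Optional

STRAIGHT = {"down", "up", "right", "left"}
DIAGONAL = {"southeast", "southwest", "northeast", "northwest"}


def drag_direction(d_col: int, d_row: int) -> Optional[str]:
    """Classify a drag by its change in grid column and grid row.

    Assumption: a positive d_row means the pointer moved down the screen.
    Returns None when the pointer did not leave its starting cell.
    """
    if d_col == 0 and d_row == 0:
        return None
    if d_row == 0:
        return "right" if d_col > 0 else "left"
    if d_col == 0:
        return "down" if d_row > 0 else "up"
    vertical = "south" if d_row > 0 else "north"
    horizontal = "east" if d_col > 0 else "west"
    return vertical + horizontal


def drag_angle(direction: str) -> str:
    if direction in STRAIGHT:
        return "straight"
    if direction in DIAGONAL:
        return "diagonal"
    raise ValueError(f"unknown DragDirection: {direction!r}")


def drag_horizontal(direction: str) -> int:
    return 1 if direction in {"right", "left"} else 0


def is_chunk_gesture(d_col: int, d_row: int) -> bool:
    """A horizontal drag within one line is read as "make a Chunk"."""
    direction = drag_direction(d_col, d_row)
    return direction is not None and drag_horizontal(direction) == 1
```

With this in place, a caller in the middle of marking a Track chain could check `is_chunk_gesture` on mouse release, create the Chunk if it returns true, and otherwise fall through to whatever the other drag directions are mapped to.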

Georgio-Klironomos commented 5 years ago

Chunks no longer create a new word; instead, they are designated by a rectangle drawn around the words in the chunk. The rectangle is dynamic in that it can be stretched indefinitely but still contains the words.

johnwdubois commented 5 years ago

We use the Box simply as a tool to define a Chunk. So the Grid that records this information should be called Chunk, not Box.
The Chunk box scales up with line height. It would be better to have it stay closer to the size of the word box (that is, defined by font height and word length, rather than column width etc.).
When you mark a small (2-word) Chunk (#1), and then another Chunk (#2) that includes Chunk#1 plus a couple more words, the best way to represent this "nesting" relationship is that Chunk#2 contains Chunk#1 plus the extra couple of words. (So, represent Chunk#2 by listing the ChunkID for Chunk#1, rather than just a longer list of words.)
When a Chunk#1 is nested inside another Chunk#2, the box for Chunk#2 should be a little larger than the box for Chunk#1. It should enclose the smaller box, rather than overlapping with it.
Chunks don't allow you to include the last 2 words of the line. This should be allowed.
To distinguish Chunks from other items, make the color a dark grey, rather than black.
If possible, let's avoid giving analogue dimensionality to the Box when marking a Chunk. That is, as long as the drag stays within the confines of a single line, show only the discrete values (grey blocks) that show which words are in the Chunk.
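
The nesting-by-ChunkID idea and the "slightly larger enclosing box" rule could be sketched as follows. This Python example is a hypothetical illustration only: `WordRef`, `ChunkRef`, `flatten`, `nesting_level`, and `box_padding` are invented names, the padding values are arbitrary, and the sketch assumes that no Chunk ever contains itself, directly or indirectly.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class WordRef:
    word_id: int


@dataclass(frozen=True)
class ChunkRef:
    chunk_id: int


Member = Union[WordRef, ChunkRef]


def flatten(chunk_id: int, chunks: dict[int, list[Member]]) -> list[int]:
    """Expand a Chunk into its wordIDs, following nested ChunkRefs."""
    words: list[int] = []
    for member in chunks[chunk_id]:
        if isinstance(member, ChunkRef):
            words.extend(flatten(member.chunk_id, chunks))
        else:
            words.append(member.word_id)
    return words


def nesting_level(chunk_id: int, chunks: dict[int, list[Member]]) -> int:
    """0 for a Chunk of plain words, 1 + deepest child level otherwise."""
    child_levels = [
        nesting_level(m.chunk_id, chunks)
        for m in chunks[chunk_id]
        if isinstance(m, ChunkRef)
    ]
    return 1 + max(child_levels) if child_levels else 0


def box_padding(chunk_id: int, chunks: dict[int, list[Member]],
                base: int = 2, step: int = 3) -> int:
    # Assumed pixel values: each enclosing level grows the box by `step`,
    # so an outer box encloses an inner one instead of overlapping it.
    return base + step * nesting_level(chunk_id, chunks)


# Chunk#1 is two words; Chunk#2 is Chunk#1 plus a word on either side.
chunks = {
    1: [WordRef(4), WordRef(5)],
    2: [WordRef(3), ChunkRef(1), WordRef(6)],
}
```

Here `flatten(2, chunks)` recovers the full word list for Chunk#2 when it is needed, while the stored form keeps the nesting relationship explicit.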

johnwdubois / rezonator

Chunks 1: Mark a Span #24