Open jtauber opened 5 years ago
@jtauber can you provide more context or a link for "Reference Model: C2" above? This seems like it will be another "unit" of chunking, right?
Reference Model: C2 is just the broad classification of features from https://github.com/deep-philology/DeepReader/wiki/A-Reference-Model-for-Capabilities-of-Online-Readers
It might not be a chunking type (although it is in MorphGNT, see https://github.com/jtauber/vocabulary-tools/tree/master/gnt_data ). It could just be visual styling, e.g. indentation or margins.
In other words, it's rare for people to use paragraphs as a citation scheme, but they are very common as a way to visually break up the text. Obviously the fact that they aren't generally used as a citation scheme doesn't mean they can't be :-)
Beowulf and Homer both have paragraphs marked up but I don't think anyone would ever say "in paragraph 53...". A nice rendering of either Beowulf or Homer, though, might want to have vertical space between paragraphs or something.
That said, for prose it might be more useful for citation / addressing.
Thanks for the context.
I guess what I was getting at with "chunking" was "grouping", so even if you don't "reference" (as a human) paragraph 53 or we don't handle a "query" (from the frontend) for a particular paragraph, we're doing something within the data layer to annotate that the token with the value μῆνιν (which is idx 0 for the whole work, position 1 within urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1) is part of a larger "paragraph 1".
I'd still argue there's a difference between structure (which could be used for all sorts of things including visual treatment, citation, pagination, etc) and mere visual treatment.
Note also that often paragraph breaks are what is marked, not the overall structural unit of a paragraph. One could choose to map a paragraph break to <br/><br/><br/> (ugh!), which might achieve the desired visual effect while having zero to say about structure / grouping / chunking.
All this said, I don't think there's any harm in having the notion of a paragraph reference on a token.
In ReadBeowulf I have:
fitt_id = models.IntegerField(db_index=True)
para_id = models.IntegerField(db_index=True)
para_first = models.BooleanField()
line_id = models.IntegerField(db_index=True)
half_line = models.CharField(max_length=1)
token_offset = models.IntegerField()
(unlike https://github.com/jtauber/vocabulary-tools/tree/master/gnt_data where I have the various chunking schemes defined individually and mapped to token numbers, but you can obviously switch easily between one representation and the other)
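To illustrate switching between the two representations, here is a minimal sketch. It assumes a per-token list of paragraph ids on one side and a list of `(para_id, first_token, last_token)` ranges on the other. That range shape and both function names are hypothetical, chosen for illustration; they are not the actual gnt_data format or ReadBeowulf code.

```python
from itertools import groupby


def token_ids_to_ranges(para_ids):
    """Collapse per-token paragraph ids into (para_id, first_idx, last_idx) ranges.

    Assumes tokens are in document order and each paragraph is contiguous.
    """
    ranges = []
    idx = 0
    for para_id, group in groupby(para_ids):
        count = sum(1 for _ in group)
        ranges.append((para_id, idx, idx + count - 1))
        idx += count
    return ranges


def ranges_to_token_ids(ranges):
    """Expand (para_id, first_idx, last_idx) ranges back into one id per token."""
    para_ids = []
    for para_id, first, last in ranges:
        para_ids.extend([para_id] * (last - first + 1))
    return para_ids


# Hypothetical data: six tokens spread over three paragraphs.
per_token = [1, 1, 1, 2, 2, 3]
ranges = token_ids_to_ranges(per_token)
round_trip = ranges_to_token_ids(ranges)
```

The round trip only loses nothing if the ranges are contiguous and cover every token, which is the case when they are derived from a per-token column like para_id.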
Note that in the Beowulf token model I have a para_first boolean which indicates "this is the first token in a new paragraph". It could be, for example, that this flag triggers the visual treatment, rather than an actual para_id.
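As a rough sketch of how a para_first flag alone could drive the visual treatment, the snippet below starts a new block of text whenever the flag is set and separates blocks with a blank line. The `Token` dataclass and `render_paragraphs` function are hypothetical stand-ins for illustration, not the ReadBeowulf model or its rendering code, and the token texts are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    para_first: bool  # True if this token opens a new paragraph


def render_paragraphs(tokens):
    """Join tokens with spaces, inserting a blank line before each para_first token."""
    paragraphs = []
    for token in tokens:
        # Start a new paragraph on the flag, or for the very first token.
        if token.para_first or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(token.text)
    return "\n\n".join(" ".join(words) for words in paragraphs)


# Placeholder tokens: two paragraphs of three and two tokens.
tokens = [
    Token("a", True),
    Token("b", False),
    Token("c", False),
    Token("d", True),
    Token("e", False),
]
rendered = render_paragraphs(tokens)
```

Note that nothing here needs a para_id: the break marker is enough for rendering, while grouping or addressing by paragraph would need the id (or a running count of the flags).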
Reference Model: C2