tibetan-nlp / classical-tibetan-corpus

Linguistically analyzed Classical Tibetan texts

double shad issue #1

Open ngawangtrinley opened 6 years ago

ngawangtrinley commented 6 years ago

ས་ [གྱིས་√case] case.agn ADP Case=Agn #18->17 @case མཆོད [མཆོད་] v.invar VERB #19->19 ། [།] punc PUNCT #20->19 @punct ། [།] punc PUNCT #1->1

The second shad isn't the start of the next sentence; it is part of the previous one. Double and quadruple shad mark the end of a verse or minor section and of a chapter or major section, respectively. This should be:

། [།] punc PUNCT #20->19 @punct ། [།] punc PUNCT #21->19 @punct

heacu commented 6 years ago

are you sure? i'm convinced i've seen texts with one shad before and one shad after a "sentence". i think they are page scans, too.

in any case, i'm not sure this has much consequence because in the example above the shads are just used to delimit the windows used for the grammars being applied, so whether a shad is separated out into a separate "sentence" or not doesn't really impact anything. and you might just consider the display of it a rendering issue.

heacu commented 6 years ago

well, i suppose if we are really coding the punctuation properly then it is important to get the dependency right for double shad vs single shad, since they have different implications as you say in terms of text divisioning, but i guess since we in our project are not attempting to deal with discourse relations we haven't been too worried about getting this right.

ngawangtrinley commented 6 years ago

You're right that some texts are formatted this way, and that's usually bad formatting. Double shad actually have a meaning, so I think it would make sense to format them properly if it's not too much extra work on your side. This will avoid confusion when people use your data for further analysis. It'll also be easier on the eyes for Tibetan readers. I guess the equivalent in English would be "Hello !My name is Paul .I'm from France .", which looks weird and might lead people to think that sentences start with full stops.

They show the end of verses, paragraphs, chapters, or texts. It's actually not an ambiguous topic in Tibetan grammar texts. The most common explanation you'll find is this section of the ལེགས་བཤད་ལྗོན་དབང་། :

ལྷུག་པའི་དོན་མང་མིང་མཚམས་དང་། ། དོན་འབྲིང་འབྱེད་དང་དོན་ཉུང་རྫོགས། ། ཚིགས་བཅད་ག་མཐར་ཆིག་ཤད་བྱ། །
# Single shad are used for prose, isolated words, medium and small sections, and after ག in verses.

རྫོགས་ཚིག་མཐའ་ཅན་ལྷུག་པ་དང་། ། ཚིགས་བཅད་རྐང་མཐར་གཉིས་ཤད་ཐོབ། །
# Double shad are used after rdzogs tshig in prose, and at the end of verse.

དོན་ཚན་ཆེན་མོ་རྫོགས་པ་དང་། ། ལེའུའི་མཚམས་སུ་བཞི་ཤད་དགོས། །
# 4 shad should be used to mark the end of larger sections and chapters.

ང་ཡིག་མ་གཏོགས་ཡིག་ཤད་བར། ། ཚེག་མེད་དེ་སོགས་ཞིབ་ཏུ་འབད། །
# Apart from the letter ང, tsek shouldn't be inserted between letters and shad.

Here's a commentary with examples if it's of any help.
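To make the counting rule concrete, here is a minimal Python sketch that finds runs of shad and labels them by length. It assumes the text uses only the single shad code point U+0F0D, and that shad separated by nothing but spaces belong to the same run; the function name `classify_shad_runs` and the labels are illustrative and not part of the corpus tooling.

```python
import re

SHAD = "\u0F0D"  # TIBETAN MARK SHAD

# One shad, then any further shad that are separated from it only by spaces.
SHAD_RUN = re.compile(f"{SHAD}(?:[ \u00A0]*{SHAD})*")

# Labels follow the three cases in the verses quoted above.
LABELS = {1: "single shad", 2: "double shad", 4: "quadruple shad"}


def classify_shad_runs(text):
    """Return (offset, count, label) for every run of shad in text."""
    runs = []
    for match in SHAD_RUN.finditer(text):
        count = match.group().count(SHAD)
        runs.append((match.start(), count, LABELS.get(count, "other")))
    return runs
```

Under these assumptions a sequence such as `། །` is reported as one double shad regardless of the space between the two marks.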

heacu commented 6 years ago

Thanks for this, but two points. First, just to reiterate, a blank line in the dependency structure doesn't imply anything about the display, so we are free to apply whatever rendering rules we like.

Second, are you really willing to claim that all counterexamples are "bad formatting"? I just did a random search on TBRC, and on page 3 of this random text there are plenty of counterexamples, as you can see.

[image: page3]

I'm sure I've seen other similar page scans.

I'd rather be descriptive than prescriptive.

In any case, if stand-off markup is used (as in BRAT), it should hopefully not be too difficult to preserve the whitespace present in digital source texts.
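As a rough illustration of the stand-off idea, the Python sketch below stores annotations as character offsets into an untouched source string, so whatever whitespace the digital text has is preserved. The `Span` class and the `annotate_shads` helper are hypothetical names for this example; this is not BRAT's own file format.

```python
from dataclasses import dataclass

SHAD = "\u0F0D"  # TIBETAN MARK SHAD


@dataclass(frozen=True)
class Span:
    start: int  # offset of the first character, inclusive
    end: int    # offset just past the last character
    label: str


def annotate_shads(source):
    """Mark every shad as a PUNCT span without modifying the source text."""
    return [Span(i, i + 1, "PUNCT") for i, ch in enumerate(source) if ch == SHAD]


def surface(source, span):
    """Recover the annotated substring from the original text."""
    return source[span.start:span.end]
```

Because the spans only point into the source, the dependency layer can group the two shad together while the display layer still shows the original spacing.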

samyorode commented 6 years ago

I am not so familiar with this topic, but I have seen ། at the beginning of sentences a lot: this applies to block prints as well as to modern publications.

[images: four screenshots, 2018-11-02]

ngawangtrinley commented 6 years ago

@torma and @samyorode, I'm sorry but I don't see a single counterexample in the images you sent. As I said before, the grammar is unambiguous on that point.

Not a counterexample:

ལྷུག་པའི་དོན་མང་མིང་མཚམས་དང་། །དོན་འབྲིང་འབྱེད་དང་དོན་ཉུང་རྫོགས། །ཚིགས་བཅད་ག་མཐར་ཆིག་ཤད་བྱ། །

༄༅། །ལྷུག་པའི་དོན་མང་མིང་མཚམས་དང་། །དོན་འབྲིང་འབྱེད་དང་དོན་ཉུང་རྫོགས། །ཚིགས་བཅད་ག་མཐར་ཆིག་ཤད་བྱ། །

Notice there's no shad at the beginning and two shad at the end. The yig mgo takes two shad ༄༅། །, while the half one can just take one ༄།.

A counter example that I would qualify as bad formatting would be: །ལྷུག་པའི་དོན་མང་མིང་མཚམས་དང་། །དོན་འབྲིང་འབྱེད་དང་དོན་ཉུང་རྫོགས། །ཚིགས་བཅད་ག་མཐར་ཆིག་ཤད་བྱ། I'm sure you saw that too, and you probably noticed that it's much less common than ལྷུག་པའི་དོན་མང་མིང་མཚམས་དང་། ། དོན་འབྲིང་འབྱེད་དང་དོན་ཉུང་རྫོགས། ། or ལྷུག་པའི་དོན་མང་མིང་མཚམས་དང་།། དོན་འབྲིང་འབྱེད་དང་དོན་ཉུང་རྫོགས།།

Please ask around or check more grammar treatises if you need more assurance. I've had a special interest in layout and formatting as I was in charge of compiling and publishing lothos, medical/astrological books, prayer books in Sherabling for several years. There certainly are contentious points in the world of pecha layout, but this definitely isn't one of them.

As for your second point, the issue is in the token index of dependencies such as here:

heacu commented 6 years ago

@ngawangtrinley i'm really confused now. it appeared that you were asserting that two shads one after another always equal double shad, and that double shad should typographically occur immediately after the end of the sentence, with no whitespace between but whitespace following.

@samyorode and i have provided cases from texts including scans and blockprint where there is a shad that immediately precedes a "sentence" and a shad that immediately follows, with whitespace before the first shad and whitespace after the second shad.

we are not asserting that this is double shad, just that two adjacent shad separated by at most whitespace are not necessarily double shad.

am i missing something?

nh36 commented 6 years ago

Ngawang is right. Two shads can appear, in terms of formatting, as if one is at the end of something and the other at the beginning of something, but that is a formatting issue only. Two shads in a row are a double shad in terms of what punctuation is being used. The single shad is one piece of punctuation and the double shad is another, and you never have two single shads in a row.

heacu commented 6 years ago

ok, i get the point. with regard to the dependency structure, it is no problem for us to join any sentences that consist of just shad with the preceding sentence, making the shad depend on the same word as the previous shad. that can and will at your suggestion be automated later.
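A minimal Python sketch of that joining step follows. It assumes each sentence is a list of token dicts with `form`, `id` and `head` keys, mirroring the `#id->head` notation in the example at the top of this issue; the data model and the function names are hypothetical and are not the project's actual export code.

```python
SHAD = "\u0F0D"  # TIBETAN MARK SHAD


def is_shad_only(sentence):
    """True when a non-empty sentence contains nothing but shad tokens."""
    return bool(sentence) and all(tok["form"] == SHAD for tok in sentence)


def merge_shad_sentences(sentences):
    """Attach shad-only sentences to the sentence that precedes them."""
    merged = []
    for sentence in sentences:
        if merged and is_shad_only(sentence):
            previous = merged[-1]
            # Assumption: the previous sentence ends in a shad, so the new
            # shad reuses that token's head.
            head = previous[-1]["head"]
            for tok in sentence:
                next_id = previous[-1]["id"] + 1
                previous.append({"form": tok["form"], "id": next_id, "head": head})
        else:
            # Copy the tokens so the caller's data is left untouched.
            merged.append([dict(tok) for tok in sentence])
    return merged
```

Applied to the example from the first comment, the stray `#1->1` shad would become `#21->19`, which is the form requested there.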

with regard to the formatting, i'm not prepared to take a stance on what is bad or good formatting, and would prefer to take the texts as they come in. i'm happy for the text to resemble the blockprint or whatever in its formatting and don't think that's a problem. so doing stand-off markup we can just leave the whitespace where it is.

perhaps you'd like to normalize sequences of two shad to double shad in unicode, but i'm not certain that we should do that, given that the double shad might be difficult to format according to the digital text that we receive.

on the other hand, if we've already screwed up the input texts by ignoring the positioning of shads then we are already not preserving the input text, and so might as well go with a solution that not only has the dependency structure you suggest but also puts the "two shads" together after a sentence without whitespace separating them.

i'll keep this issue open until the dependency changes have been made, but i don't think we disagree any more so there's no need for further discussion?

nh36 commented 6 years ago

I think in Unicode it was stupid to have two shads, precisely for reasons like this. The period and the ellipsis are different punctuation (and are different in Unicode), but three periods is easily and uniquely interpretable as an ellipsis.

heacu commented 6 years ago

Actually @ngawangtrinley I do have a follow up clarification question for you about citing sentences. Let's say we have:

... །text1། །text2། །text3། ...

We preserve this text and leave it as is visually with whitespace splitting the double shad. We also adjust the dependency in accordance with your suggestion.

My question is: how do we cite the middle sentence if we are extracting it to display on its own? Strictly copying the text we'd get:

text2། །

I'm assuming this is nonsensical because the context which was responsible for pulling the right half of the double shad is absent. I'm assuming that we should write instead:

text2།།

Is that correct? This would not be the only case where citation forms will differ from what's in a text.
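For illustration, the citation form proposed above could be produced with a small Python helper like the one below. It only touches the run of shad at the very end of the extracted string; the function name `citation_form` and the `keep_space` flag (which yields the spaced variant instead) are hypothetical.

```python
import re

SHAD = "\u0F0D"  # TIBETAN MARK SHAD

# A run of shad at the end of the string, each optionally followed by whitespace.
TRAILING_SHADS = re.compile(f"(?:{SHAD}\\s*)+$")


def citation_form(sentence, keep_space=False):
    """Normalize the spacing of the shad run that closes an extracted sentence."""
    match = TRAILING_SHADS.search(sentence)
    if match is None:
        return sentence  # no closing shad, nothing to normalize
    count = match.group().count(SHAD)
    joiner = " " if keep_space else ""
    return sentence[:match.start()] + joiner.join([SHAD] * count)
```

Shad inside the sentence are left alone, so only the closing punctuation differs from the source text.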

ngawangtrinley commented 6 years ago

I'm starting to think that the confusion comes from us being used to seeing spaces as sentence separators, whereas in Tibetan it's the །. You'll see both text2།། and text2། ། in citations. I think you're right in saying that the space is being pulled by the text that follows, but leaving the space in between the shad is fine too. Actually in a pecha you would put the second shad right on the right border of your page, or trace an inner border if there was too much space left.

I 100% agree with Nathan that double shad in Unicode doesn't make sense. Shad are used to mark the end of phrases, but they are also one of the ways to close all spaces so as not to let merit slip away (a religious way of preventing paper waste?). ། ། are used with extensible space at the end of lines, together with ་་་, to make sure no opening is left. Other features include borders in pecha format, and the ནོར་བུ་སྤུངས་ཤད། ༑ with jewels to balance the isolated/split syllables at the start of lines. "text1།། text2།།" definitely feels wrong, and might be connected to the thing about not leaving open spaces. Modern book layout completely changed these rules though.

I used to have a small layout manual in Sherabling, that covered some of these points but I can't find it anymore.

heacu commented 5 years ago

I've revised our export algorithm so that a sequence of shads is always attached to the preceding "sentence" rather than the following "sentence". I just need to upload the texts here and then this issue can be closed.