sanskrit-lexicon / BOR

Development of BOR dictionary

comma, semicolon etc inside Devanagari scope #1

Open drdhaval2785 opened 3 years ago

drdhaval2785 commented 3 years ago

Below are the statistics for the various intermediate items (the middle group) found by the following regex: "({#[^#]*)([^a-zA-Z0-9 ]+)([^#]*#})"

[(u',', 20966), (u'.', 7697), (u';', 5704), (u':', 2785), (u'-', 2267), (u'*', 639), (u')', 88), (u"'", 75), (u'\u201d', 25), (u'\xa6', 16), (u'\u201c', 14), (u'?', 12), (u'(', 6), (u'#}){#', 4), (u'#}({#', 4), (u'\u2019', 3), (u'!', 3), (u'~', 2), (u'\u2018', 1), (u'+', 1), (u'/', 1), (u'=', 1), (u']', 1)]
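For anyone who wants to reproduce a tally of this kind, here is a minimal sketch. It is not the script that produced the figures above: it counts every non-alphanumeric, non-space character inside each {#...#} span individually, so its totals may well differ from the list. The function name and the second sample line are made up for illustration.

```python
import re
from collections import Counter

SPAN = re.compile(r"\{#(.*?)#\}")     # one {#...#} Sanskrit span
PUNCT = re.compile(r"[^a-zA-Z0-9 ]")  # same character class as the middle group above

def punctuation_counts(lines):
    """Count each non-alphanumeric, non-space character found inside {#...#} spans."""
    counts = Counter()
    for line in lines:
        for span in SPAN.findall(line):
            counts.update(PUNCT.findall(span))
    return counts

# Illustrative input only; the second line is invented, not taken from bor.txt.
sample = [
    '<div n="lb"/>{#aSoko vfkzaviSezaH.#} </div>',
    "{#rAmaH, kfzRaH;#} and {#so'pi.#}",
]
print(punctuation_counts(sample).most_common())
```

To run it on the real file, one would pass the lines of bor.txt (opened with the file's actual encoding) instead of `sample`.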

funderburkjim commented 3 years ago

9716 matches in 9352 lines for "{#[^#]*[.]" in buffer: bor.txt

I don't know why this number differs from the (u'.', 7697) count above.

funderburkjim commented 3 years ago

The period is significant

In SLP1, the period character '.' represents danda (and two periods '..' represent double danda, which in Unicode Devanagari is a separate code point).

In headword 'a' in bor, the first period inside {#X#} is at line 21:
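As a small illustration of why the period matters, the sketch below converts only the periods of an SLP1 string to the Unicode danda characters (U+0964 and U+0965). It is not a full SLP1 transliterator, and the function name is hypothetical.

```python
DANDA = "\u0964"         # DEVANAGARI DANDA
DOUBLE_DANDA = "\u0965"  # DEVANAGARI DOUBLE DANDA

def slp1_periods_to_dandas(slp1_text):
    """Replace SLP1 '..' and '.' with double danda and danda; leave everything else as is."""
    # Handle '..' first so a double danda is not turned into two single dandas.
    return slp1_text.replace("..", DOUBLE_DANDA).replace(".", DANDA)

print(slp1_periods_to_dandas("aSoko vfkzaviSezaH."))
```

Any display code that follows this convention will turn an English full stop left inside {#X#} into a danda.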

<div n="lb"/>{#aSoko vfkzaviSezaH.#} </div><div n="I">IV When used distribu-

A Devanagari display shows the period transformed into a danda: [image]

But in the scan, the period is just a period (English punctuation):

[image]

In this case, the period should be moved outside of the {#X#}:

<div n="lb"/>{#aSoko vfkzaviSezaH#}. </div><div n="I">IV When used distribu-

funderburkjim commented 3 years ago

Move all periods?

If there is Devanagari text in BOR which really does have a danda, then the corresponding period character in bor.txt should be retained.

So the general answer has to be No, don't move all periods outside of {#X#} in bor.

However, AFAIK, dandas generally appear in Sanskrit verses. And, in BOR, the Sanskrit text seems to be 'short' (this generalization for bor.txt is based on random browsing of the 9000+ instances of Sanskrit text with periods).

Thus, for bor, I suspect it is safe to globally move the periods.

funderburkjim commented 3 years ago

Most periods at end

9559 matches in 9200 lines for "[.]#}" in buffer: bor.txt

Almost all the periods in bor Sanskrit text occur just before the ending markup "#}". These ".#}" sequences can easily be changed to "#}.".

Changing the rest would be a bit trickier, but likely doable by a regex replacement.
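The easy case (a period immediately before the closing #}) could be sketched as follows. The function name is hypothetical, and skipping '..' is an assumption added here so that a possible double danda is left for manual review. Periods elsewhere inside the span are not touched.

```python
import re

# A single period right before the closing #}.
# The lookbehind skips '..' (SLP1 double danda); that exclusion is an assumption of this sketch.
TRAILING_PERIOD = re.compile(r"(?<!\.)\.#\}")

def move_trailing_period(line):
    """Rewrite '.#}' as '#}.' so the period lands outside the Sanskrit span."""
    return TRAILING_PERIOD.sub("#}.", line)

before = '<div n="lb"/>{#aSoko vfkzaviSezaH.#} </div><div n="I">IV When used distribu-'
print(move_trailing_period(before))
```

Applied to line 21 of headword 'a', this yields the corrected form shown earlier in the thread.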

funderburkjim commented 3 years ago

Move other characters?

The apostrophe also has significance in SLP1, as avagraha. It should NOT be moved outside of {#X#}.

Similarly, the characters \ / ^ are used for accents, but bor.txt probably doesn't have these. Also, '|' and '~' have significance in SLP1.

Certainly semicolon and comma have no significance in SLP1. Almost all of these in the bor.txt Sanskrit occur at the end, just before #}. So simple replacements of ;#} to #}; (and similarly for comma) would be slight improvements to the coding of Sanskrit.
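A minimal sketch of that replacement, with a hypothetical function name. It only touches commas and semicolons that sit directly before the closing #}, so the apostrophe and other characters that are meaningful in SLP1 stay where they are. The sample strings are invented for illustration.

```python
import re

# One or more commas/semicolons directly before the closing #}.
TRAILING_COMMA_SEMICOLON = re.compile(r"([;,]+)#\}")

def move_trailing_comma_semicolon(line):
    """Rewrite ';#}' as '#};' and ',#}' as '#},'; nothing else in the span is changed."""
    return TRAILING_COMMA_SEMICOLON.sub(r"#}\1", line)

print(move_trailing_comma_semicolon("{#rAmaH;#} x {#so'pi,#}"))
```

Commas and semicolons in the interior of a span are left alone here; those would need a separate pass.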