MW Accent - Githubissues

funderburkjim commented 3 years ago

This issue in response to comment/question from AB in his 'missed' document.

Here's the comment:

R(1323,1) ṛtavyà-vat (RV.)
<L>38510<pc>224,1<k1>ṛtavya<k2>ṛtavyâ<e>2 
;; HW accent to be revised in these entries. And the long vowel â to be corrected as small letter a.
;; This has indicated another inconsistency in the data: over 100 occurrences of â inside <k2> tag, 
which should've been something like ā<srs>.
;; how many such other issues Jim could guess?

The â in k2 is actually not an accident. Here's the reason as I understand it, although my understanding of accents in general is extremely shallow; I am paraphrasing what I understood Peter Scharf to have meant. In MW, there are two kinds of printed accents -- acute and grave. The acute printed accent corresponds to Sanskrit accent type udAtta, and is by far the more common in MW. The grave printed accent corresponds (as I understood it) to Sanskrit accent type 'svarita' (not anudAtta).

In SLP1 transliteration, the three accents are represented by:

following / (e.g. a/) for udatta
following \ (e.g. a\) for anudatta
following ^ (e.g. a^) for svarita.
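
The SLP1 convention above can be sketched in Python. This is only an illustration, not Cologne's tooling: the helper name `slp1_accents` and the vowel set are assumptions made for the example.

```python
# Hypothetical helper: list the accented vowels in an SLP1 string.
# SLP1 writes the accent character immediately after the vowel it modifies.
SLP1_VOWELS = set("aAiIuUfFxXeEoO")
ACCENT_NAMES = {"/": "udatta", "\\": "anudatta", "^": "svarita"}


def slp1_accents(text):
    """Return (vowel, accent name) pairs in order of appearance."""
    found = []
    for prev, ch in zip(text, text[1:]):
        if ch in ACCENT_NAMES and prev in SLP1_VOWELS:
            found.append((prev, ACCENT_NAMES[ch]))
    return found


print(slp1_accents("ftavya^"))  # SLP1 spelling of the k2 form quoted above
print(slp1_accents("Adya/"))
```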

Thus, the 'a^' in the SLP1 spellings of 'k2' represent a+svarita, which is, according to Peter as I understood him, the correct interpretation of what MW represents as a+grave.

Note that this seems consistent with the usage described in Whitney Grammar Section 83

The further wrinkle regards the correspondence between SLP1 and IAST. We (I) take the authority for IAST representation of Devanagari to be https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration. In this source, there is no mention of accents. In the Cologne site, we have chosen to represent IAST accents by

udatta : acute accent (preformed unicode where possible)
anudatta : grave accent (preformed unicode where possible)
svarita : circumflex accent (preformed unicode where possible)
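
As a rough sketch of what "preformed unicode where possible" can mean in practice: Unicode NFC normalization picks the precomposed character when one exists and otherwise leaves the combining mark in place. The table and function names below are illustrative assumptions, not the Cologne code.

```python
import unicodedata

# Combining marks for the three accents under the convention listed above.
COMBINING = {
    "udatta": "\u0301",    # COMBINING ACUTE ACCENT
    "anudatta": "\u0300",  # COMBINING GRAVE ACCENT
    "svarita": "\u0302",   # COMBINING CIRCUMFLEX ACCENT
}


def accented(iast_vowel, accent):
    """Attach an accent to an IAST vowel, precomposed where Unicode allows."""
    return unicodedata.normalize("NFC", iast_vowel + COMBINING[accent])


# a, a-macron, r-dot-below: only plain 'a' has a precomposed circumflex form.
for vowel in ("a", "\u0101", "\u1e5b"):
    s = accented(vowel, "svarita")
    print(s, [f"U+{ord(c):04X}" for c in s])
```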

To summarize the svarita accent case in MW:

What looks in print as a grave accent in MW is really a representation of a Sanskrit svarita accent.
The accepted way to represent svarita accent in SLP1 is with a trailing circumflex (e.g. a^)
The general (non-MW/Whitney) way to represent svarita accent in our extension of IAST is with circumflex (e.g. â).

funderburkjim commented 3 years ago

Note: In the process of corrections to MW, we have introduced a small number of 'grave accents' à. Based on the discussion above, these should be changed to circumflex: â

<L>23899<pc>137,1<k1>ādya<k2>ādyá<e>2B
<s>ādyá</s> ¦ (for <hom>2.</hom> <s>ādyà</s> See <ab>s.v.</ab>)
<LEND>

<L>176844<pc>875,1<k1>rājanya<k2>rājanyà<e>2
<s>rājanyà</s> ¦ <lex>mf(<s>ā̀</s>)n.</lex> kingly, princely, royal, <ls>RV.</ls> &c. &c.<info lex="m:f#A:n"/>
<LEND>

And I am doing so in the next revision of mw.

Andhrabharati commented 3 years ago

Agreed that slp1 for à is â.

Now my point changes thus: why do those slp1 characters remain in <k2> and <s> strings? Shouldn't they have been converted to the (proposed) IAST?

Supposing that â is the proposed IAST form, why wasn't the à chosen instead?

funderburkjim commented 3 years ago

No, slp1 for à is a^

The iast for a^ is â.

funderburkjim commented 3 years ago

How does Katre represent a-svarita ?

Andhrabharati commented 3 years ago

Anyway, I should leave it to you, to decide and continue further.

[All such differences would have to be noted separately for our (AB) use.]

Andhrabharati commented 3 years ago

How does Katre represent a-svarita ?

Just like the MW print.

In fact all the books that I've seen are having it only thus.

funderburkjim commented 3 years ago

Can you generate a+macron+combining-circumflex?

funderburkjim commented 3 years ago

Also, does Katre represent a-anudatta? If so, is it different from his a-svarita?

Andhrabharati commented 3 years ago

PFA the page from Katre-

Andhrabharati commented 3 years ago

We can generate any combination of diacs using this link-

http://titus.uni-frankfurt.de/unicode/unicsel/unicself.htm

I am using this for the Greek and other scripts as well.

Andhrabharati commented 3 years ago

E40D LATIN SMALL LETTER A WITH MACRON AND CIRCUMFLEX ABOVE 0061 + 0304 + 0302 ā̂
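
The three-character sequence quoted here can be inspected with Python's standard `unicodedata` module. This is a small sketch for checking the code points; it is not part of any tool mentioned in this thread.

```python
import unicodedata

seq = "\u0061\u0304\u0302"  # a + combining macron + combining circumflex
for ch in seq:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFC composes a + macron into one character; the circumflex stays combining.
nfc = unicodedata.normalize("NFC", seq)
print([f"U+{ord(c):04X}" for c in nfc])
```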

funderburkjim commented 3 years ago

Good link. Thanks: ā̂

Andhrabharati commented 3 years ago

As far as I know, the anudātta is marked with ◌॒ (Unicode: U+0952).

I need to see if Katre has used it anywhere.

Andhrabharati commented 3 years ago

This is what Katre has in his Pāṇini’s Aṣṭādhyāyī-

Nothing mentioned about this in his Dictionary of Pāṇini.

funderburkjim commented 3 years ago

alphabet_accent is a reference for the current correspondence between slp1 and IAST.

@Andhrabharati Note especially the vowel+diacritics -- If you cut and paste from these, it will simplify my work in transcoding back to slp1.

Andhrabharati commented 3 years ago

About these accents, now I would like to bring to your (@funderburkjim) notice the following-

Peter & Malcolm's Linguistic Issues in Encoding Sanskrit [https://sanskritlibrary.org/Sanskrit/pub/lies_sl.pdf] says thus (pp. 16-17)-

1.4 Roman transliteration

... Of particular importance as regards standardization of the schemes used by European scholars was the Geneva Oriental Congress of 1894 (Wujastyk, 1996). Contemporary schemes for Romanizing Sanskrit are quite similar to those employed in the nineteenth century and are characterized by the following conventions: ...

Acute and grave accent marks indicate the udātta and independent svarita accents, respectively (yé, kvà); the dependent svarita (ī in agním īḷe) and the anudātta (naḥ) accent are usually left unmarked.

-------------------------- And App. C shows all these x^ (slp1) as x̀ (Roman).

Hope with this, @funderburkjim would now think of changing the accents as mentioned by the creators of slp1 themselves, which is the way I was using in all my remarks/comments.

-------------------------- Note: The Geneva Oriental Congress of 1894 is where IAST was born. [Lesson: Wiki has given "easy access" to many articles and much information from many corners of the world; but in too many cases, one has to cross-check them instead of taking them "for granted".]

Andhrabharati commented 3 years ago

And now the summary of these accents as seen in the mw_iast file posted by Jim today.

|a^|â|LATIN SMALL LETTER A WITH CIRCUMFLEX| count: 4303
|a\|à|LATIN SMALL LETTER A WITH GRAVE| count: 10

|i^|î|LATIN SMALL LETTER I WITH CIRCUMFLEX| count: 141
|i\|ì|LATIN SMALL LETTER I WITH GRAVE| count: 0

|u^|û|LATIN SMALL LETTER U WITH CIRCUMFLEX| count: 31
|u\|ù|LATIN SMALL LETTER U WITH GRAVE| count: 1

|f^|ṛ̂|LATIN SMALL LETTER R WITH DOT BELOW + COMBINING CIRCUMFLEX ACCENT| count: 0
|f\|ṛ̀|LATIN SMALL LETTER R WITH DOT BELOW + COMBINING GRAVE ACCENT| count: 4

|A^|ā̂|LATIN SMALL LETTER A WITH MACRON + COMBINING CIRCUMFLEX ACCENT| count: 3
|A\|ā̀|LATIN SMALL LETTER A WITH MACRON + COMBINING GRAVE ACCENT| count: 0

|e^|ê|LATIN SMALL LETTER E WITH CIRCUMFLEX| count: 326
|e\|è|LATIN SMALL LETTER E WITH GRAVE| count: 0

|o^|ô|LATIN SMALL LETTER O WITH CIRCUMFLEX| count: 226
|o\|ò|LATIN SMALL LETTER O WITH GRAVE| count: 0

[Probably some of these could be in non-<s> strings, like <etym> etc.]
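
A tally of this kind could be reproduced roughly as follows. How the counts above were actually produced is not stated in this thread, so this is only a sketch: it decomposes the text with NFD and counts letters carrying a combining circumflex or grave. The function name and the sample string are made up for illustration.

```python
import re
import unicodedata
from collections import Counter

MARKS = {"\u0302": "circumflex", "\u0300": "grave"}
# One ASCII letter followed by any combining diacritics (after NFD).
CLUSTER = re.compile(r"[A-Za-z][\u0300-\u036f]*")


def count_accented(text):
    """Count (letter, accent) pairs; the letter keeps macron or dot below."""
    counts = Counter()
    for cluster in CLUSTER.findall(unicodedata.normalize("NFD", text)):
        for mark, label in MARKS.items():
            if mark in cluster:
                letter = unicodedata.normalize("NFC", cluster.replace(mark, ""))
                counts[(letter, label)] += 1
    return counts


# Made-up sample: a-grave, a-circumflex, and a-macron + circumflex.
sample = "r\u0101jany\u00e0 \u1e5btavy\u00e2 \u0101\u0302"
for key, n in sorted(count_accented(sample).items()):
    print(key, n)
```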

gasyoun commented 3 years ago

Note: The Geneva Oriental Congress of 1894 is where IAST was born.

The IAST we use and the original 1894 scheme are not identical, though they are close.

funderburkjim commented 3 years ago

There were a few vowel-grave instances in mw_iast.txt; I've changed these to vowel-circumflex in local version. There are now about 200 instances of svarita accent (represented with vowel-circumflex) in mw_iast.txt, occurring in

metaline k2 field
As text within <s> tag.

The rest of the vowel-circumflex AB notes above occur in

<ls> tags -- these are inherent IAST (not converted to-from slp1); they are believed to be instances of MW's vowel-sandhi usage of circumflex.
<s1 slp1=".*?">[^<]*â (text within <s1> tags - again representing vowel-sandhi)
text within <etym> tags (small number; representation of other languages)
attributes within a few 'local' abbreviations.
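
A breakdown by context like the one above could be approximated with a regular expression over each line. The sketch below is an assumption about how one might do it, not the script actually used; it only looks at simple, attribute-free `<s>`, `<ls>` and `<etym>` elements, assumes NFC text, and the sample line is invented.

```python
import re
from collections import Counter

# Precomposed circumflex vowels (a, i, u, e, o), or a-macron / r-dot-below
# followed by a combining circumflex.
CIRCUMFLEX_VOWEL = re.compile(
    "[\u00e2\u00ee\u00fb\u00ea\u00f4]|[\u0101\u1e5b]\u0302"
)
TAGGED = re.compile(r"<(s|ls|etym)>(.*?)</\1>")


def circumflex_by_tag(line):
    """Count circumflex vowels per enclosing tag name."""
    counts = Counter()
    for tag, body in TAGGED.findall(line):
        counts[tag] += len(CIRCUMFLEX_VOWEL.findall(body))
    return counts


# Invented sample line, not taken from mw_iast.txt.
line = "<s>kany\u00e2</s> <ls>Abc\u00e2 d\u00ea</ls> plain \u00e2"
print(circumflex_by_tag(line))
```

Text outside the listed tags (the trailing word in the sample) is ignored by design.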

funderburkjim commented 3 years ago

The comments from Peter and Malcolm's book are helpful.

In addition to supporting the vowel-grave IAST representation of svarita accents in MW, notice that there is no distinctive IAST representation for anudatta accents.

Thus, if we apply that algorithm for representing Sanskrit in IAST to a text which has anudatta accent, then we cannot retrieve the original accented text from its IAST form. To my way of thinking, this is a weakness of IAST representation.

In the case of MW, we assume that there are no anudatta accents. With this assumption, we could construct the IAST version using grave-accent to represent Sanskrit svarita accent; and because there are no anudatta accents, reconstruct accurately the slp1 text of mw.txt from mw_iast.txt.
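
To make the loss concrete, here is a minimal sketch, assuming the scheme in the quoted passage (acute for udatta, grave for svarita, anudatta left unmarked). Only the vowel 'a' is handled, and the two SLP1 strings are made-up test inputs, not MW data.

```python
def to_roman_unmarked(slp1):
    """Toy transcoder for 'a' only: the anudatta mark is simply dropped."""
    return (
        slp1.replace("a/", "\u00e1")   # udatta  -> a with acute
            .replace("a^", "\u00e0")   # svarita -> a with grave
            .replace("a\\", "a")       # anudatta -> no mark at all
    )


with_anudatta = "a\\ya/"   # hypothetical form carrying an anudatta mark
without = "aya/"           # the same letters with no anudatta mark

# Both inputs produce the same output, so the original cannot be recovered.
print(to_roman_unmarked(with_anudatta) == to_roman_unmarked(without))
```

When a text contains no anudatta marks at all, as assumed for MW, this collapse never occurs and the mapping stays reversible.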

Given the paucity of instances (200+) of svarita accents under discussion in MW, the manner of IAST representation of svarita in mw_iast.txt matters little in practice.

If AB insists, I can change the transcoding of mw_iast.txt so that the slp1-svarita circumflexes are represented in mw_iast.txt by grave accents rather than the current circumflex accents.

Andhrabharati commented 3 years ago

I would be more than glad to see that happen. (And also update the alphabet_accent file.)

And probably we can think of "extending" the Cologne version of IAST by using the unicode character I've mentioned as above.

As far as I know, the anudātta is marked with ◌॒ (Unicode: U+0952).

This is what the printed books (that I've seen) use, though it is not in IAST.

Andhrabharati commented 3 years ago

In the absence of "normative standards", one can follow the "industry standards"!!

funderburkjim commented 3 years ago

I would be more than glad to see that happen

OK. Will aim for that.

Will do this after incorporating your next 'check the updated work all over again' step in #101.

Andhrabharati commented 3 years ago

I missed Jim's reference to Whitney's Grammar above.

I'd just like to say that he apparently "stopped" at the beginning of article 83; he should've gone a little further, to (a) in there!!

funderburkjim commented 3 years ago

Agree that Peter's description is consistent with Whitney. Whitney also doesn't mention anudAtta.

funderburkjim commented 3 years ago

revision of transcoding to IAST

alphabet_accent1.md has a suggested revision of slp1-iast transcoding. The differences from alphabet_accent.md are:

slp1 svarita accents are represented with 'grave accent diacritic' in IAST
slp1 anudatta accents are represented with 'combining low line' in IAST
- The unicode character U+0332 (combining low line) is similar visually to U+0952 (Devanagari Stress Sign Anudatta), but, being in the Unicode combining diacritics block (U+03xx), seemed a better choice for representing anudAtta in IAST in a reversible way.

Assuming we agree on this revised IAST correspondence to SLP1, I'll use it for the next revision of mwtranscode/mw_iast.txt.
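
The revised accent correspondence might be sketched as below. This is not the Cologne transcoder and does not reproduce alphabet_accent1.md: it handles only the three accent marks, assumes the letters are already in IAST, and the function names are hypothetical. The reverse direction works per letter cluster, because Unicode canonical ordering moves a below-mark such as U+0332 in front of an above-mark such as the macron.

```python
import re
import unicodedata

ACCENT_TO_MARK = {
    "/": "\u0301",   # udatta   -> COMBINING ACUTE ACCENT
    "^": "\u0300",   # svarita  -> COMBINING GRAVE ACCENT
    "\\": "\u0332",  # anudatta -> COMBINING LOW LINE
}
MARK_TO_ACCENT = {mark: acc for acc, mark in ACCENT_TO_MARK.items()}
CLUSTER = re.compile(r"([^\u0300-\u036f])([\u0300-\u036f]+)")


def accents_to_iast(text):
    """Swap SLP1 accent characters for combining marks, then compose."""
    for accent, mark in ACCENT_TO_MARK.items():
        text = text.replace(accent, mark)
    return unicodedata.normalize("NFC", text)


def accents_to_slp1(text):
    """Inverse: pull accent marks out of each cluster, append SLP1 accents."""
    def fix(match):
        base, marks = match.group(1), match.group(2)
        accents = "".join(MARK_TO_ACCENT[m] for m in marks if m in MARK_TO_ACCENT)
        others = "".join(m for m in marks if m not in MARK_TO_ACCENT)
        return unicodedata.normalize("NFC", base + others) + accents
    return CLUSTER.sub(fix, unicodedata.normalize("NFD", text))


for s in ("r\u0101janya^", "\u0101\\", "\u1e5b/"):
    iast = accents_to_iast(s)
    print(s, "->", iast, "->", accents_to_slp1(iast))
```

Because each accent has its own mark, the round trip returns the input unchanged for these examples.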

gasyoun commented 3 years ago

we agree on this revised IAST correspondence to SLP1

We sure do.

Andhrabharati commented 3 years ago

yes; U+0332 is a proper choice, as the point is about having the diac for Roman letters and not for Devanagari.

and this could be applied to all cologne digitisations, not just MW99.

Andhrabharati commented 3 years ago

@funderburkjim

Just noticed this towards the end of both the accent files you posted- |\||łh|LATIN SMALL LETTER L WITH STROKE + LATIN SMALL LETTER H|

is this so from the beginning? -if so, how is | present in the mw files?

or just modified now? -so the issue discussed in MWS #88 would/should be resolved by this. -and this would make it a Cologne version of slp1 now!!

I would now request you to consider changing L and this \ to indicate ḷ and ḷh (or l̥ and l̥h), instead of ł and łh.

Many transliteration tools are designed to produce ḷ or l̥, and any difference would require a special/extra effort to key in those characters, just for Cologne data.

We should travel the way many people do, unless there is a pressing need to do otherwise.

Andhrabharati commented 3 years ago

This small document talking about Vedic accents may be of some interest to go through once.

Vedic_accents_doc.pdf

Andhrabharati commented 3 years ago

Found another document exclusively talking about Skt. dictionaries and accents.

Rau2017_vedic-accent-in-lexicography.pdf

Does the present Sanskrit-lexicon team have any links with this Lazarus project?

And is there a way of accessing this github.com/sanskrit-lexicon/MWS/files/... folder? It looks like it might contain many interesting/informative documents like this. Or is that a "Private area"?

gasyoun commented 3 years ago

Does the present Sanskrit-lexicon team have any links with this Lazarus project?

Very minor, still there has been some contact in the past with @fxru

funderburkjim commented 3 years ago

In SLP1, 'L' represents consonant ळ (Unicode Devanagari LLA).

In SLP1, '|' represents conjunct consonant ळ्ह

The above are my understanding.

IAST representation of these are not found in https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration.

However, this source does mention another 'standard' ISO 15919. And mentions that l̥ is used in ISO 15919 to represent 'vocalic l' (= slp1 'x')

l̥ = LATIN SMALL LETTER l + 'COMBINING RING BELOW' (U+0325)

By contrast, IAST uses ḷ = LATIN SMALL LETTER L WITH DOT BELOW

And the same article mentions that ISO 15919 uses the same ḷ (= LATIN SMALL LETTER L WITH DOT BELOW) to represent Devanagari ळ.

At the start of our work with @Andhrabharati, it was decided to make an IAST version, mw_iast.txt, for him. And since it was necessary to be able to convert between the IAST version and the 'native' SLP1 version mw.txt, I had to develop some unambiguous code for the IAST representations of slp1 'L' and '|'.

I chose 'ł' and 'łh' to be IAST representations of slp1 'L' and '|'.

Our Cologne software handles the conversions.

Since there is no standard, these choices are as good as any.

If a standard ever emerges in the future, we can revise.
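
The reason for avoiding ḷ can be shown with a small check for symbols that share a target. This is only a sketch: the helper `find_collisions` and the two tables are assumptions for illustration, covering just the three SLP1 symbols discussed here, and the check looks at single symbols only, not at ambiguity between longer sequences.

```python
def find_collisions(mapping):
    """Return target strings that more than one source symbol maps to."""
    seen = {}
    for source, target in mapping.items():
        seen.setdefault(target, []).append(source)
    return {t: s for t, s in seen.items() if len(s) > 1}


# SLP1 'x' is vocalic l, 'L' is the consonant LLA, '|' is LLA + h.
# U+1E37 is l with dot below; U+0142 is l with stroke.
with_dot_below = {"x": "\u1e37", "L": "\u1e37", "|": "\u1e37h"}
with_stroke = {"x": "\u1e37", "L": "\u0142", "|": "\u0142h"}

print(find_collisions(with_dot_below))  # 'x' and 'L' share one target
print(find_collisions(with_stroke))     # no shared targets
```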

funderburkjim commented 3 years ago

github.com/sanskrit-lexicon/MWS/files/ folder

This url is not available.

It may be that when you drag a file into a comment, Github puts it into this url.

For example, in #83, @Andhrabharati dragged a file 'changes_0_Andhrabharati.txt' into a comment, and this file is now available (for download only) at the url: https://github.com/sanskrit-lexicon/MWS/files/5750131/changes_0_Andhrabharati.txt

So there is no 'files' directory per se. It's just a convention of Github. We are making use of no 'Private areas'. Everything out in the open for sanskrit-lexicon.

Andhrabharati commented 3 years ago

I guess this issue has received enough (and necessary) attention and discussion, and can be closed now.

sanskrit-lexicon / MWS

MW Accent #103
