Misc. corrections - Githubissues

funderburkjim commented 2 years ago

Review starts with 'BUR corrections.txt' prepared by @Andhrabharati (ref.).

Some of these I'm not sure of. Will ask Odile.

funderburkjim commented 2 years ago

first changes

Work preparing the changes is done in the issue3 directory. The changes made thus far (see change_1.txt):

(1) 19165 A'ler Aller
(94) ạ. a2. ; misinterpretation as LN (AS) encoding
- 94 instances include ạ and Ạ.
- these are changed to <ab>a2.</ab> and <ab>A2.</ab>, abbreviations in Burnouf for the 2nd aorist.
Ạ. A2. ; misinterpretation as LN (AS) encoding
(22) (s) ḥ s ; noticed some (possibly, more) marked as plain 's'
- the parentheses removed, per agreement with text
- These occur in Sanskrit words, and some are debatable, for instance sukhaduskānām (see change_1.txt).

** || not changed: the suggested change `|| ‖ ; it is 'double' bar, not two bars!` is not made.
The suggested replacement is the 'double vertical line' (\u2016). The two-vertical-line markup is common in the printed text of Burnouf:

This '||' in bur.txt gets converted to a 'div' in make_xml.py: x = x.replace('||','<div n="3">'), which in turn generates a line break with indentation in the html of displays:

I see no reason to change the way this feature of the print is represented in bur.txt.
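For reference, a minimal sketch of the conversion described above. Only the `replace` call is quoted from make_xml.py; the function name `mark_divisions` and the sample line are hypothetical.

```python
def mark_divisions(line: str) -> str:
    """Replace each two-bar marker with the div tag quoted in this thread."""
    return line.replace('||', '<div n="3">')


# Hypothetical sample line, not taken from bur.txt.
sample = 'premier sens || second sens'
print(mark_divisions(sample))

# A line using U+2016 would pass through unchanged, because the
# replacement only looks for two ASCII vertical bars.
print(mark_divisions('premier sens \u2016 second sens'))
```

Under this assumption, switching bur.txt to U+2016 would also require a matching change in the conversion step.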

funderburkjim commented 2 years ago

numbers with superscript text

The printed text typically uses superscripted-letters following digits to represent ordinal numbers (English 1st or first, 2nd or second, etc.).
In the digitization, these are currently represented by non-superscript-letters.
@Andhrabharati suggests changing to Unicode superscripted letters, at least in the following:

*   ([0-9])o    \1॰
  * should the trailing o be a trailing 'e'  (print change)
*   1re 1ʳᵉ
*   1er 1ᵉʳ
*   ([0-9])e    \1ᵉ
*   8c  8ᵉ   NOTE: the 'c' is a typo, it was changed to 8e in change_1.txt above
*   2d  2ᵈ

In examining with Google Translate, I found that the non-superscript-letters work better than the superscript letters. There is also a separate question regarding the following 'o', which maybe should be instead a following 'e'.
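If the superscript forms were adopted, the substitutions listed above could be applied roughly as follows. This is only a sketch: the trailing-'o' rule is left out because its target is still an open question, the word boundaries are my own assumption, and `superscript_ordinals` is a hypothetical name.

```python
import re

# Specific forms are listed before the generic digit + 'e' rule.
# The trailing \b keeps the generic rule from touching '1er'.
RULES = [
    (re.compile(r'\b1re\b'), '1ʳᵉ'),
    (re.compile(r'\b1er\b'), '1ᵉʳ'),
    (re.compile(r'\b2d\b'), '2ᵈ'),
    (re.compile(r'([0-9])e\b'), r'\1ᵉ'),
]


def superscript_ordinals(text: str) -> str:
    """Apply each substitution in order and return the rewritten text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(superscript_ordinals('1re 1er 2e 2d'))
```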

ae, AE

The printed text typically uses the unicode æ (Latin Small Letter Ae) and its capital letter form (Ref: https://en.wikipedia.org/wiki/%C3%86).

*   ae  æ   ; scope is non-Sanskrit words, wherein it is to be 'ai'.
*   AE  Æ   ; scope is non-Sanskrit words

prep2.txt

I am uncertain whether the above mentioned changes should be made.
I think the decision should be based on current French language usage.

The prep2.txt file provides instances of these from the bur.txt digitization.

I am soliciting Odile's opinion on this.

Andhrabharati commented 2 years ago

> (22) (s) ~ḥ~ s ; noticed some (possibly, more) marked as plain 's'
>
> * the parentheses removed, per agreement with text
>
> * These occur in Sanskrit words, and some are debatable, for instance `sukhaduskānām` (see change_1.txt).

Look at this from p.2, to see the difference between the ’s’ [स -sa] and the '(s)' [ः -visarga] that I was talking about--

It is not सुखदुस्कानाम् sukhaduskānām, but सुखदुःखानाम् sukhaduḥkhānām that is being quoted here. [There is a print error (missing aspirant (h) mark after k, the modifying reversed comma (ʽ) u+02BD), making it kʽ at duḥkhā; this mark can be seen at the word beginning - sukʽa (sukha), though.]

Also note the s and (s) marked at the same citation--

Andhrabharati commented 2 years ago

And I had looked carefully at all the places, before putting my observations.

Andhrabharati commented 2 years ago

This is about the double bar marking (p.3)--

It is, by no means, two vertical bars!!

Andhrabharati commented 2 years ago

Now I guess, the need to look at the front matter and understand the author's mind would be appreciated by the CDSL team!!

Andhrabharati commented 2 years ago

And @funderburkjim you might consider taking other language strings (etym.) as well, like you did for greek strings.

I was suggesting this earlier, and also posted my full file a few days back. https://github.com/sanskrit-lexicon/csl-devanagari/issues/37#issuecomment-1115119959

funderburkjim commented 2 years ago

(s) -> visarga

Corrected the 21 lines where I had erroneously changed '(s)' to 's'; these are now changed to the IAST visarga 'ḥ' instead. See change_1a.txt.

Due to the subtlety of the print difference between Burnouf's visarga and the 's', there are no doubt still cases in bur.txt where an 's' should be replaced by ḥ. Will leave it to @Andhrabharati to find these cases and communicate them to me in an easy to use way.

La double bar verticale is equally well represented by either form. No change needed in bur.txt.

Andhrabharati commented 2 years ago

first things first: is my print error suggestion sukhaduḥkhānām at <L>11210 not convincing enough?

funderburkjim commented 2 years ago

Sure - that is convincing enough. If you provide a list including other cases like sukhaduḥkhānām at <L>11210, I'll make the changes in bur.txt.

Andhrabharati commented 2 years ago

I see 23 places with (s) marks in the csl-orig text, and all are attended to now, though Jim mentioned 22 cases above.

I see the 'ḥ' occurrences in my posted full file and wish Jim would look at my file for these now, if not for other etym. strings!!

funderburkjim commented 2 years ago

I can harvest the ḥ from bur.AB.ver.-v2.txt file mentioned above.
regarding 'other etym. strings' -- Are you talking about the text after <ab>germ.</ab> , <ab>lat.</ab>, etc in bur.txt? Are you saying that you corrected some of these? If so, do you have a list of the ones that were corrected?

Andhrabharati commented 2 years ago

I also wish that Jim looks at my abbr. list posted few months back- https://github.com/sanskrit-lexicon/csl-devanagari/issues/37#issuecomment-1030889096

It still has more abbr.s marked than in the latest csl-orig bur.txt.

Andhrabharati commented 2 years ago

Yes; I remember correcting some (if not all) of those etym. strings & I did not make any separate list (as done specifically for greek strings against your proposal earlier) of those. https://github.com/sanskrit-lexicon/csl-devanagari/issues/37#issuecomment-1026042472

As I am busy with other work, I am not in a mood to spend time on listing these now.

gasyoun commented 2 years ago

> It still has more abbr.s marked than in the latest csl-orig bur.txt.

Thanks for all those reminders as well.

Andhrabharati commented 2 years ago

> ; noticed some (possibly, more) marked as plain 's'

> Will leave it to @Andhrabharati to find these cases and communicate them to me in an easy to use way.

Just thought of searching for the additional 'ḥ' places through regex, and got 35 more (on a quick search).

Also replaced the 20 '(s)' cases remaining in my file [probably I did not save my work after correcting them those days.].

Now, the total count is 112, as against the original csl file (as of Oct 2021) count of 29.

And, incidentally many of the new finds are of 'duḥkha' or its inflectional forms (some with the same print error of missing aspirant mark).

Here is my latest file-- bur (AB ver.) -v3.txt

Andhrabharati commented 2 years ago

> It still has more abbr.s marked than in the latest csl-orig bur.txt.

> Thanks for all those reminders as well.

It's in my own interest (!!); if my work is not used/useful, what is the point in posting the files here? [So, occasional reminders (direct or indirect) are being made.]

Andhrabharati commented 2 years ago

One final point, before I move out from this thread--

Jim might think of making the metalines and filling in the devanagari strings for the new HWs, marked in my file with +++. [A simple and quick task, taking practically no time.]

They are considerable in count, nearly 15k of them, and would make BUR the next work at CDSL (apart from MW99) to have composite words elevated to HW-level.

funderburkjim commented 2 years ago

Visarga corrections installed.

Corrected visargas in bur.txt, based on 'bur (AB ver.) -v3.txt'.

Includes 4 suggested revisions to the v3 file: v3changes.

Now 115 instances of visarga ḥ.

Andhrabharati commented 2 years ago

Looked for 'dusk':
<L>15378-- {%sukhaṃ duskaṃ vā%} > {%sukhaṃ duḥkhaṃ vā%}
<L>19339-- {%smariṣyati kaucalyāṃ suduskitām%} > {%smariṣyati kauśalyāṃ suduḥkhitām%}
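A search of this kind can be sketched in a few lines. This is not the exact search that was run: it assumes that Sanskrit citations sit inside {%...%} spans, as in the two lines quoted here, and `find_dusk` is a hypothetical helper name.

```python
import re

# Non-greedy, so that two {%...%} spans on one line are kept apart.
ITALIC = re.compile(r'\{%(.*?)%\}')


def find_dusk(lines):
    """Return (line number, span text) for every {%...%} span containing 'dusk'."""
    hits = []
    for lnum, line in enumerate(lines, start=1):
        for match in ITALIC.finditer(line):
            if 'dusk' in match.group(1):
                hits.append((lnum, match.group(1)))
    return hits


sample = [
    '{%sukhaṃ duskaṃ vā%}',
    '{%smariṣyati kaucalyāṃ suduskitām%}',
    'texte sans citation',
]
for lnum, text in find_dusk(sample):
    print(lnum, text)
```

Each hit still needs a check against the print, since the search only flags candidates.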

funderburkjim commented 2 years ago

finish abdata.txt changes

Received reply from Odile (see odile.txt). Basically she concurred with the changes suggested by @Andhrabharati. And these have been installed into bur.txt. See change_3.txt.

There were a few cases where the æ text was reverted to 'ae', in agreement with the printed text; @Andhrabharati may want to revise in his -v3 version. Also, the two 'dusk' changes mentioned by AB above were implemented. In one case ('Naerrita' under brahma) , the diphthong form Nærrita was retained, in disagreement with Odile -- I think the printed text confirms.

One item marked as print change (Gærî -> Gaorî) under vfzAkapi headword.

My next step will be to examine the extra <ab> markup of -v3.

gasyoun commented 2 years ago

> My next step will be to examine the extra <ab> markup of -v3.

Hope you do not get stuck in BUR for too long. Not many Frenchmen among us ))

Andhrabharati commented 2 years ago

Glad that my corrections are taken into consideration.

It appears that Jaena is corrected, as suggested by Odile.

I would just like to mention that 'æ' is Burnouf's transcription for 'ai' in Sanskrit, and is used thus at all places.

I had mentioned it in my correction file rather NOT SO clearly [ae >æ; scope is non-Sanskrit words, wherein it is to be 'ai'], but seems it got skipped.

Using 'ai' (Sanskrit) as 'aï' (Prakrit) at these places is uncalled for!

See how it would look if done so: जैन vs. जइन.

> In one case ('Naerrita' under brahma) , the diphthong form Nærrita was retained, in disagreement with Odile

This being a Sanskrit word (नैर्ऋत), it comes under my Skt. list and needs to be made Nairṛita.

> One item marked as print change (Gærî -> Gaorî) under vfzAkapi headword.

Consistent with the practice of cdsl text transliteration, Gaorî should more appropriately be made 'Gaurī' ['au' = औ ; 'ao' being Burnouf's transcription].

funderburkjim commented 2 years ago

Nærrita

Agree with you. Changing to use the s1 markup:

OLD
<P>{%maṇimaṇḍapa%} <ab>m.</ab> <ab>mms.</ab> || Le palais de Nærrita, régent du sud-ouest
NEW
<P>{%maṇimaṇḍapa%} <ab>m.</ab> <ab>mms.</ab> || Le palais de <s1 slp1="nErfta">Nairṛita</s1>, régent du sud-ouest

funderburkjim commented 2 years ago

Gaorî

Here is the printed text:

I think this is to be regarded as a Sanskrit proper name (sanskrit word 'gErI' (slp1)). And thus rendered as below

<P>{%vṛṣākapāyī%} <ab>f.</ab> l'épouse d'<s1 slp1="agni">Agni</s1>; <s1 slp1="lakzmI">Lakṣmī</s1>; <s1 slp1="gErI">Gairī</s1>;

@Andhrabharati agree?

Andhrabharati commented 2 years ago

No, it has to be Gaurī only, taking it as a print error.

See the word वृषाकपायी in PWG, wherein the meanings include श्री (Lakshmī), गौरी and शची.

funderburkjim commented 2 years ago

Jaïnas

I now think this should be represented with the 's1' tag similar to above. For example, under headword kArtavIrya:

OLD
<ab>np.</ab> d'un roi symbolique chez les Jaïnas.
NEW:
<ab>np.</ab> d'un roi symbolique chez les <s1 slp1="jEna">Jaina</s1>s.

@Andhrabharati Agree?

Andhrabharati commented 2 years ago

Yes, and it should be done for all occurrences of Jaena and jaena.

funderburkjim commented 2 years ago

gOrI

Got it. Final form:

<P>{%vṛṣākapāyī%} <ab>f.</ab> l'épouse d'<s1 slp1="agni">Agni</s1>; 
<s1 slp1="lakzmI">Lakṣmī</s1>; <s1 slp1="gOrI">Gaurī</s1>;

funderburkjim commented 2 years ago

Jainas

Changed Jainas to <s1 slp1="jEna">Jaina</s1>s in 13 instances

Changed jæna to <s1 slp1="jEna">jaina</s1> 10 instances
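The substitution has roughly the following shape. This is a sketch rather than the script actually used: it takes the 'Jaïnas' spelling from the kArtavIrya example above as the source form, and `tag_jaina` is a hypothetical name.

```python
def tag_jaina(line: str) -> str:
    """Wrap the stem in s1 markup and keep the French plural 's' outside the tag."""
    return line.replace('Jaïnas', '<s1 slp1="jEna">Jaina</s1>s')


old = "<ab>np.</ab> d'un roi symbolique chez les Jaïnas."
print(tag_jaina(old))
```

Keeping the plural 's' outside the tag leaves the tagged text equal to the Sanskrit stem given in the slp1 attribute.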

funderburkjim commented 2 years ago

abbreviation markup enhancement.

Add <ab> markup to bur.txt, mined from Andhrabharati's 'v3' version of burnouf dictionary, and a couple of suggestions from Odile.

prep5_bur4.txt compares the <ab> markup present in the two sources, before resolution of differences.
- the first number shows the count in bur.txt of a given abbreviation
- the second number shows the count in v3.txt
- the counts are marked either as '==' (the same) or '!=' (not equal, different).
prep5_bur5.txt shows the counts AFTER the resolution of differences.
change_5.txt shows the changes to bur.txt. 2777 lines of bur.txt were changed.
change_v3_edit1_edit2.txt shows changes made to v3. 100 lines of v3 were changed.
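The comparison described above can be sketched as follows. The tab-separated layout and the function names are assumptions; the actual prep5 files may be formatted differently.

```python
import re
from collections import Counter

AB = re.compile(r'<ab>(.*?)</ab>')


def count_abbreviations(lines):
    """Count how often each <ab>...</ab> abbreviation occurs in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(AB.findall(line))
    return counts


def compare(bur_lines, v3_lines):
    """Return one report row per abbreviation: bur count, '==' or '!=', v3 count."""
    bur = count_abbreviations(bur_lines)
    v3 = count_abbreviations(v3_lines)
    report = []
    for ab in sorted(set(bur) | set(v3)):
        flag = '==' if bur[ab] == v3[ab] else '!='
        report.append(f'{ab}\t{bur[ab]}\t{flag}\t{v3[ab]}')
    return report


# Hypothetical input lines, not taken from either file.
for row in compare(['<ab>m.</ab> <ab>f.</ab>'], ['<ab>m.</ab>']):
    print(row)
```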

Note to @Andhrabharati within the change_v3 file are 23 lines with [[xxx]]. These were cases where some text was discovered to be missing. The missing text was always after <lbinfo/> markup in bur.txt. There may be other 'missing text' not caught by this <ab> analysis.

next step

Abbreviation 'tooltips' need to be developed. Currently, 159 abbreviations have tooltips, and there are 287 abbreviations shown in prep5_bur5.txt. So 128 abbreviations need tooltips.

Andhrabharati commented 2 years ago

> Note to @Andhrabharati within the change_v3 file are 23 lines with [[xxx]]. These were cases where some text was discovered to be missing. The missing text was always after <lbinfo/> markup in bur.txt. There may be other 'missing text' not caught by this <ab> analysis.

Thanks for identifying the missing cases ["was always after <lbinfo/> markup"].

Seen that it is due to my 'hasty' regex error [had used <lb(.*)/>: 9407, instead of <lb(.*?)/>: 9563 ] while removing the tags; there are 156 such lines in the csl-orig file and all were missed from my first version itself. [I was not expecting two line-breaks in a single line!!]
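To illustrate the difference between the two patterns, here is a small sketch on a made-up line with two tags. The tag spellings follow this thread; the surrounding text is hypothetical.

```python
import re

# Hypothetical line with two line-break style tags.
line = 'abc<lb/>def<lbinfo/>ghi'

GREEDY = re.compile(r'<lb(.*)/>')   # .* runs on to the LAST '/>' in the line
LAZY = re.compile(r'<lb(.*?)/>')    # .*? stops at the FIRST '/>' after each '<lb'

greedy_result = GREEDY.sub('', line)
lazy_result = LAZY.sub('', line)
greedy_count = len(GREEDY.findall(line))
lazy_count = len(LAZY.findall(line))

print(greedy_result, greedy_count)
print(lazy_result, lazy_count)
```

On this line the greedy pattern finds one match and removes 'def' together with both tags, while the lazy pattern finds two matches and keeps the text between them. That is the mechanism behind the lower match count and the missing text after the second tag.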

drdhaval2785 commented 1 year ago

Time to close, @funderburkjim ?

sanskrit-lexicon / BUR

Misc. corrections #3
