Greek text remove lang tag

funderburkjim commented 2 years ago

remove lang tag ?

Almost every Greek text in inm.txt is (I think) either:

part of a star name
a reference

Example of reference. Under headword Arjuna, page 10b

Coding change suggested in inm.txt:

OLD
(together with others he attacks Arjuna); {@88@}, <lang n="greek">αγʹ</lang>, <lang n="greek">αδʹ</lang>
NEW
(together with others he attacks Arjuna); {@88@}, αγʹ, αδʹ

Example of usage in a star name. Under headword aBijit, page 1

Suggested coding change:

OLD
Aśvinī (star of junction Vega or <lang n="greek">α</lang> Lyræ; see Whitney to Sū°
NEW
Aśvinī (star of junction Vega or α Lyræ; see Whitney to Sū°

funderburkjim commented 2 years ago

Although there are over 10,000 instances of Greek text fragments, there are only 135 different texts. See freq_greek_csl.txt derived from the Cologne version csl-orig/v02/inm/inm.txt. And freq_greek_ab.txt derived from Andhrabharati's version.
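A count like the one described above could be produced along these lines. This is only a sketch, not the actual freq_greek.py: it assumes Greek fragments are identified by the <lang n="greek">X</lang> markup shown in the examples, and the function name and the last sample line are invented for illustration.

```python
import re
from collections import Counter

# Markup form taken from the examples in this thread.
GREEK_TAG = re.compile(r'<lang n="greek">(.*?)</lang>')

def greek_frequencies(lines):
    """Count each distinct Greek fragment found inside <lang n="greek"> tags."""
    counts = Counter()
    for line in lines:
        counts.update(GREEK_TAG.findall(line))
    return counts

sample = [
    '{@88@}, <lang n="greek">αγʹ</lang>, <lang n="greek">αδʹ</lang>',
    'Vega or <lang n="greek">α</lang> Lyræ',
    'see <lang n="greek">α</lang> again',  # invented line, for illustration only
]
freq = greek_frequencies(sample)
# Many instances, few distinct texts: 4 instances here, 3 distinct fragments.
for text, n in freq.most_common():
    print(text, n)
```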

The suggested change to inm.txt would be consistent with AB's version.

The only text which is NOT considered a label or star name and which would retain the <lang> tag markup, occurs under headword 'aSvaka', page 8:

137, 142 (probably the {%<lang n="greek">Ἀσσακηνοί</lang>%} of the Greeks, in eastern
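The suggested change, with this one exception, might be sketched as follows. The function name and the `keep` parameter are hypothetical; the actual edit to inm.txt may have been made differently.

```python
import re

GREEK_TAG = re.compile(r'<lang n="greek">(.*?)</lang>')

# The one fragment that is not a label or star name (headword aSvaka).
ASSAKENOI = "Ἀσσακηνοί"
KEEP = frozenset({ASSAKENOI})

def strip_label_tags(line, keep=KEEP):
    """Remove <lang n="greek"> markup, except around fragments listed in keep."""
    def repl(match):
        text = match.group(1)
        # Keep the whole tagged span for exceptions, otherwise keep only the text.
        return match.group(0) if text in keep else text
    return GREEK_TAG.sub(repl, line)

old = ('(together with others he attacks Arjuna); {@88@}, '
       '<lang n="greek">αγʹ</lang>, <lang n="greek">αδʹ</lang>')
new = strip_label_tags(old)

kept_line = f'(probably the {{%<lang n="greek">{ASSAKENOI}</lang>%}} of the Greeks'
unchanged = strip_label_tags(kept_line)
```

Passing the exceptions as a set keeps the rule explicit, so further fragments could retain their markup without changing the function.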

Andhrabharati commented 2 years ago

Certainly this 'removing of tags in references' would be the proper way to do, @funderburkjim!

Also see my post in PWG, for a related point- https://github.com/sanskrit-lexicon/PWG/issues/56#issuecomment-1138110232

Andhrabharati commented 2 years ago

See freq_greek_csl.txt derived from the Cologne version csl-orig/v02/inm/inm.txt. And freq_greek_ab.txt derived from Andhrabharati's version.

Had a look at these two files; seems my file has more counts than csl, in 27 places--

α 1594, β 896, γ 850, γγ 60, δ 555, δδδ 20, ζ 492, ζζ 57, ζζζ 19, η 588, ηʹ 9, θ 597, θθ 49, ι 259, ιι 46, κ 335, λ 363, μ 373, ν 345, ξ 189, ο 254, π 168, ρ 163, σ 116, χ 153, χχχ 8, ϕ 94

Any clue for the differences?

And my 'posted' INM file has Ἀσσακηνοί (which got duly copied to csl, I guess), not σσακηνοί as mentioned in your freq file.

funderburkjim commented 2 years ago

lang markup removed in csl version of inm.txt.

funderburkjim commented 2 years ago

The case with Ἀσσακηνοί .

Ἀ is unicode code point \u1f08.
In the freq_greek.py program, I had used only the unicode range \u0370 - \u03ff for Greek.
However, from this reference, I learn that \u1f00-\u1fff should also be included.

The reason that Ἀσσακηνοί appeared in the csl frequency list is that greek was identified by the xml markup <lang n="greek">X</lang>, rather than by unicode ranges.
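The effect of the two ranges can be reproduced with a small sketch. The regular expressions below are an assumption about how range-based detection might be written; they are not taken from freq_greek.py.

```python
import re

# Greek and Coptic block only: the range used originally.
GREEK_BASIC = re.compile(r"[\u0370-\u03ff]+")
# Same block plus Greek Extended, which holds polytonic letters such as U+1F08.
GREEK_FULL = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]+")

# Ἀσσακηνοί, spelled with escapes so the code points are unambiguous.
word = "\u1f08\u03c3\u03c3\u03b1\u03ba\u03b7\u03bd\u03bf\u03af"

basic_hits = GREEK_BASIC.findall(word)  # initial U+1F08 is missed
full_hits = GREEK_FULL.findall(word)    # whole word is matched

print(basic_hits)
print(full_hits)
```

With the narrow range the leading letter falls outside the character class, so the match starts at the second letter, which is how the truncated form ended up in the frequency file.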

The program was modified and the frequency count files recomputed. Only 1 difference in the freq_ab file:

$ diff freq_greek_ab.txt tempprev_freq_greek_ab.txt
108a109
> σσακηνοί 1
135d135
< Ἀσσακηνοί 1

funderburkjim commented 2 years ago

Any clue for the differences?

In brief, the differences are due to the inclusion of Addenda material in the AB version.

Recall from #4 that Greek text was added to csl-orig/v02/inm/inm.txt based on Andhrabharati's work, and that a few modifications to AB's work were made to facilitate the additions. The 'final' revision of AB's work is in inm_slp1_L2_02.txt.

The compare_greek.txt file shows a further analysis of differences between the csl and ab greek text occurring in inm dictionary.

difference in entries

Because of the addenda and corrections, the csl and ab versions have different entries. The csl version has 12647 entries, and the AB version has 13485 entries. There are 844 AB entries whose L-numbers have a '.' - all of these are additions to CSL entries. There are 6 CSL entries which are dropped in AB. When we exclude these in the respective dictionaries, we are left with (13485 - 844) = 12641 in AB and (12647 - 6) = 12641 in CSL. AND these 12641 entries correspond with respect to L-numbers.
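The reconciliation above can be sketched with sets of L-numbers. This assumes L-numbers are handled as strings and that an AB addition is any L-number containing a '.'; the function name and the toy data are invented for illustration.

```python
def reconcile_l_numbers(csl_ids, ab_ids):
    """Split L-numbers into common entries, AB additions, and CSL entries dropped in AB."""
    csl, ab = set(csl_ids), set(ab_ids)
    ab_additions = {n for n in ab if "." in n}  # AB-only entries from the addenda
    csl_dropped = csl - ab                      # CSL entries with no AB counterpart
    common = csl & ab
    return common, ab_additions, csl_dropped

# Toy data: entry 3 is dropped in AB, entry 2.1 is an AB addition.
csl_ids = ["1", "2", "3", "4"]
ab_ids = ["1", "2", "2.1", "4"]
common, added, dropped = reconcile_l_numbers(csl_ids, ab_ids)

# The totals reported in this comment agree with each other.
assert 13485 - 844 == 12647 - 6 == 12641
```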

Now, what compare_greek text does is to compare the sequences of Greek text instances in each of these common 12641 entries in the two versions. We find that there are 53 entries with differences in the Greek text instances.

Recall a feature of the modified AB version mentioned above. There is a markup <gx>X</gx> used to indicate greek text that was part of an insertion of ADDENDA text. There are 49 instances. Note that this is ALMOST the same as the number (53) of entries with differences. The first 2 differences from compare_greek are seen to be explained by a gx.

A variation compare_greek1.txt was prepared to help distinguish which of the 53 are explained by gx. A visual examination shows that, based on counting, the entries with gx items are explained by the gx items.
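A per-entry check of this kind might look like the sketch below. It is not the program that produced compare_greek1.txt. It assumes Greek is detected by Unicode range in both versions, and that a difference counts as "explained by gx" when dropping the <gx> items from the AB entry yields the CSL sequence. The function names and sample strings are invented.

```python
import re

GREEK = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]+")
GX = re.compile(r"<gx>(.*?)</gx>")

def greek_seq(text):
    """Return the Greek fragments of an entry, in order of appearance."""
    return GREEK.findall(text)

def compare_entry(csl_text, ab_text):
    """Classify one common entry as 'same', 'explained by gx', or 'unexplained'."""
    csl_seq = greek_seq(csl_text)
    if csl_seq == greek_seq(ab_text):
        return "same"
    # Drop the addenda insertions from AB and compare again.
    ab_without_gx = greek_seq(GX.sub("", ab_text))
    return "explained by gx" if ab_without_gx == csl_seq else "unexplained"

csl_entry = "Vega or \u03b1 Lyr"
ab_with_gx = "Vega or \u03b1 Lyr; add. <gx>\u03b2</gx>"
ab_other = "Vega or \u03b1 and \u03b3 Lyr"

results = [
    compare_entry(csl_entry, csl_entry),
    compare_entry(csl_entry, ab_with_gx),
    compare_entry(csl_entry, ab_other),
]
```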

the 9 explained

But there are 9 entries with differences where there is no gx entry in the ab version. (ab gx text: []) The differences in greek texts are explained variously:

homonym split in ab:
L=1081 bAhu
L=1176 balavat
L=1440 BayaNkara
L=8152 pArijAta
L=9478 sANKya
L=10996 tryambaka
L=11588 vareRya

other split in ab
L=4092 gOtamI

AB updated from Addendum is the probable explanation, though there are too many Greek texts to readily check:

L=1510 BIzmavaDaparvan

The '.' L for ab version

The Greek text in these 'extra' L-codes of AB IS included in the freq statistics for ab version. So this is another source of differences.

This concludes my review of the differences.

Andhrabharati commented 2 years ago

https://github.com/sanskrit-lexicon/INM/issues/4#issuecomment-991433302

One 'temporary' detail is the presence of several <gx>X</gx> items in the L2 version. These mark Greek texts originating from the additions-corrections section of INM. Since I was not aiming to include those in the present work, it was necessary to identify them for alignment of other greek texts.

Because of the addenda and corrections, the csl and ab versions have different entries.

@funderburkjim, No intention (still) to add the addendum greek entries to INM, as was done in BEN recently?

funderburkjim commented 2 years ago

addendum greek

Currently, in the csl version, there is a separate file inm_ac.txt containing digitization of 'ADDITIONS AND CORRECTIONS'.

In this file, there are 97 <lang n="greek"></lang> instances. One of these is in the CONCORDANCE subsection, and the rest are in the 'INDEX TO THE NAMES' section(s).

It would be good to fill in these greek instances in inm_ac.txt.

@Andhrabharati Perhaps you would download inm_ac.txt and fill in the Greek and return to me?

Similarly, there are 6 Greek text fragments in inm_concord.txt. And these should also be filled.

Similarly, there are 7 Greek text fragments in inm_preface.txt. And these should also be filled.

If you are so kind as to do this, you could also remove the <lang n="greek"> and </lang> markup for cases which are labels.

Once this is done, the Greek text labels in INM will be completed in the basic csl digitization.

Andhrabharati commented 2 years ago

OK, will do as asked. [Looks like you're exhausted with looking at my file(s)!]

Andhrabharati commented 2 years ago

Here are the three files filled with Greek letters; the punctuation etc. is untouched (which could be done consistent with the main text style). inm_ac (AB).txt inm_concord (AB).txt inm_preface (AB).txt

funderburkjim commented 2 years ago

@Andhrabharati Thank you! Your files now installed in csl-orig/v02/inm/.

Andhrabharati commented 2 years ago

@Andhrabharati Thank you! Your files now installed in csl-orig/v02/inm/.

It appears that a file submission without much formatting changes is far-far easier to handle for you; but I rarely limit myself to that!

funderburkjim commented 2 years ago

Yes, that is true.

Re the {%E.g%}. mistake -- any others I should look for?

Andhrabharati commented 2 years ago

I had randomly spotted it on a quick look; let's leave the rest for now.
