Closed funderburkjim closed 2 years ago
Although there are over 10,000 instances of Greek text fragments, there are only 135 different texts. See freq_greek_csl.txt, derived from the Cologne version csl-orig/v02/inm/inm.txt, and freq_greek_ab.txt, derived from Andhrabharati's version.
The suggested change to inm.txt would be consistent with AB's version.
The only text which is NOT considered a label or star name, and which would retain the <lang> tag markup, occurs under headword 'aSvaka', page 8:
137, 142 (probably the {%<lang n="greek">Ἀσσακηνοί</lang>%} of the Greeks, in eastern
Certainly this 'removing of tags in references' would be the proper thing to do, @funderburkjim!
Also see my post in PWG for a related point: https://github.com/sanskrit-lexicon/PWG/issues/56#issuecomment-1138110232
See freq_greek_csl.txt derived from the Cologne version csl-orig/v02/inm/inm.txt. And freq_greek_ab.txt derived from Andhrabharati's version.
Had a look at these two files; my file seems to have higher counts than csl in 27 places:
α 1594 β 896 γ 850 γγ 60 δ 555 δδδ 20 ζ 492 ζζ 57 ζζζ 19 η 588 ηʹ 9 θ 597 θθ 49 ι 259 ιι 46 κ 335 λ 363 μ 373 ν 345 ξ 189 ο 254 π 168 ρ 163 σ 116 χ 153 χχχ 8 ϕ 94*
Any clue for the differences?
And my 'posted' INM file has Ἀσσακηνοί (which got duly copied to csl, I guess), not σσακηνοί as mentioned in your freq file.
lang markup removed in csl version of inm.txt.
The case of Ἀσσακηνοί:
Ἀ is unicode code point \u1f08.
In the freq_greek.py program, I had used only the Unicode range \u0370 - \u03ff for Greek.
However, from this reference, I learn that \u1f00-\u1fff should also be included.
The reason that Ἀσσακηνοί appeared in the csl frequency list is that Greek was identified by the XML markup <lang n="greek">X</lang>, rather than by Unicode ranges.
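The effect of the two Unicode ranges can be reproduced with a small sketch. This is not the actual freq_greek.py code; the names GREEK_BASIC, GREEK_FULL and greek_freq are hypothetical, and the sample line is the aSvaka fragment with its markup removed.

```python
import re
from collections import Counter

# Greek and Coptic block only (the range originally used).
GREEK_BASIC = re.compile(r'[\u0370-\u03ff]+')
# Greek and Coptic plus Greek Extended (polytonic letters such as U+1F08).
GREEK_FULL = re.compile(r'[\u0370-\u03ff\u1f00-\u1fff]+')

def greek_freq(lines, pattern=GREEK_FULL):
    """Count each maximal run of Greek characters found in the lines."""
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line))
    return counts

# The word from the aSvaka entry, spelled with escapes so the code points
# are unambiguous: U+1F08 followed by letters from the basic block.
WORD = '\u1f08\u03c3\u03c3\u03b1\u03ba\u03b7\u03bd\u03bf\u03af'
sample = ['137, 142 (probably the ' + WORD + ' of the Greeks, in eastern']

narrow = greek_freq(sample, GREEK_BASIC)  # initial capital is dropped
full = greek_freq(sample, GREEK_FULL)     # whole word is kept
```

With the narrow pattern the initial capital falls outside the character class, so the counted text starts at the first sigma; with the wider pattern the word is counted whole.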
The program was modified and the frequency count files recomputed. Only 1 difference in the freq_ab file:
$ diff freq_greek_ab.txt tempprev_freq_greek_ab.txt
108a109
> σσακηνοί 1
135d135
< Ἀσσακηνοί 1
Any clue for the differences?
In brief, the differences are due to the inclusion of Addenda material in the AB version.
Recall from #4 that Greek text was added to csl-orig/v02/inm/inm.txt based on Andhrabharati's work, and that a few modifications to AB's work were made to facilitate the additions. The 'final' revision of AB's work is in inm_slp1_L2_02.txt.
The compare_greek.txt file shows a further analysis of differences between the csl and ab greek text occurring in inm dictionary.
Because of the addenda and corrections, the csl and ab versions have different entries. The csl version has 12647 entries, and the AB version has 13485 entries. There are 844 AB entries whose L-numbers have a '.' - all of these are additions to CSL entries. There are 6 CSL entries which are dropped in AB. When we exclude these in the respective dictionaries, we are left with (13485 - 844) = 12641 in AB and (12647 - 6) = 12641 in CSL. AND these 12641 entries correspond with respect to L-numbers.
Now, what compare_greek.txt does is compare the sequences of Greek text instances in each of these 12641 common entries in the two versions. We find that there are 53 entries with differences in the Greek text instances.
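A minimal sketch of this kind of comparison, assuming each version has already been parsed into a dict from L-number to entry text. That layout, the function names, and the toy entries are all hypothetical; the program that produced compare_greek.txt may work differently.

```python
import re

GREEK = re.compile(r'[\u0370-\u03ff\u1f00-\u1fff]+')

def greek_sequence(text):
    """Return the Greek text instances of an entry, in order."""
    return GREEK.findall(text)

def compare_entries(csl, ab):
    """Compare Greek sequences for the L-numbers present in both versions."""
    common = [L for L in csl if L in ab]
    diffs = []
    for L in common:
        a, b = greek_sequence(csl[L]), greek_sequence(ab[L])
        if a != b:
            diffs.append((L, a, b))
    return common, diffs

# Toy data: '2.1' stands for an AB addition, '3' for a CSL entry dropped in AB.
csl = {'1': 'x α y', '2': 'β', '3': 'no greek here'}
ab = {'1': 'x α y', '2': 'β γ', '2.1': 'δ'}
common, diffs = compare_entries(csl, ab)

# The entry counts quoted above are consistent with each other.
assert 13485 - 844 == 12647 - 6 == 12641
```

In the toy data only entry '2' is reported, because AB has one more Greek instance there than CSL.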
Recall a feature of the modified AB version mentioned above. There is a markup <gx>X</gx> used to indicate Greek text that was part of an insertion of ADDENDA text. There are 49 instances. Note that this is ALMOST the same as the number (53) of entries with differences.
The first 2 differences from compare_greek are seen to be explained by a gx.
A variation compare_greek1.txt was prepared to help distinguish which of the 53 are explained by gx. A visual examination shows that, based on counting, the entries with gx items are explained by the gx items, but there are 9 entries with differences where there is no gx entry in the ab version (ab gx text: []).
The differences in greek texts are explained variously:
homonym split in ab:
L=1081 bAhu
L=1176 balavat
L=1440 BayaNkara
L=8152 pArijAta
L=9478 sANKya
L=10996 tryambaka
L=11588 vareRya
other split in ab:
L=4092 gOtamI
AB updated from Addendum is the probable explanation, though there are too many Greek texts to readily check:
L=1510 BIzmavaDaparvan
The Greek text in these 'extra' L-codes of AB IS included in the freq statistics for ab version. So this is another source of differences.
This concludes my review of the differences.
https://github.com/sanskrit-lexicon/INM/issues/4#issuecomment-991433302
One 'temporary' detail is the presence of several <gx>X</gx> items in the L2 version. These mark Greek texts originating from the additions-corrections section of INM. Since I was not aiming to include those in the present work, it was necessary to identify them for alignment of other greek texts.
Because of the addenda and corrections, the csl and ab versions have different entries.
@funderburkjim, No intention (still) to add the addendum greek entries to INM, as was done in BEN recently?
Currently, in the csl version, there is a separate file inm_ac.txt containing digitization of 'ADDITIONS AND CORRECTIONS'.
In this file, there are 97 <lang n="greek"></lang> instances.
One of these is in the CONCORDANCE subsection, and the rest are in the 'INDEX TO THE NAMES' section(s).
It would be good to fill in these greek instances in inm_ac.txt.
@Andhrabharati Perhaps you would download inm_ac.txt and fill in the Greek and return to me?
Similarly, there are 6 Greek text fragments in inm_concord.txt. And these should also be filled.
Similarly, there are 7 Greek text fragments in inm_preface.txt. And these should also be filled.
If you are so kind as to do this, you could also remove the <lang n="greek"> and </lang> markup for cases which are labels.
Once this is done, the Greek text labels in INM will be completed in the basic csl digitization.
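Removing the markup for labels could be sketched as follows. This is an illustration only, not the procedure actually used; strip_greek_lang and its keep parameter are hypothetical names, and deciding which texts are labels is left to the caller.

```python
import re

# Filled Greek markup only; an empty <lang n="greek"></lang> does not match.
LANG_GREEK = re.compile(r'<lang n="greek">([^<]+)</lang>')

def strip_greek_lang(line, keep=frozenset()):
    """Drop the lang tags around Greek text unless the text is in keep."""
    def repl(m):
        return m.group(0) if m.group(1) in keep else m.group(1)
    return LANG_GREEK.sub(repl, line)

# The one non-label text (headword aSvaka), written with escapes.
WORD = '\u1f08\u03c3\u03c3\u03b1\u03ba\u03b7\u03bd\u03bf\u03af'
label = 'see <lang n="greek">α</lang> here'
non_label = '{%<lang n="greek">' + WORD + '</lang>%}'
unfilled = 'x <lang n="greek"></lang> y'

print(strip_greek_lang(label))
print(strip_greek_lang(non_label, keep={WORD}))
print(strip_greek_lang(unfilled))
```

Unfilled placeholders are deliberately left untouched, so instances that still need Greek text remain easy to find.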
OK, will do as asked. [Looks like you're exhausted with looking at my file(s)!]
Here are the three files filled with Greek letters; the punctuation etc. is untouched (which could be done consistent with the main text style). inm_ac (AB).txt inm_concord (AB).txt inm_preface (AB).txt
@Andhrabharati Thank you! Your files now installed in csl-orig/v02/inm/.
It appears that a file submission without much formatting changes is far-far easier to handle for you; but I rarely limit myself to that!
Yes, that is true.
Re the {%E.g%}. mistake: any others I should look for?
I had randomly spotted it on a quick look; let's leave the rest for now.
remove lang tag ?
Almost every Greek text in inm.txt is either (I think)
Example of reference. Under headword Arjuna, page 10b
Coding change suggested in inm.txt:
Example of usage in star name Under headword aBijit Page 1
Suggested coding change: