Greek text review - Githubissues

funderburkjim commented 1 year ago

This is a placeholder issue. In another issue (https://github.com/sanskrit-lexicon/BEN/issues/8), @Andhrabharati points out the need, for mw.txt, to

proofread the Greek text, as some errors were noticed incidentally. There are about 1100 instances to review.
add Greek text -- some is missing in the current mw.txt of cdsl.

When the time is right for review of mw.txt, further comments will be made here.

Andhrabharati commented 1 year ago

I have already proofread all the Greek and other languages' (cognate) strings, and filled in the various missing items at those places in all the languages.

Also made some better markups for these in my current review work.

funderburkjim commented 1 year ago

@Andhrabharati How should we proceed in order to apply your corrections to cdsl mw.txt?

Andhrabharati commented 1 year ago

I shall post the existing lines vs. my corrected lines, explaining how I changed a few markings at those places.

Then it should be easy for you to apply the corrections in the cdsl file, even if not going to change the markings (as per my proposal).

Andhrabharati commented 1 year ago

The only issue I see is that I have removed all the duplicate entries in my working file, so you will need to "look" for them in the cdsl file to repeat the corrections.

Andhrabharati commented 1 year ago

@funderburkjim

Here is the file with details for quick browsing -- lang string changes.pdf

And the summary is --

NO Error lines | 584
Greek related error lines | 91
Other lang. related error lines | 168
Misc. error lines | 441

I tried making a .tsv file with individual error types (at gist.github.com), as you suggested some time back in a different context -- https://gist.github.com/Andhrabharati/e82889769929d74df0fe35050d734504

Andhrabharati commented 1 year ago

I will be creating a separate issue with more description (at the mw-dev repo), and post the text file for your "use".

Andhrabharati commented 1 year ago

@funderburkjim

Pl. have a look at https://github.com/sanskrit-lexicon/mw-dev/issues/7#issue-1576531888

Andhrabharati commented 1 year ago

@funderburkjim

I just added some explanation on the topic at https://github.com/sanskrit-lexicon/mw-dev/issues/21#issue-1587358520

Hope you'd be taking up the corrections in the existing cdsl text itself (under v02), from various issues I am posting under mw-dev, as much as possible/applicable.

funderburkjim commented 1 year ago

Re lang string changes.pdf

@Andhrabharati Do you have a way to prepare a text file of this pdf? Reason: Using Acrobat 9, the file is quite easy to read, but not to process with programs. A text file (with UTF-8 encoding for non-ASCII characters) would be useful, I think. This would allow me to apply corrections. I've tried 'export' to text file with Acrobat, but with no useful results.

Andhrabharati commented 1 year ago

@funderburkjim

I have already posted the text file at mw-dev repo and mentioned the same above at https://github.com/sanskrit-lexicon/MWS/issues/153#issuecomment-1422992114; it might've skipped your notice!!

funderburkjim commented 1 year ago

@Andhrabharati Just found what appears to be the text file version (lang.string.changes.txt) at #7. So I think you can ignore the prior request.

Andhrabharati commented 1 year ago

And I hope you've seen this https://github.com/sanskrit-lexicon/MWS/issues/153#issuecomment-1418239936

funderburkjim commented 1 year ago

Yes. We can use the fact that you are, thus far, retaining the line-number correspondence between mw_AB.txt and mw.txt.
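A minimal sketch of how the line-number correspondence could be used: when two versions of a file have the same number of lines in the same order, corrections can be located by comparing position by position. The function name `changed_lines` and the sample lines are hypothetical; nothing here is taken from the actual mw.txt or mw_AB.txt.

```python
from itertools import zip_longest


def changed_lines(old_lines, new_lines):
    """Yield (line_number, old, new) wherever the two versions differ.

    Assumes both inputs keep the same line numbering, so line N of one
    corresponds to line N of the other. A missing line shows up as None.
    """
    pairs = zip_longest(old_lines, new_lines)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            yield number, old, new


# Invented stand-ins for a few lines of the two files.
cdsl = ["alpha", "beta \u03cc", "gamma"]   # U+03CC omicron with tonos
ab = ["alpha", "beta \u1f79", "gamma"]     # U+1F79 omicron with oxia
diffs = list(changed_lines(cdsl, ab))
for number, old, new in diffs:
    print(number, ascii(old), ascii(new))
```

With real files, the two lists would come from something like `open(path, encoding="utf-8").read().splitlines()`. Printing with `ascii()` makes visually identical characters distinguishable by code point.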

funderburkjim commented 1 year ago

question two similar characters

There are two visually similar Greek characters which appear in mw.txt digitization. See unicode_gk_o_out.txt.

@Andhrabharati I suggest we change the 'OXIA 1f79' (3 occurrences) to 'TONOS 03cc' (171 occurrences).

Agree?

Andhrabharati commented 1 year ago

I guess not, @funderburkjim !

See what @jmigliori was saying about the OXIA and TONOS two years back at https://github.com/sanskrit-lexicon/MWS/issues/89#issuecomment-753634001

Andhrabharati commented 1 year ago

Probably, all the TONOS occurrences may need to be changed to OXIA.

Let's ask Jonathan, if he can conclude the point.

funderburkjim commented 1 year ago

If we (reasonably) assume that the Greek text in MW is ancient Greek, and that OXIA is the likely accent in ancient Greek, then always using OXIA in mw.txt seems right.

@jmigliori Agree?

jmigliori commented 1 year ago

Yes, that’s my understanding

funderburkjim commented 1 year ago

@jmigliori Hi! Thanks for immediate reply.

@Andhrabharati I'll go ahead with changing the omicron-tonos to omicron-oxia in cdsl mw.txt.

Andhrabharati commented 1 year ago

@funderburkjim

The TONOS to OXIA change should cover all the letters, I feel.

ά > ά έ > έ ή > ή ί > ί ύ > ύ ώ > ώ

Wherever I seem to have typed the character, the OXIA has been used (as I had made my own keystrokes using autohotkey, using the ancient greek alphabet), and the earlier typed accent seems to be TONOS only.
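The letter-by-letter change described above could be sketched as a translation table. The code points are taken from the Unicode Greek and Greek Extended blocks; the names `TONOS_TO_OXIA` and `tonos_to_oxia` are hypothetical, and only lowercase letters are covered here as an illustration.

```python
import unicodedata

# Lowercase TONOS code points mapped to their OXIA counterparts.
# Written as escapes because the two forms look identical in most fonts.
TONOS_TO_OXIA = {
    "\u03ac": "\u1f71",  # alpha
    "\u03ad": "\u1f73",  # epsilon
    "\u03ae": "\u1f75",  # eta
    "\u03af": "\u1f77",  # iota
    "\u03cc": "\u1f79",  # omicron
    "\u03cd": "\u1f7b",  # upsilon
    "\u03ce": "\u1f7d",  # omega
    "\u0390": "\u1fd3",  # iota with dialytika
    "\u03b0": "\u1fe3",  # upsilon with dialytika
}
_TABLE = str.maketrans(TONOS_TO_OXIA)


def tonos_to_oxia(text):
    """Replace each TONOS vowel in text with the OXIA form."""
    return text.translate(_TABLE)


word = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # logos, typed with omicron-tonos
converted = tonos_to_oxia(word)
print([f"{ord(ch):04x}" for ch in converted])
print(unicodedata.name(converted[1]))
```

One caveat worth keeping in mind: the OXIA characters are canonically equivalent to the TONOS ones, so any tool that applies Unicode NFC normalization (`unicodedata.normalize("NFC", ...)`) will turn them back into TONOS. A file that is meant to keep OXIA should not be passed through such normalization.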

Andhrabharati commented 1 year ago

And probably this applies to all across cdsl repos, as they all are 19th century dictionaries, which are 'ancient'!!

funderburkjim commented 1 year ago

tonos_oxia.txt shows counts of greek text characters in mw with either 'tonos' or 'oxia' in the unicode name.

One additional 'tonos' character found. 1 ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS This also has an 'oxia' version ΐ 1fd3 Greek Small Letter Iota with Dialytika and Oxia
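A count like the one in tonos_oxia.txt could be produced along the following lines. This is only a sketch, not the script actually used; `accent_counts` is a hypothetical name and the sample string is invented.

```python
import unicodedata
from collections import Counter


def accent_counts(text):
    """Count characters whose Unicode name mentions TONOS or OXIA."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        if "TONOS" in name or "OXIA" in name:
            counts[ch] += 1
    return counts


# Invented sample: two omicron-tonos, one omicron-oxia,
# one iota with dialytika and tonos, plus unrelated text.
sample = "\u03cc\u03cc\u1f79\u0390 abc"
counts = accent_counts(sample)
for ch, n in sorted(counts.items()):
    print(n, ch, f"{ord(ch):04x}", unicodedata.name(ch))
```

Listing the code point and the Unicode name next to each character avoids relying on how a particular font draws it.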

gasyoun commented 1 year ago

I love how @Andhrabharati is starting to cooperate and not only telling how wrong we are. I'm his fan.

funderburkjim commented 1 year ago

lang.string.changes use completed.

The changes to Greek (including 'oxia') have been installed into (cdsl) mw.txt. Numerous other non-Greek improvements also made based on lang.string.changes file.

changes.txt has all the lines changed in mw.txt. (711).

Note that the use of <lang>X</lang> has not been followed in cdsl mw.txt.

That's enough for this issue. Closing issue.

Andhrabharati commented 1 year ago

@funderburkjim

I wonder why the point that this issue has started with https://github.com/sanskrit-lexicon/BEN/issues/8#issuecomment-1418030666 is not 'attended to'. Just looked for φ and found 15 of them still; I thought there are NONE such in the MW99 print. [And I did not feel like checking any further.]

Andhrabharati commented 1 year ago

Need to see, if they are not listed in my file above. [My present working file has none such.]

Anyway, these corrections (at least) are good enough for now in the current CDSL file.

funderburkjim commented 1 year ago

There seem to be two kinds of phi character:

     φ 03c6 GREEK SMALL LETTER PHI
     ϕ 03d5 GREEK PHI SYMBOL

As I understand it, you prefer to use the '03d5 GREEK PHI SYMBOL' character everywhere. Right?

funderburkjim commented 1 year ago

There is something odd going on in the display of these characters. For example, in this browser the above comment looks like

BUT, if I edit the above comment, the display of the two characters switches:

Andhrabharati commented 1 year ago

I have no preferences; just looked at what the print has.

It is not out of context to mention that different works used different variants of letters (print style vs. script style) [(φ > ϕ) : MW99; (κ > ϰ) : BUR and (β > ϐ) : BEN].

funderburkjim commented 1 year ago

In print kapAla, p. 250,2 of mw

Do all the 'phi' characters in MW look the same?

Andhrabharati commented 1 year ago

Your above post made me look at my file to note that <s>gabhaim</s> is not changed to <etym>gabhaim</etym> under <ab>Hib.</ab>

funderburkjim commented 1 year ago

This Hib. change should be made of course. But I obviously missed it in my study of the lang.string file. If you have a way to discover others that I missed, please provide the items and I'll incorporate them into cdsl mw.txt.

Andhrabharati commented 1 year ago

BTW, one cannot notice this φ > ϕ difference with the font being used (Old Standard Indologique) at CDSL displays, as it has same glyphs at both places.

As I am using a different font [I made it myself!!] with all proper glyphs rendered as per Unicode tables, I could see the actual characters in my file!!

funderburkjim commented 1 year ago

So there is still another font issue. Ugh! I am having a similar 'same glyph' issue with Emacs for the two phis.

Andhrabharati commented 1 year ago

I have noticed more such errors in this font; but that's a different matter altogether.

funderburkjim commented 1 year ago

Repeat question: Should all the mw instances of phi be rendered as φ 03c6 GREEK SMALL LETTER PHI ?

funderburkjim commented 1 year ago

I currently count, for the two phis,

72 matches in 55 lines for "ϕ" in buffer: temp_mw_5.txt
15 matches in 14 lines for "φ" in buffer: temp_mw_5.txt

Should I change all those 72 to 03c6 ϕ" ?

Andhrabharati commented 1 year ago

As I mentioned above, web users would not notice the difference (as there is no way to change the font on the client side); but this is not the letter in the MW print, as you yourself have posted a snippet above with the actual letter there.

[The print has ϕ everywhere.]

funderburkjim commented 1 year ago

I'll make that change to 03c6 ϕ

We could introduce another web-font for the displays to be used just for the Greek text.

Any suggestions for such a font?

Andhrabharati commented 1 year ago

Just note that 03c6 is not ϕ.

https://unicode.org/charts/PDF/U0370.pdf

And let the font remain as is.

I see every public font having some issue or the other, so I had to make my own font(s) for my working.

funderburkjim commented 1 year ago

The remaining 'small letter phi' characters have been changed to 'phi symbol'. See change_6.txt.

Note: On my Windows 11 PC, this change_6.txt file displays the characters wrongly; it is using the Consolas font (the Windows default monospace font). It is possible to use developer tools to (temporarily) edit the font-family properties of the Greek text so as not to use monospace -- then the Segoe UI font is used, which displays the two phis correctly.

The 'Microsoft Sans Serif' font displays these two phi characters correctly, but it is not a monospace font.
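Because fonts can swap or merge the two phi glyphs, it is safer to work with code points than with what the screen shows. A minimal sketch of counting the two characters and converting the letter to the symbol, with hypothetical function names and an invented sample string:

```python
import unicodedata

PHI_LETTER = "\u03c6"  # GREEK SMALL LETTER PHI
PHI_SYMBOL = "\u03d5"  # GREEK PHI SYMBOL


def phi_report(text):
    """Return {unicode name: count} for the two phi characters."""
    return {
        unicodedata.name(ch): text.count(ch)
        for ch in (PHI_LETTER, PHI_SYMBOL)
    }


def use_phi_symbol(text):
    """Replace every U+03C6 with U+03D5, the form reported for the print."""
    return text.replace(PHI_LETTER, PHI_SYMBOL)


# Invented sample containing one of each phi.
sample = "\u03c6\u03ad\u03c1\u03c9 \u03d5\u03cd\u03c9"
print(phi_report(sample))
print(phi_report(use_phi_symbol(sample)))
```

Note that U+03D5 has a compatibility mapping to U+03C6, so NFKC or NFKD normalization would undo this change; NFC and NFD leave it alone.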

funderburkjim commented 1 year ago

The 'kapAla' correction also made.

Just note that 03c6 is not ϕ.

Right - I have been tossed around on the waves of font 'errors', and also by https://www.compart.com/en/unicode/U+03C6

and https://www.compart.com/en/unicode/U+03D5

The reference https://www.unicode.org/charts/PDF/U0370.pdf (mentioned by AB above) is thought to be correct:

The fileformat website (e.g. https://www.fileformat.info/info/unicode/char/03d5/index.htm) is also believed to be correct in this detail.

funderburkjim commented 1 year ago

I dare to close this issue once again! Will it stay closed?

funderburkjim commented 1 year ago

Have I missed anything that needs doing with mw and drawn from the Benfey link mentioned in first comment?

Andhrabharati commented 1 year ago

"I dare to close this issue once again! Will it stay closed?"

Though I did not want to spend more time, a 2nd thought forced me to look at the (corrected) mw.txt and the mw_AB.txt again.

Found a good many changes that were 'left/missed', both in the Greek portions and the rest (esp. wrt diacs and hyphenation) [even if ignoring the Indic language words, for whatever reason]!!

As Jim mostly seemed not that serious about the other languages portion, I am just posting the Greek portion: AB vs. CDSL greek text differences.txt

Andhrabharati commented 1 year ago

"If you have a way to discover others that I missed"

I just extracted (from CDSL file and AB file) the greek strings set & the etym strings set and did a comparison between them, to yield the differences. [It was only a few minutes' effort!!]
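The extract-and-compare approach described above might look roughly like this. It is a sketch under assumptions: Greek strings are found by Unicode range rather than by any tag (the tagging of Greek in the files is not shown in this thread), `<etym>` is the tag mentioned in the discussion, and all names and sample lines are hypothetical.

```python
import re
from collections import Counter

# Runs of characters from the Greek and Greek Extended blocks.
GREEK_RUN = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]+")
# Contents of <etym>...</etym> elements.
ETYM = re.compile(r"<etym>(.*?)</etym>")


def extract(lines, pattern):
    """Collect every match of pattern, with multiplicity."""
    return Counter(m for line in lines for m in pattern.findall(line))


def differences(cdsl_lines, ab_lines, pattern):
    """Return (only in CDSL, only in AB) as sorted lists of strings."""
    cdsl = extract(cdsl_lines, pattern)
    ab = extract(ab_lines, pattern)
    return sorted((cdsl - ab).elements()), sorted((ab - cdsl).elements())


# Invented one-line stand-ins for the two files.
cdsl_lines = ["x \u03c6\u03ad\u03c1\u03c9 <s>gabhaim</s>"]
ab_lines = ["x \u03d5\u1f73\u03c1\u03c9 <etym>gabhaim</etym>"]

greek_diff = differences(cdsl_lines, ab_lines, GREEK_RUN)
etym_diff = differences(cdsl_lines, ab_lines, ETYM)
print([ascii(s) for s in greek_diff[0]], [ascii(s) for s in greek_diff[1]])
print(etym_diff)
```

Using `Counter` instead of a plain set keeps repeated strings, so a correction that applies to several occurrences is not collapsed into one.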

Andhrabharati commented 1 year ago

Just listing various types of corrections in non-greek portion that got 'left/missed', for whatever worth it has--

Misc. corrections (89)
Un-tagged words (10)
Diacritic separately marked (60)
Hyphenations (6)
<etym> to <s> and vice-versa (30)
non-Skt. Indic words (?)

Andhrabharati commented 1 year ago

"If you have a way to discover others that I missed, please provide the items and I'll incorporate"

@funderburkjim

If you look for "Wrong Tag" entries in my above file (lang.string.changes.txt), you'd get all those <etym> to <s> and vice-versa words [just look for <s> tag under the entry and then check both old and new lines if they differ]; it appears that you had missed quite many corrections in my file (for unknown reasons).
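The check described above (same word, different tag in the old and new line) could be automated along these lines. This is an illustrative sketch only: `tag_changes` is a hypothetical helper, and it assumes the tagged text itself is identical in both versions, which may not hold for every entry.

```python
import re

# Matches <s>...</s> or <etym>...</etym>, capturing the tag and the text.
TAGGED = re.compile(r"<(s|etym)>(.*?)</\1>")


def tag_changes(old_line, new_line):
    """Return (text, old_tag, new_tag) for words whose tag differs."""
    old = {text: tag for tag, text in TAGGED.findall(old_line)}
    new = {text: tag for tag, text in TAGGED.findall(new_line)}
    return [
        (text, old[text], new[text])
        for text in old
        if text in new and old[text] != new[text]
    ]


# The Hib. example from this thread, as old and new lines.
old_line = "<ab>Hib.</ab> <s>gabhaim</s>"
new_line = "<ab>Hib.</ab> <etym>gabhaim</etym>"
changes = tag_changes(old_line, new_line)
print(changes)
```

Run over each old/new line pair, this would list both `<s>` to `<etym>` and `<etym>` to `<s>` changes.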

As I have made a new file (yesterday) with easier usability (in mind), I could post that file if you're still "interested" in this issue.

funderburkjim commented 1 year ago

25 more greek corrections

corrections from AB.vs.CDSL.greek.text.differences.txt. See change_7.txt.

At least some of these were missed in earlier work above because not mentioned as 'Greek' in lang.string.changes.txt file. For instance δολιχός occurs in line labeled '(313553): Others [Spelling Error]'. Glad you took the time to help correct these omissions.

@Andhrabharati I would like to address additional corrections you mention at comment above.

I'll start with the <etym><s> items.

If you have an easy way to identify these, please post.

Andhrabharati commented 1 year ago

Glad that you're still on this issue, @funderburkjim !

Here is the file I made for CDSL corrections, which has the entries in opp. order [CDSL > AB] (wrt the Greek corrections file)-- CDSL_cognates corrections.txt

And, if you are willing to mark the other (mostly Indic, and few English) cognates, they can be seen in this addl. cognates file-- AB_addl. cognates.txt

sanskrit-lexicon / MWS

Greek text review #153
