Closed funderburkjim closed 1 year ago
I have already proofed all the Greek and other languages' (cognate) strings, and filled in various omissions at those places in all the languages.
Also made some better markups for these in my current review work.
@Andhrabharati How should we proceed in order to apply your corrections to cdsl mw.txt?
I shall post the existing lines vs. my corrected lines, explaining how I changed a few markings at those places.
Then it should be easy for you to apply the corrections in the cdsl file, even if not going to change the markings (as per my proposal).
Only issue I see is that I had removed all the duplicate entries in my working, so you need to "look" for them in cdsl file to repeat the corrections.
@funderburkjim
Here is the file with details for quick browsing -- lang string changes.pdf
And the summary is -- NO Error lines | 584 Greek related error lines | 91 Other lang. related error lines | 168 Misc. error lines | 441
I tried making a .tsv file with individual error types (at gist.github.com), as you suggested sometime back on a different context -- https://gist.github.com/Andhrabharati/e82889769929d74df0fe35050d734504
I will be creating a separate issue with more description (at the mw-dev repo), and post the text file for your "use".
@funderburkjim
Pl. have a look at https://github.com/sanskrit-lexicon/mw-dev/issues/7#issue-1576531888
@funderburkjim
I just added some explanation on the topic at https://github.com/sanskrit-lexicon/mw-dev/issues/21#issue-1587358520
Hope you'd be taking up the corrections in the existing cdsl text itself (under v02), from various issues I am posting under mw-dev, as much as possible/applicable.
@Andhrabharati Do you have a way to prepare a text file of this pdf? Reason: Using Acrobat 9, the file is quite easy to read, but not to process with programs. A text file (with UTF-8 coding for non-ASCII characters) would be useful, I think. This would allow me to apply corrections. I've tried 'export' to text file with Acrobat, but with no useful results.
@funderburkjim
I have already posted the text file at mw-dev repo and mentioned the same above at https://github.com/sanskrit-lexicon/MWS/issues/153#issuecomment-1422992114; it might've skipped your notice!!
@Andhrabharati Just found what appears to be the text file version (lang.string.changes.txt) at #7. So I think you can ignore the prior request.
And I hope you've seen this https://github.com/sanskrit-lexicon/MWS/issues/153#issuecomment-1418239936
Yes. We can use the fact that you are, thus far, retaining the line-number correspondence between mw_AB.txt and mw.txt.
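A minimal sketch of how the line-number correspondence could be used: if both files have the same number of lines, a change can be located by comparing line N of one file with line N of the other. The function name and the sample lines are invented for illustration; they are not taken from mw.txt or mw_AB.txt.

```python
from itertools import zip_longest


def changed_lines(old_lines, new_lines):
    """Yield (line_number, old, new) for each position where the lines differ.

    Line numbers are 1-based. If one input is shorter, the missing
    lines are reported as None.
    """
    pairs = zip_longest(old_lines, new_lines)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            yield number, old, new


# Invented sample data standing in for the two files.
cdsl_sample = ["first line", "cf. Gk. \u03bb\u03cc\u03b3\u03bf\u03c2", "last line"]
ab_sample = ["first line", "cf. Gk. \u03bb\u1f79\u03b3\u03bf\u03c2", "last line"]
diffs = list(changed_lines(cdsl_sample, ab_sample))
```

With real files, the two lists would come from reading each file as UTF-8 and splitting it into lines.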
There are two visually similar Greek characters which appear in mw.txt digitization. See unicode_gk_o_out.txt.
@Andhrabharati suggest we change the 'OXIA 1f79' (3 times) to 'TONOS 03cc' (171 times).
Agree?
I guess not, @funderburkjim !
See what @jmigliori was saying about the OXIA and TONOS two years back at https://github.com/sanskrit-lexicon/MWS/issues/89#issuecomment-753634001
Probably, all the TONOS occurrences may need to be changed to OXIA.
Let's ask Jonathan, if he can conclude the point.
If we (reasonably) assume that the Greek text in MW is from ancient Greek, and that OXIA is likely in Ancient Greek, then always using OXIA in mw.txt seems right.
@jmigliori Agree?
Yes, that’s my understanding
@jmigliori Hi! Thanks for immediate reply.
@Andhrabharati I'll go ahead with changing the omicron-tonos to omicron-oxia in cdsl mw.txt.
@funderburkjim
The TONOS to OXIA change should cover all the letters, I feel.
ά > ά έ > έ ή > ή ί > ί ύ > ύ ώ > ώ
Wherever I seem to have typed the character, the OXIA has been used (as I had made my own keystrokes using autohotkey, using the ancient greek alphabet), and the earlier typed accent seems to be TONOS only.
And probably this applies to all across cdsl repos, as they all are 19th century dictionaries, which are 'ancient'!!
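The TONOS to OXIA change for all the vowels could be sketched as a translation table. The code points are from the Unicode Greek and Greek Extended blocks; the table and function names are assumptions made for this sketch, not the script actually used on mw.txt. One caveat worth checking in any workflow: the OXIA characters canonically decompose to their TONOS counterparts, so running NFC normalization afterwards would undo the change.

```python
import unicodedata

# TONOS code point (Greek block) -> OXIA code point (Greek Extended block).
TONOS_TO_OXIA = {
    0x03AC: 0x1F71,  # alpha
    0x03AD: 0x1F73,  # epsilon
    0x03AE: 0x1F75,  # eta
    0x03AF: 0x1F77,  # iota
    0x03CC: 0x1F79,  # omicron
    0x03CD: 0x1F7B,  # upsilon
    0x03CE: 0x1F7D,  # omega
}


def tonos_to_oxia(text):
    """Replace each lowercase TONOS vowel with its OXIA counterpart."""
    return text.translate(TONOS_TO_OXIA)


converted = tonos_to_oxia("\u03bb\u03cc\u03b3\u03bf\u03c2")
# NFC maps the OXIA character back to the TONOS one.
renormalized = unicodedata.normalize("NFC", converted)
```

Writing the characters as code points in the source avoids the risk of an editor silently normalizing them.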
tonos_oxia.txt shows counts of greek text characters in mw with either 'tonos' or 'oxia' in the unicode name.
One additional 'tonos' character found.
1 ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
This also has an 'oxia' version
ΐ 1fd3 Greek Small Letter Iota with Dialytika and Oxia
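A count like the one in tonos_oxia.txt could be produced roughly as follows. This is a sketch under the assumption that the selection is done on the Unicode character name; the function names are invented here, and the actual program behind tonos_oxia.txt is not shown in this thread.

```python
import unicodedata
from collections import Counter


def tonos_oxia_counts(text):
    """Count characters whose Unicode name mentions TONOS or OXIA."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        if "TONOS" in name or "OXIA" in name:
            counts[ch] += 1
    return counts


def report(counts):
    """Format one line per character: count, character, hex code, name."""
    return [
        f"{n} {ch} {ord(ch):04x} {unicodedata.name(ch)}"
        for ch, n in sorted(counts.items())
    ]


# Invented sample text: two omicron-tonos, one omicron-oxia, one iota with dialytika and tonos.
sample = "\u03cc\u03cc\u1f79 \u0390 abc"
lines = report(tonos_oxia_counts(sample))
```

The iota with dialytika and tonos (0390) could then be mapped to 1fd3 in the same way as the other letters.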
I love how @Andhrabharati is starting to cooperate and not only telling how wrong we are. I'm his fan.
The changes to Greek (including 'oxia') have been installed into (cdsl) mw.txt. Numerous other non-Greek improvements also made based on lang.string.changes file.
changes.txt has all the lines changed in mw.txt. (711).
Note that the use of <lang>X</lang> has not been followed in cdsl mw.txt.
That's enough for this issue. Closing issue.
@funderburkjim
I wonder why the point that this issue has started with https://github.com/sanskrit-lexicon/BEN/issues/8#issuecomment-1418030666 is not 'attended to'. Just looked for φ and found 15 of them still; I thought there are NONE such in the MW99 print. [And I did not feel like checking any further.]
Need to see, if they are not listed in my file above. [My present working file has none such.]
Anyway, these corrections (at least) are good enough for now in the current CDSL file.
There seem to be two kinds of phi character:
φ 03c6 GREEK SMALL LETTER PHI
ϕ 03d5 GREEK PHI SYMBOL
As I understand it, you prefer to use the '03d5 GREEK PHI SYMBOL' character everywhere. Right?
There is something odd going on in the display of these characters. For example, in this browser the above comment looks like
BUT, if I edit the above comment, the display of the two characters switches:
I have no preferences; just looked at what the print has.
It is not out of context to mention that different works used different variants of letters (print style vs. script style) [(φ > ϕ) : MW99; (κ > ϰ) : BUR and (β > ϐ) : BEN].
In print kapAla, p. 250,2 of mw
Do all the 'phi' characters in MW look the same?
Your above post made me look at my file to note that <s>gabhaim</s> is not changed to <etym>gabhaim</etym> under <ab>Hib.</ab>
This Hib. change should be made of course. But I obviously missed it in my study of the lang.string file. If you have a way to discover others that I missed, please provide the items and I'll incorporate into cdsl mw.txt.
BTW, one cannot notice this φ > ϕ difference with the font being used (Old Standard Indologique) at CDSL displays, as it has same glyphs at both places.
As I am using a different font [I made it myself!!] with all proper glyphs rendered as per Unicode tables, I could see the actual characters in my file!!
So there is still another font issue. Ugh! I am having a similar 'same glyph' issue with Emacs for the two phis.
I have noticed more such errors in this font; but that's a different matter altogether.
Repeat question: Should all the mw instances of phi be rendered as φ 03c6 GREEK SMALL LETTER PHI?
I currently count, for the two phis,
72 matches in 55 lines for "ϕ" in buffer: temp_mw_5.txt
15 matches in 14 lines for "φ" in buffer: temp_mw_5.txt
Should I change all those 72 to 03c6 ϕ" ?
As I mentioned above, the web users would not notice the difference (as there is no way to change the font on the client side); but this is not the letter in MW print, as you yourself have posted a snippet above with the actual letter there.
[The print has ϕ everywhere.]
I'll make that change to 03c6 ϕ
We could introduce another web-font for the displays to be used just for the Greek text.
Any suggestions for such a font?
Just note that 03c6 is not ϕ.
https://unicode.org/charts/PDF/U0370.pdf
And let the font remain as is.
I see every public font having some issue or the other, so I had to make my own font(s) for my working.
The remaining 'small letter phi' changed to 'phi symbol' See change_6.txt.
Note: In my windows 11 pc, this change_6.txt file displays the characters wrong; it is using the Consolas Font (Windows monospace default font). It is possible to use developer tools to (temporarily) edit the font-family properties of the greek text not to use monospace -- then Segoe UI font is used, which displays the two phis correctly.
The 'Microsoft Sans Serif' font displays these two phi characters correctly, but it is not a monospace font.
The 'kapAla' correction also made.
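Because fonts render the two phis inconsistently, it may be safer to count and replace them by code point rather than by eye. The following is a sketch only; the names are invented, and it is not the program that produced change_6.txt. Note also that NFKC normalization maps 03d5 back to 03c6, so a compatibility normalization step would reverse the change.

```python
import unicodedata

PHI_LETTER = "\u03c6"  # GREEK SMALL LETTER PHI
PHI_SYMBOL = "\u03d5"  # GREEK PHI SYMBOL


def phi_counts(text):
    """Return the number of occurrences of each phi, keyed by hex code."""
    return {"03c6": text.count(PHI_LETTER), "03d5": text.count(PHI_SYMBOL)}


def use_phi_symbol(text):
    """Replace every GREEK SMALL LETTER PHI with GREEK PHI SYMBOL."""
    return text.replace(PHI_LETTER, PHI_SYMBOL)


# Invented sample: one of each phi.
sample = "\u03c6\u03ad\u03c1\u03c9 \u03d5\u1f73\u03c1\u03c9"
before = phi_counts(sample)
after = phi_counts(use_phi_symbol(sample))
```

Printing the hex code next to each character, as the counts above do, sidesteps the display problem entirely.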
Just note that 03c6 is not ϕ.
Right - I have been tossed around on the waves of font 'errors', and also by https://www.compart.com/en/unicode/U+03C6 and https://www.compart.com/en/unicode/U+03D5
The reference https://www.unicode.org/charts/PDF/U0370.pdf (mentioned by AB above) is thought to be correct:
The fileformat website (e.g. https://www.fileformat.info/info/unicode/char/03d5/index.htm) is also believed to be correct in this detail.
I dare to close this issue once again! Will it stay closed?
Have I missed anything that needs doing with mw and drawn from the Benfey link mentioned in first comment?
I dare to close this issue once again! Will it stay closed?
Though I did not want to spend more time, a 2nd thought forced me to look at the (corrected) mw.txt and the mw_AB.txt again.
Found good many changes that were 'left/missed', both in Greek portions and the rest (esp. wrt diacs and hyphenation) [even if ignoring the Indic language words, for whatever reason]!!
As Jim mostly seemed to be not that serious about the other languages portion, just posting the Greek portion -- AB vs. CDSL greek text differences.txt
If you have a way to discover others that I missed
I just extracted (from CDSL file and AB file) the greek strings set & the etym strings set and did a comparison between them, to yield the differences. [It was only a few minutes' effort!!]
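The set comparison described here might look something like the sketch below. It assumes that Greek strings can be picked out as runs of characters from the Greek and Greek Extended blocks and that etym strings sit between <etym> and </etym> tags; the function names and sample lines are invented, and the actual extraction method is not given in this thread.

```python
import re

# Runs of characters from the Greek (0370-03ff) and Greek Extended (1f00-1fff) blocks.
GREEK_RUN = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]+")
# Text enclosed in <etym>...</etym>.
ETYM = re.compile(r"<etym>(.*?)</etym>")


def strings_matching(lines, pattern):
    """Collect the set of all strings the pattern finds in the lines."""
    found = set()
    for line in lines:
        found.update(pattern.findall(line))
    return found


def compare(cdsl_lines, ab_lines, pattern):
    """Return (only in CDSL, only in AB) as two sorted lists."""
    cdsl = strings_matching(cdsl_lines, pattern)
    ab = strings_matching(ab_lines, pattern)
    return sorted(cdsl - ab), sorted(ab - cdsl)


# Invented sample lines; the first differs only in omicron-tonos vs. omicron-oxia.
cdsl_sample = ["cf. \u03b4\u03bf\u03bb\u03b9\u03c7\u03cc\u03c2", "<s>gabhaim</s>"]
ab_sample = ["cf. \u03b4\u03bf\u03bb\u03b9\u03c7\u1f79\u03c2", "<etym>gabhaim</etym>"]
greek_diff = compare(cdsl_sample, ab_sample, GREEK_RUN)
etym_diff = compare(cdsl_sample, ab_sample, ETYM)
```

A set comparison ignores where in the file a string occurs, so it finds strings present in only one file but not repeated occurrences that were corrected in some places and missed in others.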
Just listing various types of corrections in non-greek portion that got 'left/missed', for whatever worth it has--
<etym> to <s> and vice-versa (30)
If you have a way to discover others that I missed, please provide the items and I'll incorporate
@funderburkjim
If you look for "Wrong Tag" entries in my above file (lang.string.changes.txt), you'd get all those <etym> to <s> and vice-versa words [just look for <s> tag under the entry and then check both old and new lines if they differ]; it appears that you had missed quite many corrections in my file (for unknown reasons).
As I have made a new file (yesterday) with easier usability (in mind), I could post that file if you're still "interested" in this issue.
corrections from AB.vs.CDSL.greek.text.differences.txt. See change_7.txt.
At least some of these were missed in earlier work above because not mentioned as 'Greek' in lang.string.changes.txt file. For instance δολιχός occurs in line labeled '(313553): Others [Spelling Error]'. Glad you took the time to help correct these omissions.
@Andhrabharati I would like to address additional corrections you mention at comment above.
I'll start with the <etym><s> items.
If you have an easy way to identify these, please post.
Glad that you're still on this issue, @funderburkjim !
Here is the file I made for CDSL corrections, which has the entries in opp. order [CDSL > AB] (wrt the Greek corrections file)-- CDSL_cognates corrections.txt
And, if you are willing to mark the other (mostly Indic, and few English) cognates, they can be seen in this addl. cognates file-- AB_addl. cognates.txt
This is a placeholder issue. In another issue (https://github.com/sanskrit-lexicon/BEN/issues/8), @Andhrabharati points out the need, for mw.txt, to
When the time is right for review of mw.txt, further comments will be made here.