Etymology coding in MW

funderburkjim commented 7 years ago

`e8` noticed by user

User pcipolla submitted a Correction:

L=102621.2, hw=na 2
Lat. ne8-   --> Lat. nĕ-
Comment: False scan/encoding: You may wish to check the whole document for just under two dozen
 other instances of "ĕ" falsely coded as "e8". Exempli gratia, for [L38692] and [L38693] sub voce ṛghā 
"e8re8ghant" should read "ĕrĕghant".

Indeed, 22 such cases were found. The '8' in 'e8' was the letter-number coding used by Thomas for adding the 'breve' diacritic. I also noticed 'u8', 'o8', and 'c8'.

Since we are now comfortable with using unicode IAST in the digitizations rather than the original letter-number coding, I've changed these to unicode e-breve, u-breve, o-breve, and c-caron (since there is no c-breve as a single unicode code point).
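A minimal sketch of that substitution, assuming only the four codes named above. The dictionary name and the function are illustrative, not the script that was actually run on the digitization:

```python
# Assumed mapping: only the four codes mentioned in this comment.
BREVE_CODES = {
    "e8": "\u0115",  # e-breve
    "u8": "\u016d",  # u-breve
    "o8": "\u014f",  # o-breve
    "c8": "\u010d",  # c-caron (no single code point exists for c-breve)
}


def replace_breve_codes(text):
    """Replace each letter-number breve code with its Unicode character."""
    for code, char in BREVE_CODES.items():
        text = text.replace(code, char)
    return text


print(replace_breve_codes("Lat. ne8-"))
print(replace_breve_codes("e8re8ghant"))
```

A blind string replacement like this would also touch any unrelated 'e8' or 'c8' sequence, so with only 22 cases it is safer to review each hit by hand.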

More to be done in Etymologies

I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW. These characters are coded in the letter-number format within MW.

Current displays change the AS coding.

While these etymology diacritics appear in mw.xml with the letter-number (AS) codings, the displays generally transcode them to Unicode characters. For instance, for headword 'pard', the display shows:

(H1) pard [p= 606] : cl.1 A1. (Dhātup.  ii, 28) to break wind downwards Sarasv.  i, 25. 
[cf. Gk. πέρδω ; Lat. pēdo, pōdex ; Lith. pérdżu ; 
Germ. farzen, furzen ; Angl.Sax. feortan ; Eng. fart.] 
[L=119581]

But the mw.xml coding still uses letter-numbers:

<H1><h><hc3>500</hc3><key1>pard</key1><hc1>1</hc1><key2>pard</key2></h><body> 
<vlex type="root"></vlex> <vlex>cl.1 A1.</vlex> <p><ls>Dha1tup._ii_,_28</ls></p> <c>
<to/>to_break_wind_downwards</c> <ls>Sarasv._i_,_25.</ls> <b><c><ab>cf.</ab>_<ab>Gk.
</ab>_<gk>1</gk>_;_<ab>Lat.</ab></c>~<etym>pe1do</etym>~,~<etym>po1dex</etym>~
<c>;_<ab>Lith.</ab></c>~<etym>pe4rdz3u</etym>~<c>;_<ab>Germ.</ab></c>~
<etym>farzen</etym>~,~<etym>furzen</etym>~<c>;_<ab>Angl.Sax.</ab></c>~
<etym>feortan</etym>~<c>;_<ab>Eng.</ab></c>~<etym>fart</etym>.</b> 
</body><tail><mul/> <MW>076691</MW> <pc>606,3</pc> <L>119581</L></tail></H1>

It would be good to change the coding in mw.xml to Unicode from letter-number in the etymology sections of mw.
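As a rough sketch of what such a conversion could look like, the following transcodes letter-number pairs only inside `<etym>` elements. The four mappings are inferred solely from the 'pard' example (display form versus mw.xml form); the full AS table is larger, and the names `AS_TO_UNICODE` and `transcode_etym` are hypothetical:

```python
import re

# Inferred from the 'pard' example only; NOT the complete AS table.
AS_TO_UNICODE = {
    "e1": "\u0113",  # e-macron, as in pe1do
    "o1": "\u014d",  # o-macron, as in po1dex
    "e4": "\u00e9",  # e-acute, as in pe4rdz3u
    "z3": "\u017c",  # z with dot above, as in pe4rdz3u
}

ETYM_RE = re.compile(r"<etym>(.*?)</etym>", re.DOTALL)
CODE_RE = re.compile(r"[A-Za-z][0-9]")


def transcode_etym(xml_text):
    """Transcode known letter-number codes inside <etym> elements only."""

    def fix_code(match):
        # Unknown codes are left unchanged so they can be reviewed later.
        return AS_TO_UNICODE.get(match.group(0), match.group(0))

    def fix_etym(match):
        return "<etym>" + CODE_RE.sub(fix_code, match.group(1)) + "</etym>"

    return ETYM_RE.sub(fix_etym, xml_text)


sample = "<ls>Dha1tup._ii_,_28</ls> <etym>pe1do</etym>~,~<etym>pe4rdz3u</etym>"
print(transcode_etym(sample))
```

Text outside `<etym>` (for example the `<ls>` content in the sample) is left untouched.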

⚠️ The letter-number scheme also appears in other sections of MW:

<ls> abbreviations of literary sources
<as0> IAST form of Sanskrit words.

It is not known whether the letter-number codings represent the same Unicode characters in these 'Sanskrit words' as they do in the non-Sanskrit words that appear in the etymologies. That's what the warning is about: Carrying through the Unicodification in the etymology sections must be done with care.
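One way to approach this carefully would be to inventory the letter-number pairs per tag first, and compare the lists before changing anything. A sketch, assuming the elements are not nested within themselves and that a simple regular expression is good enough for a first survey (the function name `code_inventory` is hypothetical):

```python
import re
from collections import Counter

CODE_RE = re.compile(r"[A-Za-z][0-9]")


def code_inventory(xml_text, tag):
    """Count letter-number pairs found inside <tag>...</tag> elements."""
    element_re = re.compile(
        r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL
    )
    counts = Counter()
    for match in element_re.finditer(xml_text):
        counts.update(CODE_RE.findall(match.group(1)))
    return counts


sample = "<ls>Dha1tup._ii_,_28</ls> <etym>pe1do</etym> <etym>pe4rdz3u</etym>"
etym_codes = code_inventory(sample, "etym")
ls_codes = code_inventory(sample, "ls")

# Codes used in both contexts are the ones that need a manual check.
shared = sorted(set(etym_codes) & set(ls_codes))
print(etym_codes, ls_codes, shared)
```

The pattern will also pick up a letter followed by an ordinary digit, so the resulting lists are candidates for review rather than a definitive table.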

gasyoun commented 7 years ago

c-caron

Oh, this defect Unicode.

I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW

Can you extract them, please? I would take a look at them.

It is not known whether the letter-number codings represent the same Unicode characters in these 'Sanskrit words' as they do in the non-Sanskrit words that appear in the etymologies. That's what the warning is about: Carrying through the Unicodification in the etymology sections must be done with care.

Not sure I understood, but I can fix at least part of the wrongness of etymologies.

gasyoun commented 6 years ago

cf. Lat. vir ; Lith. vy4ras ; Goth. wair ; Angl.Sax. wr, wre-wulf ; Eng. werewolf ; Germ. Werwolf, Wergeld. ] [L=203601]

Angl.Sax. words partly marked.

Andhrabharati commented 3 years ago

I also noticed 'u8' , 'o8', and 'c8' .

Since we are now comfortable with using unicode IAST in the digitizations rather than the original letter-number coding, I've changed these to unicode e-breve, u-breve, o-breve, and c-caron (since there is no c-breve as a single unicode code point).

c-caron

Oh, this defect Unicode.

This is not a Unicode defect; in all three places (Russian words) where the c8 occurs in mw_orig.txt, it does indicate c-caron only. It is a defect (wrong encoding) in the digitisation, I would say!

I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW

Can you extract them, please? I would take a look at them.

You're yet to persuade Jim to incorporate the Lithuanian words done a few months back, @gasyoun !

Andhrabharati commented 3 years ago

@drdhaval2785 are you interested in cleaning up the [cf.] blocks of etym. words, if I give out my data (lying in my folders for ~6 years now) ?

I had mentioned about this before, but just gave the Lithuanian portion in another thread.

drdhaval2785 commented 3 years ago

Sure. I would work out some way to incorporate the details.

Andhrabharati commented 3 years ago

it should be the similar way as you had worked with my file now - just a find/replacement of strings/lines.

Andhrabharati commented 3 years ago

let's try to do whatever is possible without bothering @funderburkjim.

for major points there should be some consensus anyway !

drdhaval2785 commented 3 years ago

Kindly give the files as you mention. I see one file in Lithuanian words issue. But as you are suggesting to give all the language works together, I would wait for it.

Andhrabharati commented 3 years ago

yes, I posted there first and then saw this and posted here.

That issue can be closed as this will cover it also.

I will give the file in a day or two, as it needs to be prepared in the "new format" I myself have introduced now.

Andhrabharati commented 3 years ago

however, that Lithuanian issue has one important piece that can still remain, maybe as a separate issue, as it is worth some consideration. And it was also proposed by me!

It is to have present wordforms as tooltips, for the archaic words in all those olden works. I am sure quite many of the words would have changed over the century+ period that has elapsed since they got printed in these dictionaries.

Andhrabharati commented 3 years ago

incidentally, this issue also carries the wrong notion that IAST encompasses all European languages.

It seems this went into the nerves and blood of the team, since the beginning!!

Andhrabharati commented 2 years ago

@drdhaval2785

seems I have to put some good amount of effort to bring my old data into present format of Cologne; too many changes took place in taggings over the past 5-6 years.

just searching with "lang" tag gave 1200+ lines in my old file, and only about 850 in the present cologne file.

thinking of how best to make my life (task) easy!

drdhaval2785 commented 2 years ago

Don't worry. We will figure out some way to reconcile the differences. But it would be necessary to see your files first.

Andhrabharati commented 2 years ago

After carefully looking at Cologne's present file and my old file, noticed some systematic changes.

My file has <lang> tag for all non-Skt. languages (incl. Prakrit!), and Cologne's present data has it only for Greek, Arabic & Persian. Other languages have it as <etym>; do not recall if I changed it while I was working on it to make it uniform across all languages, or it is changed by Cologne team sometime later.
My file has no accents in Devanagari strings, as we had "removed" them during our conversion those days for whatever reason.
Also there are many changes in tagging style & additionally introduced tags in Cologne data now.

So I am splitting my data into two parts now, one matching with Cologne's present <lang> tags and all others in another part. Probably I should be able to close this process by tonight.

And would leave it to you to handle the parts appropriately with this above info.

Andhrabharati commented 2 years ago

Forgot to mention another point, I had all the etym. portions starting with "[cf." in another line. Cologne data has some in same line at the end and some in another line.

So enough care should be taken to replace only part of the Cologne line, while incorporating my data. No full line replacements in such cases.
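A sketch of such a partial replacement, assuming each line holds at most one block that starts with "[cf." and ends at the next "]" with no nested brackets. The real Cologne lines may need a stricter pattern, and `replace_cf_block` is a hypothetical helper name:

```python
import re

# One bracketed block that starts with "[cf." and runs to the next "]".
CF_BLOCK_RE = re.compile(r"\[cf\.[^\]]*\]")


def replace_cf_block(cologne_line, corrected_block):
    """Swap only the first "[cf. ...]" block; keep the rest of the line."""
    if not CF_BLOCK_RE.search(cologne_line):
        return cologne_line  # nothing to replace, leave the line as it is
    # A function is used as the replacement so that backslashes in the
    # corrected text are not interpreted as regex escapes.
    return CF_BLOCK_RE.sub(lambda m: corrected_block, cologne_line, count=1)


line = "to break wind downwards [cf. Lith. pe4rdz3u ; Eng. fart.] [L=119581]"
print(replace_cf_block(line, "[cf. Lith. p\u00e9rd\u017cu ; Eng. fart.]"))
```

This works the same way whether the "[cf." block sits at the end of a longer line or on a line of its own.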

Andhrabharati commented 2 years ago

Successfully completed splitting my data, first portion aligned to the 825 <lang> lines of present cologne data, and second portion separated out as another file in the process.

Two important observations:

1. My file has some Greek words in Capital letters, and Cologne has them in small letters. Do not know the reason!

Probably something might've happened while I gave the Greek letters data for correction earlier, in Excel form; but even the Greek expert @jmigliori did not identify these!! (He seemed to have referred to the scan/book at some words.)

Giving the first portion data as Excel file, for human reading (I wanted to draw attention to two places that I had remarked). MW_lang lines.xlsx

@drdhaval2785 can make the text file out of it just by copy/paste, and try to look for the possible means of handling the indicated corrections/differences.

If Dhaval feels it is alright, I shall give the other 476 lines from my old data having <lang> tag, aligning them with <etym> lines of present Cologne data. [I will be waiting for his response before taking up this work.]

Incidentally, as I had extracted the <etym> strings from Cologne data and removed the lines having Greek, Arabic and Persian (as they are already covered above), only 380 have remained.

2. So my file has 96 lines extra, that are yet to be identified with corresponding present Cologne lines.
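For the first observation, case-only differences between the two files could be flagged automatically once the lines are aligned. A small sketch, assuming the data is available as (old, Cologne) string pairs; the sample words are made up for illustration and are not taken from either file:

```python
def case_only_differences(pairs):
    """Return the pairs that differ only in letter case."""
    return [
        (old, new)
        for old, new in pairs
        if old != new and old.casefold() == new.casefold()
    ]


# Hypothetical sample pairs, not real data from either file.
sample_pairs = [
    ("ΠΕΡΔΩ", "περδω"),    # differs only in case
    ("farzen", "farzen"),  # identical
    ("feortan", "fart"),   # genuinely different
]
print(case_only_differences(sample_pairs))
```

Pairs flagged this way still need to be checked against the scan to decide which casing is the printed one.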

Andhrabharati commented 2 years ago

BTW, the column B in Excel is hidden, which has the HW part from the Cologne lines.

It may be unhidden and used, if needed.

Andhrabharati commented 2 years ago

just a thought.

as there are just about 1200+ lines, probably it's better to read the present cologne data and correct them directly.

this should not take more than 3-4 days.

I had already spent more than a day and at least another half-day is needed to align the 2nd part lines from my file.

I had looked only at the tagged portions those days and not at other portions like punctuations etc.

it will be another reading from my side for those complete lines (and I am more "matured" now in the process), and also it saves Dhaval's time considerably.

what would @drdhaval2785 say about this?

Andhrabharati commented 2 years ago

this whole exercise that went now may be treated as some "show-off" (I do mean it in that sense, as that work has some flaws) that I did some work those days.

gasyoun commented 2 years ago

also it saves Dhaval's time considerably.

@drdhaval2785 must love it.

Andhrabharati commented 2 years ago

Anyway, first let Dhaval have a look at the file sent and try to make a plan about using its data, as that was the original idea agreed upon.

funderburkjim commented 2 years ago

whatever is possible without bothering @funderburkjim.

Good idea!

drdhaval2785 commented 2 years ago

funderburkjim commented 2 years ago

Good improvement!

sanskrit-lexicon / CORRECTIONS

Etymology coding in MW #362
