sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

Malten corrections (German words/bot tags) #101

Open funderburkjim opened 7 months ago

funderburkjim commented 7 months ago

@maltenth expressed an interest in resolving the spelling variations seen in the text of the <bot> tags in pwk.

Here is an 'hk' version of pw: pw_hk.zip. (Note: this is based on csl-orig/v02/pw/pw.txt at commit d847fe33dd4e2626ebf8869325e0bca452a5f20d.)

funderburkjim commented 7 months ago

bot frequency data
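For readers who want to reproduce this kind of tally, here is a minimal sketch, not the script actually used for the data above. It assumes botanical names are wrapped as `<bot>...</bot>` on a single line; the function name `bot_frequencies` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: each botanical name sits inside <bot>...</bot> on one line.
BOT_RE = re.compile(r"<bot>(.*?)</bot>")

def bot_frequencies(lines):
    """Count how often each distinct <bot> text occurs."""
    counts = Counter()
    for line in lines:
        for name in BOT_RE.findall(line):
            # Collapse runs of whitespace so spacing differences
            # are not counted as separate spellings.
            counts[" ".join(name.split())] += 1
    return counts

# Hypothetical sample lines, not taken from pw.txt.
sample = [
    "m. <bot>Ficus religiosa</bot>.",
    "f. <bot>Ficus  religiosa</bot>; auch <bot>Ficus indica</bot>.",
    "m. <bot>Ficus Indica</bot>.",
]

for name, n in sorted(bot_frequencies(sample).items()):
    print(n, name)
```

Sorting the output alphabetically puts near-identical spellings (here 'Ficus Indica' and 'Ficus indica') next to each other, which is one way to spot the variations this issue is about.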

funderburkjim commented 7 months ago

German word corrections.

See german directory.

funderburkjim commented 6 months ago

German word corrections from non-italic text

Work done in issue101/german1 directory.

429 lines changed.

pre_change1_regular.txt, pre_change1_irregular.txt, and pre_change1_thomas.txt describe the changes.

change_word_regular.txt, change_word_irregular.txt, and change_word_thomas.txt show the exact changes made.

Here 'regular' just means that the changes were generated by program from the pre-changes.

All the regular and irregular changes were developed by me by reference to the scans.

What remains to do?

  1. Bot-tag corrections: hopefully Thomas will now focus on this. When the bot-tag corrections are finished, this issue can be closed.
  2. Italic text markup: I think there are numerous errors in the italic markup. Fixing them is believed to be a difficult task. It should be tackled in a new issue, after the bot-tag corrections.
funderburkjim commented 3 months ago

Have heard from @maltenth -- He intends to work on the botanical names in pwk.

Andhrabharati commented 3 months ago

Nice to hear this, @funderburkjim !

Incidentally, it may be of some interest that while MW has just 45 unique instances incl. the Botanist's name(s), pwk has more than 4 times as many (~190).

More interesting is that PWG often has only the Botanist's name(s) separately, with a different Sanskrit name of the plant/tree having the same scientific name, in contrast to the current entry name.

And should we have the Botanist's name(s) also expanded somewhere? It is seen that over 50 such names occur in just these three works (PWG, pwk and MW).

Andhrabharati commented 2 months ago

Another observation:

While in MW more than 80% of the <bot> entries have a capitalised 2nd (or later) word, in pwk fewer than 30% do. [Interestingly, in PWG this ratio is close to (just above) one half.]

So the Capitalisation being made THE 'norm' in the Sc. names across the CDSL works "stands" debatable!!
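A minimal sketch of how such a ratio could be computed, for anyone wishing to check the figures. It is not the method used for the percentages above. It assumes `<bot>...</bot>` tags on a single line and counts unique names rather than all occurrences; the helper names and sample lines are hypothetical.

```python
import re

# Assumption: each scientific name sits inside <bot>...</bot> on one line.
BOT_RE = re.compile(r"<bot>(.*?)</bot>")

def has_capitalised_later_word(name):
    """True if any word after the first begins with an upper-case letter."""
    words = name.split()
    return any(w[:1].isupper() for w in words[1:])

def capitalised_share(lines):
    """Share of unique <bot> names whose 2nd (or later) word is capitalised."""
    names = {
        " ".join(n.split())
        for line in lines
        for n in BOT_RE.findall(line)
    }
    if not names:
        return 0.0
    hits = sum(1 for n in names if has_capitalised_later_word(n))
    return hits / len(names)

# Hypothetical sample lines, not taken from any dictionary file.
sample = [
    "<bot>Ficus religiosa</bot>",
    "<bot>Ficus Indica</bot>",
    "<bot>Acacia Catechu</bot>, <bot>Acacia arabica</bot>",
]

print(f"{capitalised_share(sample):.0%}")
```

Running the same function over the MW, pwk and PWG files separately would give one figure per dictionary for comparison. Counting all occurrences instead of unique names is an equally plausible reading and may give different percentages.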

gasyoun commented 2 months ago

And should we have the Botanist's name(s) also expanded somewhere?

Makes sense.

Capitalisation being made THE 'norm' in the Sc. names across the CDSL works "stands" debatable

Agree @Andhrabharati

Andhrabharati commented 2 months ago

Incidentally, it may be of some interest that while MW has just 45 unique instances incl. the Botanist's name(s), pwk has more than 4 times as many (~190).

See what G.J. Meulenbeld says reg. this--

[image]

Andhrabharati commented 2 months ago

Here is the quick summary of counts--

[image]

In pwk, I had opted NOT to have the Capitalisation.

My work on MW markup has helped a lot in marking (or changing from bot to zoo) the additional <zoo words (these could not be marked so earlier, as they were not clearly identifiable in pwk 'as-is').

Andhrabharati commented 2 months ago

BTW, I've found some more 'grouped' entries in pwk now.

Andhrabharati commented 2 months ago

An interesting observation:

While both MW and pwk have almost the same count of unique <zoo entities, pwk has nearly 200 more unique <bot entities than MW.

funderburkjim commented 2 months ago

@Andhrabharati Am working with Thomas on the bot issue. Will report here when the time appears right to do so.