Klassische deutsche Rechtschreibung

gasyoun commented 10 years ago

@funderburkjim This follows up on the request in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/8#issuecomment-59296504: "Marcis - do you have a 'German word list' (a digital German dictionary or word list) that might be used to kick out candidates for misspelled German words in PW, PWG, CCS?"

No, I do not have one, but yes, let's start the trip. I found exactly what we are looking for: Klassische deutsche Rechtschreibung, a dictionary of the old German spelling (the orthography of 1901) with a list of 235298 words, in .OXT format.

The encoding is broken, similar to the case in http://stackoverflow.com/questions/1344692/i-need-help-fixing-broken-utf8-encoding. I emailed Bjoern Jacke and Franz Michael Baumann about the encoding used. The reply:

In the hunspell dictionary generated by igerman98, the encoding is ISO-8859-1.

So the approach in http://stackoverflow.com/questions/3990700/iso-8895-1-to-xml-acceptable-utf-8 should work. We do not need to go the route of http://askubuntu.com/questions/72099/how-to-install-a-libreoffice-dictionary-spelling-check-thesaurus, because https://www.sublimetext.com/forum/viewtopic.php?f=3&t=6127 did the job.
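
For reference, the same re-encoding can also be done in a few lines of Python. This is only a minimal sketch, assuming the word list really is ISO-8859-1 as reported in the reply; the function name `latin1_to_utf8` and the file names in the usage line are hypothetical.

```python
from pathlib import Path


def latin1_to_utf8(src: str, dst: str) -> None:
    """Re-encode a text file from ISO-8859-1 to UTF-8 (hypothetical helper)."""
    # Every byte sequence is valid ISO-8859-1, so this decode cannot fail;
    # a wrong source encoding shows up as wrong characters, not as an error.
    text = Path(src).read_bytes().decode("iso-8859-1")
    # Work on bytes in both directions so line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Usage would look like `latin1_to_utf8("wordlist.dic", "wordlist.utf8.dic")`. Note that a hunspell .aff file usually declares its encoding in a `SET` line, which would have to be changed as well if the converted files are used as a dictionary again.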

The journey starts at https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/dict-de_de-1901_oldspell_2014-02-21 - I hope I'll have a UTF-8 compatible list in a short while.

funderburkjim commented 10 years ago

See https://dl.dropboxusercontent.com/u/29859999/ccs_all1.zip

There happens to be a Python module, pyenchant (https://pythonhosted.org/pyenchant/), which allows easy access to xspell-compatible dictionaries, such as de_DE_OLDSPELL. Hurray!

It seems to give good results, and there appears to be no need for concern about the ISO-8859-1 encoding.

Here, in short, is how this dictionary has been used thus far:

Start with ccs.xml
Categorize words appearing in the body of each record into one of 4 types, based upon the markup:
  a. "D" = Devanagari, coded as slp1
  b. "I" = Italicized words (which contain no digit)
  c. "O" = Other words (which contain no digit)
  d. "N" = Non-Devanagari words which contain a digit (most are Anglicized Sanskrit coding of IAST)
Write out a file of all such words, counting the frequency (all.txt). The format is:

  <word>:<count>:<code>:X   (<code> is D,I,O,N) (X means not-explained)

There are 62940 lines in all.txt.
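
The categorizing and counting steps can be sketched roughly as follows. This is not the actual script: it assumes the words have already been extracted from ccs.xml together with two markup flags (Devanagari, italic), and the names `classify` and `frequency_lines` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def classify(word: str, is_devanagari: bool, is_italic: bool) -> str:
    """Return the type code D, I, O or N following the rules listed above."""
    if is_devanagari:
        return "D"
    if any(ch.isdigit() for ch in word):
        return "N"  # non-Devanagari word containing a digit
    return "I" if is_italic else "O"


def frequency_lines(tokens: Iterable[Tuple[str, bool, bool]]) -> List[str]:
    """Count (word, code) pairs and format them as <word>:<count>:<code>:X."""
    counts = Counter(
        (word, classify(word, dev, ital)) for word, dev, ital in tokens
    )
    # Every word starts out as X (not explained).
    return [f"{word}:{n}:{code}:X" for (word, code), n in sorted(counts.items())]
```

Each token is a `(word, is_devanagari, is_italic)` tuple; how those flags are derived from the markup is left out here.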

Explain the German words (that is, mark those that the dictionary accepts) using the pyenchant interface to de_DE_OLDSPELL. This applies only to codes I or O. The result is all1.txt. Here is a summary using the prettytable module:

Summary for words of type 'D'
+--------+-------+
| Status |  Freq |
+--------+-------+
| Total  | 36999 |
| X      | 36999 |
+--------+-------+
Summary for words of type 'I'
+-------------------+------+
| Status            | Freq |
+-------------------+------+
| Total             | 2284 |
| X                 |  794 |
| OK=de_DE_OLDSPELL | 1490 |
+-------------------+------+
Summary for words of type 'O'
+-------------------+-------+
| Status            |  Freq |
+-------------------+-------+
| Total             | 23204 |
| X                 |  5648 |
| OK=de_DE_OLDSPELL | 17556 |
+-------------------+-------+
Summary for words of type 'N'
+--------+------+
| Status | Freq |
+--------+------+
| Total  |  453 |
| X      |  453 |
+--------+------+
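
The 'explain' step could look roughly like the sketch below. It is an illustration, not the script that produced all1.txt: the function names are hypothetical, and it assumes that all1.txt keeps the colon-separated layout of all.txt with X replaced by OK=de_DE_OLDSPELL. The only pyenchant calls used are `enchant.dict_exists`, `enchant.Dict` and `Dict.check`.

```python
from typing import Callable, Iterable, List


def load_checker(tag: str = "de_DE_OLDSPELL") -> Callable[[str], bool]:
    """Return a spell-check function backed by pyenchant."""
    import enchant  # imported lazily; needs pyenchant and the dictionary

    if not enchant.dict_exists(tag):
        raise RuntimeError(f"enchant dictionary {tag!r} is not installed")
    return enchant.Dict(tag).check


def explain(lines: Iterable[str], check: Callable[[str], bool],
            tag: str = "de_DE_OLDSPELL") -> List[str]:
    """Mark I/O words accepted by `check` as OK=<tag>; leave the rest as is."""
    out = []
    for line in lines:
        # rsplit keeps any ':' inside the word itself intact.
        word, count, code, status = line.rstrip("\n").rsplit(":", 3)
        if code in ("I", "O") and status == "X" and word and check(word):
            status = f"OK={tag}"
        out.append(":".join((word, count, code, status)))
    return out
```

With the dictionary installed, `explain(open("all.txt", encoding="utf-8"), load_checker())` would yield the lines of all1.txt under these assumptions.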

I am imagining that eventually all words will be explained or corrected by as-yet-unknown steps, leading to all2.txt ... alldone.txt.

One can easily filter all1.txt on subcategories (e.g. :[IO]: for the supposed German words.)
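
As a small sketch of such a filter (assuming the colon-separated line format described above; the function name is hypothetical):

```python
import re
from typing import Iterable, List

# Matches the code field of the supposed German words (types I and O).
GERMAN_CODE = re.compile(r":[IO]:")


def german_candidates(lines: Iterable[str],
                      unexplained_only: bool = False) -> List[str]:
    """Keep lines of type I or O, optionally only those still marked X."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not GERMAN_CODE.search(line):
            continue
        if unexplained_only and not line.endswith(":X"):
            continue
        kept.append(line)
    return kept
```

Calling it with `unexplained_only=True` gives just the words the dictionary did not accept.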

I'm sure some of those 5648 unexplained 'O' German words can be explained as some kinds of compounds. There may be a way, that I don't know, to do this with enchant. Absent that, maybe Marcis can suggest some patterns of German.
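
One naive way to test the compound idea, independent of whatever enchant itself may offer, is to try every split point and ask the dictionary about both halves. The sketch below is only a heuristic under stated assumptions: the function name, the minimum part length and the list of linking elements ("s", "es", "n", "en") are my own choices, and it will produce some false positives.

```python
from typing import Callable, Optional, Sequence, Tuple


def split_compound(word: str, check: Callable[[str], bool], min_len: int = 3,
                   linkers: Sequence[str] = ("", "s", "es", "n", "en"),
                   ) -> Optional[Tuple[str, str, str]]:
    """Return (head, linker, tail) if both parts are known words, else None."""

    def known(part: str) -> bool:
        # German nouns are capitalized in the dictionary, so try both forms.
        return check(part) or check(part.capitalize())

    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        if not known(head):
            continue
        for link in linkers:
            if not rest.startswith(link):
                continue
            tail = rest[len(link):]
            if len(tail) >= min_len and known(tail):
                return head, link, tail
    return None
```

Here `check` would be something like `enchant.Dict("de_DE_OLDSPELL").check`. A word such as wunderreich would then be explained as wunder + reich, while a genuine misprint such as Wsserkrug would stay unexplained.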

The choice of de_DE_OLDSPELL was an excellent one. For instance, the enchant logic properly interpreted the 'suffix' information present in de_DE_OLDSPELL.dic and de_DE_OLDSPELL.aff -- not a trivial task.

gasyoun commented 10 years ago

Jim, it's a miracle indeed. I only wonder if we can get the contexts or make it linkable. For that I would need to know the headword of the entry in which the wrong word was found. Otherwise I checked http://www.sanskrit-lexicon.uni-koeln.de/scans/CCSScan/2014/web/webtc2/index.php for Wgschaffen and of course it's wrong. Can I ask you to give me only the O words that are not OK=de_DE_OLDSPELL? My Excel sorting skills when I'm on the road with my laptop are miserable. I see literally hundreds of mistakes that I can fix even without looking in the book. But this dictionary is not top priority, so let's get back to PWK, PWG, MW, SCH, VCP. I'll think about how to do this as painlessly as possible in my spare time, after we have some good news on the MW verb lexnorm update.

wssenskundig    1   O   X
Wsserkrug   1   O   X
wunderreich 1   O   X
Wunderthat  1   O   X
Wunderthäter   1   O   X
wunderthätig   2   O   X
Wundmachen  1   O   X
wunschentsprechend  3   O   X
wunscherfüllend    2   O   X
Wunschgeborene  1   O   X
wunschgeschirrt 1   O   X
wunschgewährend    2   O   X
Wurfgeschoss    4   O   X
Wurzelschooß   1   O   X
Wurßcheibe 2   O   X

Many are totally OK, like Wunschgeborene, and I wonder why they are not marked as OK. Still, we have found Wsserkrug, so it's a good way to clean up old German texts in batch mode. And that is exactly our case.

Klassische deutsche Rechtschreibung #18