Open gasyoun opened 10 years ago
See https://dl.dropboxusercontent.com/u/29859999/ccs_all1.zip
There happens to be a Python module pyenchant (https://pythonhosted.org/pyenchant/) which allows easy access to xspell compatible dictionaries, such as de_DE_OLDSPELL . Hurray!
It seems to give good results, and there appears to be no need for concern about ISO-8859-1 encoding.
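For readers who want to try this, here is a minimal sketch of how a word might be checked with pyenchant. It assumes pyenchant and a de_DE_OLDSPELL dictionary are installed; the helper names `make_checker` and `status` are made up for illustration and are not taken from the script that produced the results in this thread.

```python
def make_checker(tag="de_DE_OLDSPELL"):
    """Return a word -> bool function backed by enchant, or None if unavailable.

    Assumption: pyenchant is installed and a dictionary named `tag` is visible to it.
    """
    try:
        import enchant  # pyenchant; the import fails if the enchant C library is missing
    except ImportError:
        return None
    if not enchant.dict_exists(tag):
        return None
    return enchant.Dict(tag).check


def status(word, check, tag="de_DE_OLDSPELL"):
    """Map a spell-check result to the status strings used in the summary tables."""
    return "OK=" + tag if check(word) else "X"


if __name__ == "__main__":
    check = make_checker()
    if check is None:
        print("enchant or the de_DE_OLDSPELL dictionary is not installed")
    else:
        for w in ("Wunschgeborene", "Wsserkrug"):
            print(w, status(w, check))
```

Passing the checker in as a plain function keeps `status` usable with any other word list, for example a Python set.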
Here, in short, is how this dictionary has been used thus far:
<word>:<count>:<code>:X (<code> is D,I,O,N) (X means not-explained)
There are 62940 lines in all.txt.
Summary for words of type 'D'
+--------+-------+
| Status | Freq  |
+--------+-------+
| Total  | 36999 |
| X      | 36999 |
+--------+-------+
Summary for words of type 'I'
+-------------------+------+
| Status            | Freq |
+-------------------+------+
| Total             | 2284 |
| X                 |  794 |
| OK=de_DE_OLDSPELL | 1490 |
+-------------------+------+
Summary for words of type 'O'
+-------------------+-------+
| Status            | Freq  |
+-------------------+-------+
| Total             | 23204 |
| X                 |  5648 |
| OK=de_DE_OLDSPELL | 17556 |
+-------------------+-------+
Summary for words of type 'N'
+--------+------+
| Status | Freq |
+--------+------+
| Total  |  453 |
| X      |  453 |
+--------+------+
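The per-type counts above could be reproduced with something like the following sketch. It assumes each line has the colon-separated form word:count:code:status, where status is either X or a string such as OK=de_DE_OLDSPELL; the function name `summarize` is hypothetical.

```python
from collections import Counter, defaultdict


def summarize(lines):
    """Count how many lines carry each status, grouped by type code (D, I, O, N)."""
    tallies = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so an odd character in the word cannot shift the fields.
        word, count, code, status = line.rsplit(":", 3)
        tallies[code][status] += 1
    return tallies


if __name__ == "__main__":
    sample = ["Wsserkrug:1:O:X", "wunderreich:1:O:X", "agni:5:D:X"]
    for code, counter in summarize(sample).items():
        print(code, "Total", sum(counter.values()), dict(counter))
```

The "Total" row of each table is then just the sum of the status counts for that code.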
I am imagining that eventually all words will be explained or corrected by as yet unknown steps, leading to all2.txt ... alldone.txt
One can easily filter all1.txt on subcategories (e.g. :[IO]: for the supposed German words.)
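As a rough illustration of that kind of filtering, here is a small sketch using a regular expression. It assumes the colon-separated line format described above; the names `GERMAN`, `UNEXPLAINED_O`, and `filter_lines` are invented for this example.

```python
import re

# Lines whose type code is I or O (the supposed German words).
GERMAN = re.compile(r":[IO]:")
# Lines of type O that are still unexplained (status X at end of line).
UNEXPLAINED_O = re.compile(r":O:X$")


def filter_lines(lines, pattern):
    """Return the lines (stripped of trailing whitespace) that match `pattern`."""
    return [line.rstrip() for line in lines if pattern.search(line.rstrip())]


if __name__ == "__main__":
    sample = [
        "Wsserkrug:1:O:X",
        "Haus:3:O:OK=de_DE_OLDSPELL",
        "agni:5:D:X",
    ]
    print(filter_lines(sample, GERMAN))
    print(filter_lines(sample, UNEXPLAINED_O))
```

To run it over the real file, one would pass `open("all1.txt", encoding="utf-8")` as `lines`, assuming the file is stored as UTF-8.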
I'm sure some of those 5648 unexplained 'O' German words can be explained as some kinds of compounds. There may be a way, that I don't know, to do this with enchant. Absent that, maybe Marcis can suggest some patterns of German.
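One naive way to look for compounds, sketched here purely as an assumption and not as something enchant offers out of the box, is to try every split point and accept the word if both halves are known. The function name `split_compound` and the `min_part` parameter are hypothetical, and `is_known` can be any word -> bool function (for instance the `check` method of an enchant dictionary). Linking elements such as an inserted -s- or -n- are not handled.

```python
def split_compound(word, is_known, min_part=3):
    """Return (head, tail) if `word` splits into two known parts, else None.

    Both halves are tried as written and capitalized, since German nouns are
    capitalized in the dictionary but appear lowercased inside a compound.
    """
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if not (is_known(head) or is_known(head.capitalize())):
            continue
        for candidate in (tail, tail.capitalize()):
            if is_known(candidate):
                return head, candidate
    return None


if __name__ == "__main__":
    # A toy word list standing in for a real dictionary lookup.
    known = {"Wunder", "reich", "Krug"}.__contains__
    print(split_compound("wunderreich", known))
    print(split_compound("Wsserkrug", known))
```

A word like Wsserkrug would still come out as unexplained, which is what we want, since its first half is a genuine typo.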
The choice of de_DE_OLDSPELL was an excellent one. For instance, the enchant logic properly interpreted the 'suffix' information present in de_DE_OLDSPELL.dic and de_DE_OLDSPELL.aff -- not a trivial task.
Jim, it's a miracle indeed. I only wonder if we can get the contexts or make it linkable. For that I would need to know the headword under which the wrong word I found occurs. Otherwise, I checked http://www.sanskrit-lexicon.uni-koeln.de/scans/CCSScan/2014/web/webtc2/index.php for Wgschaffen and of course it's wrong. Can I ask you to give me only the O words that are not OK=de_DE_OLDSPELL? My Excel sorting skills are miserable when I'm on the road with my laptop. I see literally hundreds of mistakes that I can fix even without looking in the book. But this dictionary is not top priority, so let's get back to PWK, PWG, MW, SCH, VCP. I'll think about how to do this as painlessly as possible in my spare time, after we have some good news on the MW verb lexnorm update.
wssenskundig 1 O X
Wsserkrug 1 O X
wunderreich 1 O X
Wunderthat 1 O X
Wunderthäter 1 O X
wunderthätig 2 O X
Wundmachen 1 O X
wunschentsprechend 3 O X
wunscherfüllend 2 O X
Wunschgeborene 1 O X
wunschgeschirrt 1 O X
wunschgewährend 2 O X
Wurfgeschoss 4 O X
Wurzelschooß 1 O X
Wurßcheibe 2 O X
Many are totally OK, like Wunschgeborene, and I wonder why they are not marked as OK. Still, we have found Wsserkrug, so it's a good way to clean up old German texts in batch mode. And that is exactly our case.
@funderburkjim Per https://github.com/sanskrit-lexicon/CORRECTIONS/issues/8#issuecomment-59296504 request.
Marcis - do you have a 'German word list' (a digital German dictionary or word list) that might be used to kick out candidates for mis-spelled German words in PW, PWG, CCS?
No, I do not, but yes, let's start the trip. I found exactly what we are looking for: German as of 1901, a list of 235298 words from the German old spelling dictionaries (Klassische deutsche Rechtschreibung) in .OXT format.
The encoding is broken, similar to http://stackoverflow.com/questions/1344692/i-need-help-fixing-broken-utf8-encoding. I emailed Bjoern Jacke and Franz Michael Baumann about the encoding used. Reply:
So http://stackoverflow.com/questions/3990700/iso-8895-1-to-xml-acceptable-utf-8 should work. We do not need to go the route of http://askubuntu.com/questions/72099/how-to-install-a-libreoffice-dictionary-spelling-check-thesaurus, because https://www.sublimetext.com/forum/viewtopic.php?f=3&t=6127 did the job.
The journey starts at https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/dict-de_de-1901_oldspell_2014-02-21 - I hope I'll have a UTF-8 compatible list in a short while.
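For anyone who prefers a script to an editor, the re-encoding step could look roughly like this. It is only a sketch and assumes the dictionary files really are ISO-8859-1; the function names and any file paths passed to them are hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8.

    Assumption: the input really is ISO-8859-1. Every byte value is valid in
    that encoding, so a wrong guess produces wrong characters, not an error.
    """
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Read `src` as raw bytes and write a UTF-8 copy to `dst`."""
    with open(src, "rb") as f:
        raw = f.read()
    with open(dst, "wb") as f:
        f.write(latin1_to_utf8(raw))


if __name__ == "__main__":
    # 0xE4 is "ä" in ISO-8859-1.
    print(latin1_to_utf8(b"Wunderth\xe4ter").decode("utf-8"))
```

If the .aff file declares its encoding in a SET line, that line would presumably need to be updated as well after conversion.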