sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

KCH corrections from faultfinder #72

Closed gasyoun closed 2 years ago

gasyoun commented 9 years ago

@funderburkjim can I ask you to help me with a file similar to the series published before? I'll do all the checking; it's needed for a new edition of a Sanskrit-Russian dictionary. It's in SLP1. Panchavargama conventions might be different from MW's, but still. I documented all the changes made to the source .xls file, which is based on web scraping of the printed book:

1) Deleted all acute accent marks ´ (4904)

2) Deleted the composita-splitting hyphen - (7284)

3) change-replaced all /

4) converted from IAST to SLP1

5) deleted a single case of dirt

8) The draft file for faultfinder is at https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/Kochergina-1987_29007.txt

funderburkjim commented 9 years ago

@gasyoun You'll have to be more specific. I'm not sure what your objective is. What is the input required, and what is the output you're aiming for?

Incidentally, at the moment I think the Kochergina-1987_29007.txt file should NOT be part of the CORRECTIONS repository. I'd like to reserve that repository for the actual corrections made to dictionaries, and material closely related to corrections (like the working documents Sampada uses to generate corrections to missing data, etc.)

funderburkjim commented 9 years ago

@gasyoun You'll have to tell me more if you want me to help in some way.

drdhaval2785 commented 9 years ago

@gasyoun The ball is in your court

gasyoun commented 9 years ago

@funderburkjim - the objective is to clean the only big open source Sanskrit-Russian dictionary (29k words only), mainly by comparing it to MW, PWG and PWK. I suppose there are at least 200 mistakes in the headwords. Input: https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/Kochergina-1987_29007.txt Output: similar to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/104 or http://drdhaval2785.github.io/o_vs_O/output1/MW.html, i.e. one of the correction methods already in use, applied here. As for "should NOT be part of the CORRECTIONS repository": I agree, but as I want to use the same cleaning methods, let it stay there for a while.

funderburkjim commented 9 years ago

re 'should NOT be part of the CORRECTIONS repository' --- sounds like I was being needlessly pedantic. Fine to keep it wherever.

Looking at your input, it seems to be a list of SLP1 spelled headwords. One easy thing to do would be for me to write a program to see which spellings appear as key1 in MW, and to kick out the exceptions. If you agree this is useful first step, I'll do that.

gasyoun commented 9 years ago

I agree. MW was her main source, PWG, PWK, AP90, MD - less.

funderburkjim commented 9 years ago

Have completed first pass, comparing to MW. 4622 of the 29007 words are NOT MW headwords. kcp-mw.txt adds a Y/N to each word, indicating whether it is in MW or not. Some possible reasons for words in the 4622 non-MW category:

gasyoun commented 9 years ago

Thanks a lot. Can we write some rules for treating -vant like -vat? This issue is wider, but rules should help to find the real differences, and not just differences between Indian and European standards or in orthography. It's not only KCH vs. MW. Same issue with PWG vs. MW. Can we add some rules, please?

funderburkjim commented 9 years ago

I didn't offer to do the whole project for you, only to provide technical assistance. It's up to you to derive useful rules. Your job is to make my life as easy as possible in this task. So take some time thinking through what you need. Maybe your programmers could help implement your ideas?

gasyoun commented 9 years ago

Sounds realistic, let me try. Where can I see the code used for the comparison, so my coders can catch up on the work, Jim?

funderburkjim commented 9 years ago

I put the python program compare-MW.py in the same Gist as above. The 'Usage' comments show how it is run.

You already have Kochergina-1987_29007.txt.

The other ingredient is a list of MW keys, for which I used a file that you can get by the following curl command:

curl -o extract_keys_b.txt 'http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/mwaux/mwkeys/extract_keys_b.txt'

This should get you started.
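
For readers without access to the Gist, the comparison described above could be sketched roughly as follows. This is not compare-MW.py itself. It assumes both inputs have been reduced to one SLP1 spelling per line (the real layout of extract_keys_b.txt may differ and may need parsing first), and the names load_keys and mark_headwords are hypothetical.

```python
def load_keys(lines):
    """Collect non-empty, stripped lines into a set for fast lookup.

    Assumption: one SLP1 key per line; adapt if the real file has more fields.
    """
    return {line.strip() for line in lines if line.strip()}


def mark_headwords(headwords, mw_keys):
    """Pair each headword with 'Y' if it is an MW key, otherwise 'N'."""
    return [(word, "Y" if word in mw_keys else "N") for word in headwords]


if __name__ == "__main__":
    # Toy data standing in for the two real files.
    mw_keys = load_keys(["vikalpa", "vikalpavat", "viS"])
    kch_words = ["vikalpa", "vikalpavant", "viS"]
    for word, flag in mark_headwords(kch_words, mw_keys):
        print(word, flag)
```

With real data, the lists would be read from Kochergina-1987_29007.txt and the downloaded key file, for example with open(path, encoding="utf-8").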

gasyoun commented 9 years ago

Great, before you let me go - can we have a fuzzy comparison of MW vs. KCH?
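
One way such a fuzzy comparison might be approached is with the similarity matching in Python's standard difflib module. The sketch below is only an illustration under that assumption, not a method used in this thread; the function name closest_mw_keys and the cutoff of 0.8 are hypothetical choices.

```python
import difflib


def closest_mw_keys(word, mw_keys, n=3, cutoff=0.8):
    """Return up to n MW keys whose similarity ratio to word is >= cutoff.

    The keys are sorted first so the candidate order is deterministic.
    """
    return difflib.get_close_matches(word, sorted(mw_keys), n=n, cutoff=cutoff)


if __name__ == "__main__":
    mw_keys = {"vikalpa", "vikalpavat", "viS"}
    print(closest_mw_keys("vikalpavant", mw_keys))
```

Matching every headword against a large key list this way is likely to be slow, so it would probably be applied only to the words already flagged as not in MW.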

funderburkjim commented 8 years ago

Second phase of analysis of Kochergina headword spelling begun.

The results are three files, in this gist

This log file also shows, at the bottom, statistics regarding solutions

@gasyoun Let me know if this form is what you had in mind. I know there are some additional rules, involving nasals, that can be applied once we agree on this form.

gasyoun commented 8 years ago

Thanks, Jim. What I still hardly understand are cases like

Covered now with the rules, like the ones at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/72#issuecomment-150351413 ?

funderburkjim commented 8 years ago

Cases like vikalpavant NOT yet handled. I'm working on additional rules at the moment, and one of those will recognize that vikalpavant is the same as mw vikalpavat.

Will post here when revisions to the Gist files are ready.

funderburkjim commented 8 years ago

The Gist files are now updated.

The summary from the log file:

2153 spellings remain unvalidated
2464 spellings validated
60 validated by method default
379 validated by method ṁ->M
536 validated by method ar->f
204 validated by method M->nasal
711 validated by method N->M
14 validated by method N->M,ar->f
560 validated by method ant$->at

ALSO, 40 records are marked as 'MALFORMED' (letters other than a-zA-Z).

So the list of questionable items has been more than cut in half.
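
The method names in the summary suggest simple spelling substitutions tried one at a time against the MW keys. The sketch below is a guess at how such rules could work, not the Gist code: the reading of each rule (for example, that 'M->nasal' turns anusvara into the class nasal of the following stop) is an assumption, the 'default' and 'ṁ->M' methods are left out because their meaning is not stated here, and rules are not chained as in 'N->M,ar->f'.

```python
import re

# SLP1 class nasals, keyed by the stop consonants they precede.
HOMORGANIC = {
    **dict.fromkeys("kKgG", "N"),
    **dict.fromkeys("cCjJ", "Y"),
    **dict.fromkeys("wWqQ", "R"),
    **dict.fromkeys("tTdD", "n"),
    **dict.fromkeys("pPbB", "m"),
}


def m_to_nasal(word):
    """Replace anusvara M with the class nasal of the following stop, if any."""
    return re.sub(
        r"M(.)",
        lambda m: HOMORGANIC.get(m.group(1), "M") + m.group(1),
        word,
    )


# (method name, transformation) pairs; the names follow the log summary,
# the transformations are assumed readings of those names.
RULES = [
    ("ant$->at", lambda w: re.sub(r"ant$", "at", w)),
    ("ar->f", lambda w: w.replace("ar", "f")),
    ("N->M", lambda w: w.replace("N", "M")),
    ("M->nasal", m_to_nasal),
]


def validate(word, mw_keys):
    """Return the name of the first rule whose output is an MW key, else None."""
    for name, rule in RULES:
        candidate = rule(word)
        if candidate != word and candidate in mw_keys:
            return name
    return None


if __name__ == "__main__":
    mw_keys = {"vikalpavat", "saMkara", "kf"}
    for word in ["vikalpavant", "saNkara", "kar", "xyz"]:
        print(word, validate(word, mw_keys))
```

Returning the rule name alongside the match makes it possible to tally validations per method, as the log summary does.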

gasyoun commented 6 years ago

So the list of questionable items has been more than cut in half.

Thanks, Jim. I will get back to it as well. I value your help a lot.

The author died and was cremated today, see https://groups.google.com/forum/#!topic/bvparishat/VOZBNUKdFuQ

gasyoun commented 2 years ago

So the list of questionable items has been more than cut in half.

Is there a clean output list, containing only the Kochergina headwords in MW-style spelling, @funderburkjim ? Please, please, please.

Sonnetag commented 2 years ago

Hello,

This is about the broken links in Monier-Williams 1899. As you can see in the screen shot below, there are links to Whitney and Westergaard Dhatupatha. Both links are broken (Page not Found error message). Can anyone please restore the links, so we can look them up when necessary?

Thank you, Youngsinn

funderburkjim commented 2 years ago

@Sonnetag

Did you forget to copy the screen shot into a comment?

Is the problem happening for just one headword? If so, what is the headword?

Sonnetag commented 2 years ago

See the screen shot below. Type any word in Monier-Williams 1899 and click the link to check.

No, it is not just one headword. For all headwords, the links to Whitney Roots and Westergaard Dhatupatha are broken.

Youngsinn

funderburkjim commented 2 years ago

@Sonnetag For some reason, I do not see a screen shot in your comment.

Here's the display of 'viS', and the links seem to work. Here are two screen shots. First, the display in MW Advanced Search:

image

Second, result when I click on Whitney roots viS (slp1) (after I scroll down to 'viS') image

At this stage, I cannot replicate your problem. More information will be needed. What kind of computer are you using? (desktop, laptop, phone? -- Windows, Mac, Linux?) Which display are you using (advanced search, basic, etc or simple-search?).

And I need to see an image of what appears on your screen when you click the Whitney roots link.

Sonnetag commented 2 years ago

Ok Jim, this time I attached the image of the screen shots.

I am using Windows 10 installed on a Mac. It is a desktop. I tried विश् and the link did not work on my machine. It had been working all along, but at some point (I am not sure exactly when, but probably some time last year) the links stopped working.

I think I am using basic search. (Screen shot also attached.)

funderburkjim commented 2 years ago

@Sonnetag Hi, Youngsinn - I see your screenshot https://user-images.githubusercontent.com/6393033/158868880-3ef529c7-58dc-43f4-ac83-7fcb820d9d1d.png. The image shows you are using advanced search for mw (1899), have looked up the word 'viS' (slp1). You have underlined in red the link to Whitney roots, and I understand that (somehow) this link is not working for you.

I just tried the same thing on my desktop computer, and the link works fine. Here is a screen shot of the page from Whitney roots that appeared when I clicked the link: image

My desktop is a Windows 11 pc.

Suggestion: If possible, just run your mac as a Mac, and use Safari or Chrome or Firefox, and see what happens.

I don't have a Mac, much less a Mac with Windows 10. I cannot duplicate your computing environment.

Sonnetag commented 2 years ago

Jim,

It is apparent that it has to do with my computer setup. I found that when I access MW from 'Inflected form lookup', the link to Whitney roots works! I bought a hard copy of Whitney's book and also downloaded a PDF file of the book from Archive, so I have what I need.

Thank you, Youngsinn

funderburkjim commented 2 years ago

@Sonnetag Thanks for feedback. Glad you found a workaround. Will consider this issue closed now.