`o` vs `O` - Githubissues

gasyoun commented 9 years ago

@funderburkjim Can we make a list of words with o vs O? If you would be able to make a video, I would be able to learn and become 5 g smarter.

To have both

sUpodanazazWIpUjA:PW f.  Titel
sUpOdanazazWIpUjA:MW f. N. of wk.

seems fishy to me.
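
A minimal sketch of how such a list of `o` vs `O` candidates could be produced. It assumes a headword list with one `headword:DICT1,DICT2` entry per line in SLP1; that input format and the function name `find_o_vs_O_pairs` are assumptions for illustration, not the project's actual script. Spellings are grouped under a key in which `O` is folded onto `o`, and any key with more than one spelling is reported.

```python
from collections import defaultdict

def find_o_vs_O_pairs(lines):
    """Group SLP1 headwords that differ only in o vs O.

    Each line is assumed to look like 'headword:DICT1,DICT2'.
    Returns {folded_key: {spelling: [dicts]}} for keys with >1 spelling.
    """
    groups = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        headword, dicts = line.split(":", 1)
        key = headword.replace("O", "o")  # fold O onto o
        groups[key].setdefault(headword, []).extend(dicts.split(","))
    return {k: v for k, v in groups.items() if len(v) > 1}

# Toy input; the first two headwords are the pair quoted above.
sample = [
    "sUpodanazazWIpUjA:PW",
    "sUpOdanazazWIpUjA:MW",
    "gO:MW,PW",
]
suspects = find_o_vs_O_pairs(sample)
for key, spellings in suspects.items():
    print(key, spellings)
```

A headword attested with only one spelling (like `gO` in the toy input) is not reported, so the output is a list of candidates for manual checking, not a list of errors.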

gasyoun commented 9 years ago

Very APIsh indeed @funderburkjim

drdhaval2785 commented 9 years ago

@funderburkjim You need to provide the corrected sanhw1 for faultfinder to proceed with this proposal. Waiting. The last corrected version of sanhw1.txt in this repository is a month old.

gasyoun commented 9 years ago

@funderburkjim is all we can hope on.

Shalu411 commented 9 years ago

Yes. Me ready for taking up the task now.

funderburkjim commented 9 years ago

sanhw1 is brought up to date.

gasyoun commented 9 years ago

435k word-forms are ready for your attention, @drdhaval2785

drdhaval2785 commented 9 years ago

Work started. It is painfully slow, even on commandline. Will post soon

drdhaval2785 commented 9 years ago

Work over. Total of 5858 suspect entries found by this method.

The output files are available at https://github.com/drdhaval2785/SanskritSpellCheck/tree/master/o_vs_O/output1 For viewing HTML files on github.io, please see http://drdhaval2785.github.io/o_vs_O/output1/AP.html. (Change the file name with appropriate dictionary abbreviation e.g. MW.html, PWG.html etc)

@funderburkjim You need to answer https://github.com/sanskrit-lexicon/CORRECTIONS/issues/45#issuecomment-96418761 before @Shalu411 can start her work. If the format is OK for you, she can start her work.

@Shalu411 Please remember the instruction

Note: 
Please focus only on the corrections in the dictionary under consideration.
If you see any errors in the dictionary other than the one you are dealing with, leave it.
You will encounter it in the dictionary concerned. We will treat it there.

Best luck team

funderburkjim commented 9 years ago

@drdhaval2785 धव्हल->धवल format should be fine for indicating corrections - I should be able to convert the Devanagari back to slp1 which will be what updates actually need.

@drdhaval2785 @Shalu411 I have not followed this issue closely enough to know the workflow details Shalu will be using. Thus, I request that at some early stage of the work, Shalu or someone send me a sample of the work product, so I can be sure that all the details I need are available and amenable to programmatic use.
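
The conversion of Devanagari back to SLP1 mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the conversion program actually used for the dictionaries: the function name `deva_to_slp1` is hypothetical, and the tables cover only the basic letters (no accents, numerals, or special handling of retroflex `L` clusters).

```python
# Devanagari -> SLP1 for plain headwords (basic letters only).
CONSONANTS = dict(zip(
    "कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहळ",
    "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshL",
))
VOWELS = dict(zip("अआइईउऊऋॠऌॡएऐओऔ", "aAiIuUfFxXeEoO"))
VOWEL_SIGNS = {  # dependent vowel signs (matras), written as escapes
    "\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
    "\u0942": "U", "\u0943": "f", "\u0944": "F", "\u0962": "x",
    "\u0963": "X", "\u0947": "e", "\u0948": "E", "\u094b": "o",
    "\u094c": "O",
}
OTHER_SIGNS = {"\u0902": "M", "\u0903": "H", "\u0901": "~", "\u093d": "'"}
VIRAMA = "\u094d"
ZERO_WIDTH = {"\u200c", "\u200d"}  # ZWNJ / ZWJ carry no sound

def deva_to_slp1(text):
    chars = [c for c in text if c not in ZERO_WIDTH]
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if c in CONSONANTS:
            out.append(CONSONANTS[c])
            if nxt in VOWEL_SIGNS:      # consonant + vowel sign
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:         # bare consonant
                i += 1
            else:                       # inherent 'a'
                out.append("a")
        elif c in VOWELS:
            out.append(VOWELS[c])
        elif c in OTHER_SIGNS:
            out.append(OTHER_SIGNS[c])
        else:
            out.append(c)               # pass through anything unknown
        i += 1
    return "".join(out)

old, new = "धव्हल->धवल".split("->")
print(deva_to_slp1(old), "->", deva_to_slp1(new))
```

Unknown characters are passed through unchanged so that anything the tables do not cover stays visible in the output instead of being silently dropped.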

drdhaval2785 commented 9 years ago

@funderburkjim Let me recapitulate the whole thread for you. Logic behind the approach - https://github.com/sanskrit-lexicon/CORRECTIONS/issues/45#issuecomment-92870513 User instruction - https://github.com/sanskrit-lexicon/CORRECTIONS/issues/45#issuecomment-96417503

This generated two output files: an HTML file and a TXT file.

In the HTML file:
- Column 1 - index
- Column 2 - SLP1 headword in the dictionary under consideration
- Column 3 - SLP1 headword having the nearest match in other dictionaries
- Column 4 - Devanagari headword in the dictionary under consideration
- Column 5 - Devanagari headword having the nearest match in other dictionaries
- Column 6 - Link to the PDF having the headword in the dictionary under consideration
- Column 7 - Link to the PDF of the nearest match in other dictionaries

@Shalu411 You are supposed to have a look at the column 4 and 5 and decide whether there is any apparent error in column 4 headword. If you think so, click on column 6 and 7 PDF links to verify from the dictionary. If you find any error, correct it in the TXT file. e.g. अभिसंपत->अभिसंपत्‌ in AP.txt

In case there is no error - अभिसंपत->NO CHANGE

@funderburkjim will programmatically correct the errors found by this approach.

Does this proposal sound good Jim and Shalu?
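
The TXT convention described above (`old->new` for a correction, `old->NO CHANGE` otherwise) could be read programmatically along these lines. This is a sketch only; the function name `parse_correction` and the sample lines are assumptions for illustration.

```python
def parse_correction(line):
    """Split 'old->new' into (old, new); new is None for 'NO CHANGE'."""
    old, sep, new = line.strip().partition("->")
    if not sep:
        raise ValueError(f"missing '->' in {line!r}")
    new = new.strip()
    return old.strip(), (None if new == "NO CHANGE" else new)

# Hypothetical lines in the agreed format.
lines = [
    "अभिसंपत->NO CHANGE",
    "सूनू->सूनु",
]
# Keep only the lines where a correction was actually entered.
changes = [pair for pair in map(parse_correction, lines) if pair[1] is not None]
print(changes)
```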

gasyoun commented 9 years ago

@Shalu411 do you need the ID in the TXT file to find words quicker? I mean, comparing HTML and TXT can be hard. NO CHANGE you can just copy-paste, no need to type it every time, because there might be hundreds of cases.

@drdhaval2785 longer words should come first. Please see https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/fuzzy_apte_new_27_11_2013.xlsx for an example. Longer words come first, so you first kill off the most probable cases. In a 15-character word the chance is high that the other, wrong word is the same word. When we deal with 2-letter words, all the comparisons are false positives. In the original XLS file there is also sorting by the letter that has changed. And it's red, that is, marked in the list so it's easy for the eye to catch. @Shalu411 has already worked with the file and some of the requirements are hers.

I'm experimenting with http://www.listjs.com/ to see what else I can add - we can even have Search on the page, and with 1k+ lines per dictionary that is important. Please, please, please. I'll do my best to improve the UI. Otherwise

सूनू->सुनु
सूनू->सुनू
सूनू->सूनु

does not make much sense. Does it?

http://drdhaval2785.github.io/o_vs_O/output1/MW.html
191 hrum hrUm ह्रुम् ह्रूम् MW PW,PWG
192 aMSAMsi aMSAMSi अंशांसि अंशांशि MW GST,SHS,WIL

What is the logic between the split?

Converting (L) issues remain: 208 prahlanniH prahLanniH प्रह्लन्निः प्रह्Lअन्निः AP AP90
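
The ordering asked for in this comment (longer words first, with the changed letter easy to spot) could be sketched as below. The helper name `changed_letters` is hypothetical, and the pairs are SLP1 examples taken from this thread; how the real HTML would highlight the letter is left open.

```python
def changed_letters(a, b):
    """Return (index, a_char, b_char) for each differing position.

    Only defined for equal-length pairs; returns None otherwise.
    """
    if len(a) != len(b):
        return None
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

pairs = [
    ("hrum", "hrUm"),
    ("sUpodanazazWIpUjA", "sUpOdanazazWIpUjA"),
    ("aMSAMsi", "aMSAMSi"),
]

# Longest headwords first: a near match on a long word is less likely
# to be a coincidence than a near match on a very short one.
ordered = sorted(pairs, key=lambda p: len(p[0]), reverse=True)
for a, b in ordered:
    print(a, b, changed_letters(a, b))
```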

funderburkjim commented 9 years ago

@drdhaval2785 @Shalu411 I've now got programs to handle the 'ap.txt' form you provided above.

Here are some issues I noticed (I converted ap.txt to slp1 before generating update records, so slp1 spelling is what is shown in these comments):

- Some (perhaps many) of the cases are clear deviations from the printed text. For instance, case 1 (aRuBU -> aRUBU). I suggest that these be marked as 'PRINT ERROR' (or some such convention), so we can put these cases, where we are purposely deviating from the text, into a corrections_factual.txt file, as we have tried to do with MW. This seems especially desirable for such a major dictionary as AP.
- For some cases, there are homonyms. Is it your intent that ALL the homonyms be changed? Probably this has to be handled on a case-by-case basis:

HOMONYMS FOUND: 67 pAriyA -> pariyA
HOMONYMS FOUND: 79 viDra -> vidra
HOMONYMS FOUND: 264 avakIrRa -> avAkIrRa
HOMONYMS FOUND: 292 Aruh -> ArUH
HOMONYMS FOUND: 294 Avid -> AviD
HOMONYMS FOUND: 302 itiH -> ItiH
HOMONYMS FOUND: 358 kavi -> kavI
HOMONYMS FOUND: 369 kUba -> kUBa
HOMONYMS FOUND: 395 giri -> girI
HOMONYMS FOUND: 396 giri -> GiRi
HOMONYMS FOUND: 420 co -> CO
HOMONYMS FOUND: 474 niryat -> niryAt
HOMONYMS FOUND: 505 pratyaBijYA -> pratyABijYA
HOMONYMS FOUND: 550 maYju -> maYjU
HOMONYMS FOUND: 712 aBI -> abi
HOMONYMS FOUND: 733 avAYcita -> avaYcita
HOMONYMS FOUND: 763 AsyA -> ASyA
HOMONYMS FOUND: 786 upaDiH -> upadih
HOMONYMS FOUND: 831 kUba -> kuba
HOMONYMS FOUND: 832 kUba -> kuBa
HOMONYMS FOUND: 841 kesa -> keSa
HOMONYMS FOUND: 941 druh -> druH
HOMONYMS FOUND: 945 DanI -> Dani
HOMONYMS FOUND: 960 nirAsaH -> nirasaH
HOMONYMS FOUND: 961 nirAsaH -> nirASaH
HOMONYMS FOUND: 1051 masI -> maSi
HOMONYMS FOUND: 1105 varI -> vari
HOMONYMS FOUND: 1150 SAlu -> Salu

In the case of pAriyA -> pariyA (case 67), there were several concerns:
- The Devanagari of the text clearly shows 'pAri...'
- The IAST in the body of the text also shows long 'pAriyA', so I doubt the correction to 'pariyA'
- There are two homonyms, and it seems that actually the headwords are pAriyAtraH and pAriyAtrikaH.
- I am not sure how to interpret the parenthetical 'pA' in these two cases.
Given these random observations, I am not sure of the principles on which the corrections (especially those that deviate from the printed text) are being made.
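
A rough sketch of how a homonym report like the one above could be generated, assuming a homonym simply means a headword that occurs more than once in the dictionary's headword list. That definition, the function name `report_homonyms`, and the toy headword list are assumptions for illustration; the actual program may work differently.

```python
from collections import Counter

def report_homonyms(headwords, corrections):
    """Flag corrections whose old headword occurs more than once.

    headwords:   all headwords of one dictionary, duplicates included
    corrections: iterable of (case_number, old, new) in SLP1
    """
    counts = Counter(headwords)
    return [
        f"HOMONYMS FOUND: {case} {old} -> {new}"
        for case, old, new in corrections
        if counts[old] > 1
    ]

# Toy data: 'pAriyA' appears twice, so a change to it is ambiguous.
headwords = ["aRuBU", "pAriyA", "pAriyA"]
corrections = [(1, "aRuBU", "aRUBU"), (67, "pAriyA", "pariyA")]
report = report_homonyms(headwords, corrections)
print("\n".join(report))
```

Flagged cases are only listed, not applied, which matches the suggestion that homonyms be decided case by case.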

drdhaval2785 commented 9 years ago

Jim, the ap.txt is not corrected. I just wanted to confirm whether you are OK with the format to apply the changes.

PRINT ERROR has to be written after a colon, right?

As regards the homonyms issue, I advise that they may be handled case by case. You can provide the list as you have done in this case. Shalu can check and decide which homonym needs correction, if any.

Shalu411 commented 9 years ago

Namaste. I have started looking at the issues. The first HTML picked is AP: https://www.dropbox.com/s/awzsxyl630gijmw/AP.html?dl=0

For now, the way of working is as follows:
- I keep open both the HTML and TXT files (TXT file link: https://www.dropbox.com/s/yqnuxqdcdi61nbk/AP.txt?dl=0 ).
- I look at the high priority list in the beginning, from bottom to top in the HTML.
- Then, if there is any notable issue, the scan link is checked.
- If no error, I LEAVE IT AS IT IS IN THE TXT FILE. I make no comment and do not write anything there.
- If an error exists, I make a comment beside the end of the right word as "; -ch", so that you can check for these easily later, using a simple program to check for "; -ch" and separate them into another TXT file.

Errors found so far are of 3 kinds, and they are solved in the TXT file as follows:

1. If it's a digitization error, and the word after the "->" option (suggested probable right word) in the TXT file is right, then just "; -ch" is written. If the word after -> is not the right one, then I erase it, enter the right word, and leave a "; -ch" after the corrected word.
2. If it's a print error, then I put "PE" after the "; -ch".
3. If any other comment is needed, like e.g. line 76:
वाधु->वाधु(धू)क्यम् ; -ch This is a double word. On opening the bracket it will give वाधुक्यम्, वाधूक्यम्. But in the digitizing, they made an entry of only वाधु out of वाधु(धू)क्यम्.

Please let me know if this is fine.
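
The "simple program" that picks out the `; -ch` lines could look roughly like this. It is a sketch under assumptions: the regular expression tolerates both `; -ch` and `;-ch`, a note beginning with `PE` is treated as a print error, and the name `parse_checked_line` is hypothetical.

```python
import re

# old->new, then the '; -ch' marker, then an optional free-text note.
CHANGE_RE = re.compile(
    r"^(?P<old>.+?)->(?P<new>.+?)\s*;\s*-ch\b\s*(?P<note>.*)$"
)

def parse_checked_line(line):
    """Return a dict for lines marked '; -ch', or None for untouched lines."""
    m = CHANGE_RE.match(line.strip())
    if m is None:
        return None  # no marker: the line was left as it is
    note = m.group("note").strip()
    return {
        "old": m.group("old").strip(),
        "new": m.group("new").strip(),
        "print_error": note.split()[:1] == ["PE"],
        "note": note,
    }

# Sample lines in the annotated format (illustrative, not the real AP.txt).
sample = [
    "kavi->kavI",
    "aRuBU->aRUBU ; -ch PE",
    "वाधु->वाधु(धू)क्यम् ; -ch This is a double word.",
]
changed = [r for r in map(parse_checked_line, sample) if r is not None]
for record in changed:
    print(record)
```

Free-text notes are kept verbatim in `note`, so cases such as the double words can still be handled by hand afterwards.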

Shalu411 commented 9 years ago

One more doubt. Do you need the snippets of the corrections? Every time I check, do I keep the checked words' snippets, to check for later? Or, is my checking it enough?

gasyoun commented 9 years ago

Yours is enough.

drdhaval2785 commented 9 years ago

No snippets. Checking is enough. I have some doubts regarding the proposed txt file though. Let Jim comment, because he has to process

drdhaval2785 commented 9 years ago

OK. I re-read the suggestion. Seems fine. As corrections are few and NO CHANGE cases are many, you have suggested tagging only the changed words. This sounds good and manageable.

Shalu411 commented 9 years ago

@drdhaval2785 Oh, thanks a lot.
@gasyoun There seem to be these double-word issues in the Apte dictionary digitization. Will need to think.
भृङ्गिरि->भृङ्गिरि(री)टिः ;-ch This is a double word. On opening the bracket it will give भृङ्गिरिटिः, भृङ्गिरीटिः. But in the digitizing, they made an entry of only भृङ्गिरि out of भृङ्गिरि(री)टिः.
@funderburkjim Your word awaited on these issues specially. I note all such entries using same kind of comment.. or will you need a format on that?

drdhaval2785 commented 9 years ago

We cannot foresee all possibility. So manual handling is the best in such cases. Your comments seem the best approach to do this

Shalu411 commented 9 years ago

OK. Thanks. :) I find that the digital dictionary on has the bracketed word in article. This is strange- भृङ्गिरि [L=25285] [p= 1209-2] भृङ्गिरि (री) टिः See भृङ्गरिटि. [Page1210-1]

funderburkjim commented 9 years ago

@Shalu411 Please post to dropbox the part of the file that you have completed. Dhaval and I will need to see what you are doing to know how to comment on the conventions you are using.

I don't understand the 'strangeness' in L=25285 example.

gasyoun commented 9 years ago

@Shalu411 let's make a new thread with Apte only cases, otherwise this topic will become a mess. It's about the strategy in general. Agree?

Shalu411 commented 9 years ago

Namaste @funderburkjim I have already given the link- Here I give again- https://www.dropbox.com/s/yqnuxqdcdi61nbk/AP.txt?dl=0 Please see at line 85 and then move above.

"I don't understand the 'strangeness' in L=25285 example."

भृङ्गिरि [L=25285] [p= 1209-2] भृङ्गिरि (री) टिः See भृङ्गरिटि. [Page1210-1]

भृङ्गिरि (री) टिः -- this should have been the original headword. But it does not appear so in the list of headwords. Only the part भृङ्गिरि appears. And then when we click the article, we see भृङ्गिरि (री) टिः. There are many such cases so far. Please see the text file for others. [Please see at line 85 and then move above.]

@gasyoun Sure. Agree. :)

funderburkjim commented 9 years ago

@Shalu411 re AP.txt --- Yes, I have seen that file, and it is fine. However, you mention several 'refinements' to that format:

If no error I LEAVE IT AS IT IS IN THE TXT FILE. I make no comment, do not write any thing there.
If error exists, then-- I make a comment beside the end of the right word as "; -ch", so that you can check for these easily later using simple program to check for "; -ch" and separate them to another txt file.
Errors found so far are 3 kinds- And they are solved in the txt file as follows-
1. If its digitization error, and word after "->" option (suggested probable right word) in the txt file is right, Then just "; -ch" is written.
If word after -> is not right one, then I erase it, and enter right word, and leave a "; -ch" after the corrected word.
2. If print error, then I put "PE" after the "; -ch"
3. If any other comment is needed, like for Eg. Line 76- 
वाधु->वाधु(धू)क्यम् ; -ch This is a double word. On opening bracket it will give वाधुक्यम्, वाधूक्यम्. But in the digitizing, they made entry of only वाधु out वाधु(धू)क्यम्.

It is those refinements (like ; -ch for instance) that I need to see in action.

Here's a suggestion: Complete the first 50 or 100 - some reasonable sample, and post that completed first batch. Does that sound ok?

gasyoun commented 9 years ago

Sounds reasonable.

Shalu411 commented 9 years ago

@funderburkjim "Complete the first 50 or 100 - some reasonable sample, and post that completed first batch. Does that sound ok?" Sure. Good idea.

Shalu411 commented 9 years ago

@gasyoun , @funderburkjim @drdhaval2785 Update now here- https://github.com/sanskrit-lexicon/CORRECTIONS/issues/117#issuecomment-106905497

gasyoun commented 8 years ago

http://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/servepdf.php?dict=MW&key=vAdyaBaRqa is impossible, because it's linked to the PDF and keywords do not matter here. None of the drdhaval2785.github.io/o_vs_O/output1/xx.html links work. @drdhaval2785 got the correct link?

Some longer words are cut and no way to see the original word. Can we unmake this ... shortening? 326

drdhaval2785 commented 8 years ago

Very long issue. Historical documentation purpose. Now the working of the logic is far far better. So now time to close the issue and send it to pure archival purpose.

gasyoun commented 8 years ago

Let's close it, if it's referred enough times. It's time, I agree.

sanskrit-lexicon / CORRECTIONS

`o` vs `O` #45