sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon
faultfinder3a #42

Closed funderburkjim closed 9 years ago

funderburkjim commented 9 years ago

This is a continuation of the discussion beginning 3 days ago in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/39.

There are two dropbox links: a. faultfinder3a at https://dl.dropboxusercontent.com/u/29859999/AllvsMW.zip

b. sanhw1 at https://dl.dropboxusercontent.com/u/29859999/sanhw1.zip

refactoring faultfinder3

faultfinder3a.php now is a command-line program which creates a text file. Then, faultfinder3a-html.php creates an html file based on the text file. Here's the invocation used:

php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php faultfinder3a-html.php AllvsMW.txt AllvsMW-new.html

The text output AllvsMW.txt contains lines like:

SIrzaCeda:VCV=aCe:WIL,YAT

This shows the suspect headword and the dictionaries it is in. The middle field shows the pattern-abbreviation-type (VCV) = actual-pattern-in-word (aCe). It should now be easy to filter this text file in various ways to generate more manageable chunks. For example, I think most of the rxx words are not errors, but occur simply due to a spelling convention that MW does not follow (MW being the reference dictionary here); nearly half of the 9399 lines of AllvsMW.txt are of the rxx type, so this seems a useful reduction of the problem. This is just one example of how the huge problem can be reduced by this refactoring.
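As an illustration of how such a line can be taken apart for filtering, here is a minimal Python sketch. It is not part of faultfinder3a (which is PHP); the names `Suspect` and `parse_suspect` are hypothetical, and it assumes every line has exactly three colon-separated fields, as in the sample line above.

```python
from typing import NamedTuple


class Suspect(NamedTuple):
    headword: str        # e.g. "SIrzaCeda"
    pattern_type: str    # e.g. "VCV"
    pattern_value: str   # e.g. "aCe"
    dictionaries: list   # e.g. ["WIL", "YAT"]


def parse_suspect(line: str) -> Suspect:
    """Split one AllvsMW.txt-style line into its fields.

    Assumes the layout headword:TYPE=value:DICT1,DICT2 seen in the sample.
    """
    headword, pattern, dicts = line.strip().split(":")
    pattern_type, pattern_value = pattern.split("=", 1)
    return Suspect(headword, pattern_type, pattern_value, dicts.split(","))


example = parse_suspect("SIrzaCeda:VCV=aCe:WIL,YAT")
```

Once the fields are separated, a filter is just a condition on `pattern_type`, `pattern_value`, or `dictionaries`.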

Another filter would be to look for words which occur ONLY in one dictionary; e.g., there are 375 such for Wilson, 230 such for MW72, 456 for CCS, only 23 remaining in CAE, etc.
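A rough Python sketch of this single-dictionary filter follows. It is illustrative only: the function name is hypothetical, the `fakeword` lines are invented test data, and it assumes a headword may appear on several lines (one per pattern), so each headword is counted once.

```python
from collections import Counter


def single_dictionary_counts(lines):
    """Count headwords that are listed under exactly one dictionary."""
    counts = Counter()
    seen = set()
    for line in lines:
        headword, _pattern, dicts = line.strip().split(":")
        names = dicts.split(",")
        if len(names) == 1 and headword not in seen:
            seen.add(headword)
            counts[names[0]] += 1
    return counts


sample = [
    "SIrzaCeda:VCV=aCe:WIL,YAT",   # from the thread: two dictionaries
    "fakeword1:VCV=aCe:WIL",       # invented line
    "fakeword1:VCV=eCa:WIL",       # invented: same headword, second pattern
    "fakeword2:VCV=aCe:CCS",       # invented line
]
counts = single_dictionary_counts(sample)
```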

We can use this issue to develop a list of useful filtered lists, and then begin working with the .html files for the various lists.

Smaller refactorings:

a. The only data source is sanhw1.txt. The list of reference headwords (for MW in the example) is obtained from sanhw1.txt, rather than from a separate MWslp.txt file.

b. The pattern and hrefyear data have been moved to a utility file faultfinder3a_utils.php, which is included in the two main php programs. The pattern data has also been put into a structure which should be easier to modify, and is easier to understand.

c. In faultfinder3a-html.php, the links in the output are now generated via JavaScript as buttons. This also makes the html file size smaller, and helps localize the programmatic construction of the links.

change sort order in sanhw1

In response to a request elsewhere, I have changed the sort order appearing in sanhw1.txt. Now, M+Consonant sorts as if it were N+Consonant, where N is the nasal for the varga of the Consonant (this does not apply to the consonants yrlvSzsh; M before these consonants is unchanged for sorting).
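The rule can be sketched in Python as a normalization step applied before comparing headwords. This is an illustration, not the code used to build sanhw1.txt. It assumes the headwords are in SLP1 (M = anusvara) and uses the standard five vargas; the function name and the sample words are hypothetical.

```python
# Map each varga consonant (SLP1) to the nasal of its varga.
VARGA_NASAL = {}
for consonants, nasal in (
    ("kKgGN", "N"),  # velars
    ("cCjJY", "Y"),  # palatals
    ("wWqQR", "R"),  # retroflexes
    ("tTdDn", "n"),  # dentals
    ("pPbBm", "m"),  # labials
):
    for consonant in consonants:
        VARGA_NASAL[consonant] = nasal


def normalize_for_sort(word):
    """Replace M before a varga consonant by that varga's nasal.

    M before y r l v S z s h, before a vowel, or at the end is unchanged.
    """
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "M" and nxt in VARGA_NASAL:
            out.append(VARGA_NASAL[nxt])
        else:
            out.append(ch)
    return "".join(out)


keys = [normalize_for_sort(w) for w in ("saMkara", "saMtAna", "aMSumant")]
```

The sketch only produces the comparison key; the Sanskrit alphabetical ordering that is then applied to those keys is not reproduced here.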

Also, after redoing sanhw1, the list is several hundred entries shorter than a week ago, due to the many CAE and other headword corrections.

drdhaval2785 commented 9 years ago

I think most of the rxx words are not errors, but occur simply due to a spelling convention that MW does not follow (MW being the reference dictionary here); nearly half of the 9399 lines of AllvsMW.txt are of the rxx type, so this seems a useful reduction of the problem. This is just one example of how the huge problem can be reduced by this refactoring.

Absolutely true, and a worthwhile refactoring too.

Another filter would be to look for words which occur ONLY in one dictionary; e.g, there are 375 such for Wilson, 230 such for MW72, 456 for CCS, only 23 remaining in CAE, etc

Yes. I informally use this to screen out no-error headwords. The logic is that if a word appears in more than one dictionary, it is less likely to be a wrong word. Makes sense.

Smaller refactorings: a. The only data source is sanhw1.txt. The list of reference headwords (for MW in the example) are obtained from sanhw1.txt, rather than from a separate MWslp.txt file. b. The pattern and hrefyear data have been moved to a utility file faultfinder3a_utils.php, which is included in the two main php programs. The pattern data has also been put into a structure which should be easier to modify, and is easier to understand. c. In faultfinder3a-html.php, the links in the output were generated via Javascript as buttons. This also makes the html file size smaller, and helps localize the programmatic construction of the links.

I would not term any of these small. They are great additions / alterations.

drdhaval2785 commented 9 years ago

Now faultfinder3a package is in the spellcheck repository.

drdhaval2785 commented 9 years ago

postprocess_scraping.php is now renamed as dictwisesorter.php.

It has been amended to work in command-line mode:

php dictwisesorter.php AllvsMW-new.html dictwiseerrors1.html

Input is read from AllvsMW-new.html; output is written to dictwiseerrors1.html.

So now the steps are

php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php faultfinder3a-html.php AllvsMW.txt AllvsMW-new.html
php dictwisesorter.php AllvsMW-new.html dictwiseerrors1.html

This will give us the list we need to work on. The file size has come down from 19 MB to 3 MB because of @funderburkjim's JavaScript wizardry, so it is more amenable to opening in a browser. Now we need to apply our minds to the filtering. Let's think it over.

1 ignore rXX

2 separate list of words occurring in single dictionary.

Let's enumerate more and code for it, so that workload decreases.

drdhaval2785 commented 9 years ago

A 3rd addition would be to ignore 'nt' at the end. I saw many of them in dictionarywiseerrors.html in CAE, e.g.

aMSumant - अंशुमन्त् - CAE, PWG, PW, STC,
akutsayant - अकुत्सयन्त् - CAE, SCH,
akurvant - अकुर्वन्त् - CAE, CCS, PW, STC,
akopayant - अकोपयन्त् - CAE, CCS, SCH,
akzaRvant - अक्षण्वन्त् - CAE, CCS, PWG, PW,
akziRvant - अक्षिण्वन्त् - CAE, SCH,
agaRayant - अगणयन्त् - CAE, CCS, SCH, STC,
aganDavant - अगन्धवन्त् - CAE,

We should ignore them. They are also a different convention followed by some dictionaries. But we should devise some way to give access to the data by either 'aMSumant' or 'aMSumat' when a user searches for this word. For a native Indian user 'nt' is too foreign. It is grammatically wrong too. But as many dictionaries have used this convention, it doesn't seem a great idea to change them to 't'. But access to the data should be there for sure.
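One way to picture the "access by either spelling" idea is a lookup that tries both forms. The Python sketch below is only an assumption about how that could work, not an existing feature of the Cologne displays; `lookup_keys` is a hypothetical name, and the rule (swap final 'nt' and 't') deliberately over-generates for words that simply end in 't'.

```python
def lookup_keys(headword):
    """Return the headword plus its alternate 'nt'/'t' spelling, if any."""
    keys = [headword]
    if headword.endswith("nt"):
        # aMSumant -> aMSumat
        keys.append(headword[:-2] + "t")
    elif headword.endswith("t"):
        # aMSumat -> aMSumant (may not exist in any dictionary)
        keys.append(headword[:-1] + "nt")
    return keys


both = lookup_keys("aMSumant")
```

A search front end could try each key in turn and merge the results, leaving the dictionaries' own spellings untouched.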

N.B. - I could not see these patterns being detected in AllvsMW-new.html (generated via faultfinder3a.php). They were there in AllvsMW.html (generated via faultfinder3.php). So, @funderburkjim, did you already make some code change to ignore such cases? I am not sure.

gasyoun commented 9 years ago

@funderburkjim well done, as usual. A few questions / thoughts before going to sleep, converted to .xls https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/AllvsMW.xls. 1) "nearly half of the 9399 lines of AllvsMW.txt are of the rxx type" there is no easy way to sort out the "rxx type" now.

sOvOryya    sauvauryya  VCV OvO SHS

It can be found only by OvO and not by ryy, or even by an RCC pattern, which would be even better. We have to develop rules and triggers. To have just CCC would not tell much, but RCC changes the game. Whether it is ryy or rll does not matter much; I hope both of you will agree. It's only about the structure.

2) "look for words which occur ONLY in one dictionary" - it's something Dhaval and I have done in the past. I would only add that there should be an additional column counting the number of sources, with color formatting in .xls or .html. Is sorting by the number of sources a good idea? I guess not; alphabetical order has its good sides. Dhaval, would it help to have the ones which have only one source with a yellow background? If only SKD and VCP have a word which none of the others have, that might be fishy as well, so even if there are two (similar) sources, that should still trigger another rule. Like only Indian dictionaries, or Indian and European sources.

3) I can read the VCCCCCV pattern, but I need to know if S in SCC is a real-life s. Or CCE - I have never seen it, and I do not think vowels should be differentiated in our case.

4) "But as many dictionaries have used this convention - doesn't seem great idea to change them to 't'. But access to data should be there for sure." - it's Böhtlingk's approach, he is to blame for it all. It's called European linguistics. So I guess interlinking takes this task too far. Then we have to interlink much more, and it grows into something gigantic which I think we are not yet ready to approach.

5) Please make a repository for the valuable sanhw1.zip file, or upload it to an existing one. Dropbox is for experiments. sanhw1.zip is no longer such, it's a huge step for mankind. Let it have its legal place.

funderburkjim commented 9 years ago

re 5) 'repository for sanhw1' - Currently, the right spot for sanhw1.txt seems to be in drdhaval2785/SanskritSpellCheck/ repository.

Question: Is there a way for me to upload just sanhw1.txt to this repository? I think I would have to clone the entire repository in the GitHub client, and then worry about syncing, etc. To me, it seems simpler to simply send a Dropbox link to Dhaval, and let him worry about the syncing.

Another possibility would be to put sanhw1 in the Corrections repository. Since I am currently responsible for maintaining Corrections, this would be straightforward. But then Dhaval would still need to get it over to where it is needed in SanskritSpellCheck.

So, I'm not sure of the best way to go.

funderburkjim commented 9 years ago

re 3) S in SCC : Here, the 'S' is not a Sanskrit letter, but means 'at the start of headword'. It is the regex '^' at the beginning of a pattern.

Similarly, 'E' in those pattern codes means 'at the end of headword', the regex '$' at end of a pattern.
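To make the reading of the codes concrete, here is a hedged Python sketch that turns a pattern code into a regular expression. The real pattern data lives in faultfinder3a_utils.php and is not shown in this thread, so everything here is an assumption: `code_to_regex` is a hypothetical name, the SLP1 vowel and consonant classes are my own choice, and counting M and H as consonants is a guess.

```python
import re

VOWELS = "aAiIuUfFxXeEoO"                            # SLP1 vowels (assumed)
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshMH"   # incl. M, H (a guess)


def code_to_regex(code):
    """Translate a code such as 'VCV', 'SCC' or 'CCE' into a regex string."""
    parts = []
    for i, letter in enumerate(code):
        if letter == "S" and i == 0:
            parts.append("^")                 # start of headword
        elif letter == "E" and i == len(code) - 1:
            parts.append("$")                 # end of headword
        elif letter == "V":
            parts.append("[" + VOWELS + "]")
        elif letter == "C":
            parts.append("[" + CONSONANTS + "]")
        else:
            raise ValueError("unexpected letter in pattern code: " + letter)
    return "".join(parts)


vcv = re.search(code_to_regex("VCV"), "SIrzaCeda").group(0)
cce = re.search(code_to_regex("CCE"), "aMSumant").group(0)
```

Under these assumptions the VCV match in SIrzaCeda is the same aCe that appears in the sample line of AllvsMW.txt.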

funderburkjim commented 9 years ago

re 2) sOvOryya sauvauryya VCV OvO SHS.

There are two ways to filter for the 'rxx' words in AllvsMW.txt.

1.  "=[^:]*r(.)\1"        This finds an rxx in the pattern value.  There are 4613 such currently.
2. "r(.)\1"     This finds an rxx anywhere (e.g., anywhere in a headword). There are 4666 of these.

So, there are only 53 cases like sOvOryya. I guess it would be better to get rid of the 4613, so that we explicitly keep cases like sOvOryya in the list of suspects.
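The two regexes can be tried out with a few lines of Python. This is only a sketch of the filtering idea: the sOvOryya line is reconstructed from the .xls row quoted above, the kAryya line is invented to show a match in the pattern value, and `split_rxx` is a hypothetical helper name.

```python
import re

RXX_IN_PATTERN = re.compile(r"=[^:]*r(.)\1")   # rxx inside the pattern value
RXX_ANYWHERE = re.compile(r"r(.)\1")           # rxx anywhere in the line


def split_rxx(lines):
    """Separate lines whose pattern value contains rxx from the rest."""
    dropped, kept = [], []
    for line in lines:
        if RXX_IN_PATTERN.search(line):
            dropped.append(line)
        else:
            kept.append(line)
    return dropped, kept


sample = [
    "sOvOryya:VCV=OvO:SHS",        # rxx only in the headword (reconstructed)
    "kAryya:VCCCV=Aryya:SHS",      # invented line: rxx in the pattern value
    "SIrzaCeda:VCV=aCe:WIL,YAT",   # no rxx at all
]
dropped, kept = split_rxx(sample)
headword_only = [line for line in kept if RXX_ANYWHERE.search(line)]
```

Here `dropped` corresponds to the 4613 kind of line and `headword_only` to the 53 cases like sOvOryya that stay in the suspect list.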

funderburkjim commented 9 years ago

@drdhaval2785 re Where are the ants in AllvsMW-new.html?

Since AllvsMW-new.html is generated from AllvsMW.txt, the question reverts to : Where are the ants in AllvsMW.txt?

Looking at sanhw1.txt, there are 23 words in MW that end in 'nt' (two of these end in 'ant').
Since the pattern 'nt at end of word' appears in the reference dictionary (MW), no word in any other dictionary will be put in the suspect list (AllvsMW.txt) on the basis of the word ending in 'nt'. (I see 18 words ending in 'nt' in AllvsMW.txt, but they are present for other patterns.)

So, in conclusion I see no problem with AllvsMW.txt, unless there is some problem of the logic of faultfinder that I'm misunderstanding.
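The logic described here can be sketched in Python as follows. This is my reading of the comment, not the actual faultfinder3a.php code: a pattern instance is treated as acceptable once any reference headword contains it. All names are hypothetical, the single CCE regex is a simplified stand-in for the real pattern table, and 'mahant' is an invented stand-in for one of the MW words ending in 'nt'.

```python
import re

# Simplified stand-in for one pattern: two non-vowels at the end of the word.
PATTERNS = {"CCE": re.compile(r"[^aAiIuUfFxXeEoO]{2}$")}


def pattern_instances(word, patterns):
    """Return the set of (pattern type, matched text) pairs found in word."""
    found = set()
    for name, regex in patterns.items():
        for match in regex.finditer(word):
            found.add((name, match.group(0)))
    return found


def suspects(reference_words, other_words, patterns):
    """Flag words having a pattern instance absent from every reference word."""
    allowed = set()
    for word in reference_words:
        allowed |= pattern_instances(word, patterns)
    result = {}
    for word in other_words:
        unseen = pattern_instances(word, patterns) - allowed
        if unseen:
            result[word] = sorted(unseen)
    return result


with_nt_reference = suspects(["mahant"], ["himavant"], PATTERNS)
without_nt_reference = suspects(["deva"], ["himavant"], PATTERNS)
```

With a reference word ending in 'nt', himavant is not flagged; without one, it is, which is the behaviour the comment expects.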

In fact, I am puzzled that (in AllvsMW.html from faultfinder3.php) himavant (for example) is viewed as suspect.

I cloned the repository and ran faultfinder3.php . I changed only one line in the program, line 228

                    fputs($outfile,givelink($dictdata[$j],$worddata[$j])."</br>\n");

adding the '\n' so the output file would have separate lines. In this rerun, only 16 lines have an 'ant ', in close agreement with AllvsMW.txt. This run uses the new version of sanhw1 (which has 2000+ 'ant' words), and the same MWslp.txt.

Probably there was some oddity in the run that generated the ants in AllvsMW.html, and it's not worth spending the time to try to understand that oddity by doing a git rollback.

drdhaval2785 commented 9 years ago

@funderburkjim

re Where are the ants in AllvsMW-new.html?

I wanted to trace the problem in order to silence it. It is silent now anyway, so we need not bother.

drdhaval2785 commented 9 years ago

@funderburkjim and @gasyoun

2) "look for words which occur ONLY in one dictionary"

I guess right now we should remove these r-c-c patterns. This would substantially reduce our file size with nothing to lose.

drdhaval2785 commented 9 years ago

@gasyoun

3) I can read VCCCCCV pattern

In most dictionaries, the longest run of consecutive consonants would be the one in kArtsnya, i.e. C(VCCCCCV).

funderburkjim commented 9 years ago

re 'remove r-c-c' patterns for now. Agreed.

drdhaval2785 commented 9 years ago

re 5) 'repository for sanhw1' - Currently, the right spot for sanhw1.txt seems to be in drdhaval2785/SanskritSpellCheck/ repository.

Let me tell you, @funderburkjim, that I may not be the only one using this file. In my view, the right place for it is the Corrections repository. I will copy and paste from the raw data on GitHub, which is not a problem. One more suggestion regarding that file: can you use \r\n instead of only \n? Windows applications like Notepad expect \r\n as the line ending. I see everything on a single line, which is difficult to read.

gasyoun commented 9 years ago

Dhaval, Notepad is not intended for that job. Use Notepad++, EditPlus, EmEditor - do not use Notepad, please. sanhw1 is not only for corrections, but let it live there if Jim agrees. It's not there yet.

funderburkjim commented 9 years ago

1. Re notepad: I heartily agree with Gasyoun that file format should not be governed by the deficiencies of Notepad. So, I don't think using the Windows line termination convention (\r\n) is a good idea.
2. I'll go along with adding sanhw1 to the Corrections repository. Will put it there the next time it is updated.

funderburkjim commented 9 years ago

sanhw1 added to CORRECTIONS repository.

drdhaval2785 commented 9 years ago

Using notepad++ now. No issue of \r\n survives now. That is closed from my side.

drdhaval2785 commented 9 years ago

As discussed in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/42#issuecomment-64976242 point 2

2 separate list of words occurring in single dictionary.

The faultfinder3a-html.php has been modified to give list of unique suspicious words by default.

Now we can do

php faultfinder3a-html.php AllvsMW.txt AllvsMW-norepeat.html

to get this. On that file we can do

php dictwisesorter.php AllvsMW-norepeat.html dictwiseerrors2.html

to get this. The data has come down to 888 kb from erstwhile 19 MB.

N.B. If you want to get the list with words repeated in dictionaries pass 1 as third argument.

php faultfinder3a-html.php AllvsMW.txt AllvsMW-repeat.html 1

gasyoun commented 9 years ago

Well done. Was not the part of the word in discussion formatted bold before?

drdhaval2785 commented 9 years ago

Point 1 of ignoring rCC pattern is also now default.

1 ignore rXX

Now it is default in faultfinder3a-html.php committed via c78e0a52134da58f42ca356af100423e7bce2e08.

This will give you this result.

Regarding @gasyoun 's comment 1) "nearly half of the 9399 lines of AllvsMW.txt are of the rxx type" there is no easy way to sort out the "rxx type" now. To be precise, AllvsMW.txt has 9398 entries, and AllvsMW-norepeat.html has around 4400 entries.

N.B. If you intend to keep the words with rCC pattern also pass 2 as third argument.

php faultfinder3a-html.php AllvsMW.txt AllvsMW-repeat.html 2

drdhaval2785 commented 9 years ago

Just to note: I have reverted faultfinder3a-html.php to use direct href links rather than buttons created via JavaScript. Reasons: 1. I want to open at least 5-6 tabs before I start exploring. In the present system, each button opens the new word in the same tab, which is not convenient to handle. 2. The file size is now around 900 kb, so direct links are much more manageable. The browser is also happy now.

As there is not much to add on this topic, let's close this.

drdhaval2785 commented 9 years ago

Well done. Was not the part of the word in discussion formatted bold before?

@gasyoun No, it was not. But now it is. http://drdhaval2785.github.io/dictwiseerrors3.html

gasyoun commented 9 years ago

Bold looks better. I would love to see IAST in addition, guess I'm the only one. And table form, not just a list.

drdhaval2785 commented 9 years ago

IAST added here. But Tables are beyond my HTML capacities.

funderburkjim commented 9 years ago

A modification of dictwisesorter was made so that the output has tables.

The output is here

Here is the program sequence:

php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php faultfinder3a-html.php AllvsMW.txt AllvsMW-norepeat.html
php dictwisesorter-v3.php AllvsMW-norepeat.html dictwiseerrors3-table.html

Comments:

Thanks for using the github.io (github project page) . I had not known it existed. Ditto for the markup

[here](url)

I would have factored things a bit differently, but yours works, so it's mostly a difference in styles:

php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php dictwisesorter-vx.php AllvsMW.txt AllvsMW-norepeat.txt   <-- same format as AllvsMW.txt -->
php faultfinder3a-html.php AllvsMW-norepeat.txt dictwiseerrors3-table.html   1
   add option to faultfinder3a-html to output in tabular form.  Do the IAST and pattern-highlighting here

You could leave the javascript, and still get the effect you prefer:

Instead of
window.open(href,\"dictionary\");
Use
window.open(href,\"_blank\");

The w3schools site has excellent information on how to do things with html, css, javascript, etc. including Html tables.

drdhaval2785 commented 9 years ago

@funderburkjim great as usual. I am planning to learn these languages, but my official duties don't leave much room for it. In my spare time, yes.

drdhaval2785 commented 9 years ago

Now the location for the suspect list is this. So, you can remove your list @funderburkjim to keep your github.io clutter free. Same with dropbox file of dictwisesorter-v3.php. This is also in spellcheck repository now.

N.B. - The list has reached the perfection required for a 'wrong entry' file. So no further enhancements please @gasyoun .

gasyoun commented 9 years ago

I keep silent, keep silent, and pray you two do not stop. I'll add my two cents in the missing script section. We do not want the words alphabetically sorted (inside each table), do we? "there are 375 such for Wilson, 230 such for MW72, 456 for CCS, only 23 remaining in CAE, etc." - can we have those statistics as well?