Closed funderburkjim closed 9 years ago
I think most of the rxx words are not errors, but occur simply due to a spelling convention that MW does not follow (MW being the reference dictionary here); nearly half of the 9399 lines of AllvsMW.txt are of the rxx type, so this seems a useful reduction of the problem. This is just one example of how the huge problem can be reduced by this refactoring.
Absolutely true, and a worthwhile refactoring too.
Another filter would be to look for words which occur ONLY in one dictionary; e.g., there are 375 such for Wilson, 230 such for MW72, 456 for CCS, only 23 remaining in CAE, etc.
Yes. I informally use this to screen out headwords that are not errors. The logic is that if a word appears in more than one dictionary, it is less likely to be a wrong word. Makes sense.
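The single-dictionary filter can be sketched as follows. This is a minimal Python illustration, not the PHP used in the faultfinder programs; the three headwords and their dictionary lists are examples quoted later in this thread, while the in-memory layout (a dict from headword to dictionary codes) and the function names are assumptions and do not reflect the real sanhw1.txt format.

```python
from collections import Counter

# Headword -> dictionaries containing it. The entries are examples quoted in
# this thread; the dict layout is an assumption, not the sanhw1.txt format.
HEADWORDS = {
    "aMSumant": ["CAE", "PWG", "PW", "STC"],
    "akziRvant": ["CAE", "SCH"],
    "aganDavant": ["CAE"],
}


def single_dictionary_words(headwords):
    """Return {headword: dictionary} for words found in exactly one dictionary."""
    return {hw: dicts[0] for hw, dicts in headwords.items() if len(dicts) == 1}


def counts_per_dictionary(headwords):
    """Count the single-dictionary headwords per dictionary code."""
    return Counter(single_dictionary_words(headwords).values())


singles = single_dictionary_words(HEADWORDS)
print(singles)
print(counts_per_dictionary(HEADWORDS))
```

Run over the full headword list, the same counter would yield per-dictionary totals of the kind quoted above (375 for Wilson, 230 for MW72, and so on).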
Smaller refactorings: a. The only data source is sanhw1.txt. The list of reference headwords (for MW in the example) is obtained from sanhw1.txt, rather than from a separate MWslp.txt file. b. The pattern and hrefyear data have been moved to a utility file faultfinder3a_utils.php, which is included in the two main php programs. The pattern data has also been put into a structure which should be easier to modify, and is easier to understand. c. In faultfinder3a-html.php, the links in the output are generated via Javascript as buttons. This also makes the html file size smaller, and helps localize the programmatic construction of the links.
I would not term any of these small. They are great additions / alterations.
The faultfinder3a package is now in the spellcheck repository.
postprocess_scraping.php has been renamed to dictwisesorter.php.
It has been amended to run from the command line:
php dictwisesorter.php AllvsMW-new.html dictwiseerrors1.html
The input comes from AllvsMW-new.html.
The output goes into dictwiseerrors1.html.
So now the steps are:
php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php faultfinder3a-html.php AllvsMW.txt AllvsMW-new.html
php dictwisesorter.php AllvsMW-new.html dictwiseerrors1.html
This will give us the list we need to work on. The file size has come down from 19 MB to 3 MB because of @funderburkjim's javascript wizardry, so it is more amenable to opening in a browser. Now we need to apply our minds to the filtering that the refactoring makes possible. Let's think it over.
1. Ignore rXX. 2. Make a separate list of words occurring in a single dictionary. Let's enumerate more and code for them, so that the workload decreases.
A 3rd addition would be to ignore 'nt' at the end. I saw many of them under CAE in dictionarywiseerrors.html, e.g.:
aMSumant - अंशुमन्त् - CAE, PWG, PW, STC
akutsayant - अकुत्सयन्त् - CAE, SCH
akurvant - अकुर्वन्त् - CAE, CCS, PW, STC
akopayant - अकोपयन्त् - CAE, CCS, SCH
akzaRvant - अक्षण्वन्त् - CAE, CCS, PWG, PW
akziRvant - अक्षिण्वन्त् - CAE, SCH
agaRayant - अगणयन्त् - CAE, CCS, SCH, STC
aganDavant - अगन्धवन्त् - CAE
We should ignore them. They too are a different convention followed by some dictionaries. But we should devise some way to give access to the data under both 'aMSumant' and 'aMSumat' when a user searches for this word. For a native Indian user 'nt' is too foreign. It is grammatically wrong too. But as many dictionaries have used this convention, it doesn't seem a great idea to change them to 't'. Access to the data should be there for sure, though.
N.B. - I could not see these patterns being detected in AllvsMW-new.html (generated via faultfinder3a.php). They were there in AllvsMW.html (generated via faultfinder3.php). So, @funderburkjim, did you already make some coding change to ignore such cases? I am not sure.
@funderburkjim well done, as usual. A few questions / thoughts before going to sleep, converted to .xls https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/AllvsMW.xls. 1) "nearly half of the 9399 lines of AllvsMW.txt are of the rxx type" there is no easy way to sort out the "rxx type" now.
sOvOryya sauvauryya VCV OvO SHS
This can be found only by OvO and not by ryy or, even better, by an RCC pattern. We have to develop rules and triggers. Having just CCC would not tell much, but RCC changes the game. Whether it is ryy or rll does not matter much; I hope both of you will agree. It's only about the structure.
2) "look for words which occur ONLY in one dictionary" - it's something Dhaval and I have done in the past. I would only add that there should be an additional column counting the number of sources, with color formatting in .xls or .html. Would sorting by the number of sources be good? I guess not; alphabetical order has its good sides. Dhaval, would it help to have the words with only one source on a yellow background? If only SKD and VCP have a word which no others have, that might be fishy as well, so even two (similar) sources should still trigger another rule, such as 'only Indian dictionaries' versus 'Indian and European sources'.
3) I can read the VCCCCCV pattern, but I need to know whether the S in SCC is a real-life s. And CCE: I have never seen it, and I do not think vowels should be differentiated in our case.
4) "But as many dictionaries have used this convention - doesn't seem great idea to change them to 't'. But access to data should be there for sure." - it's Böhtlingk's approach, he is to blame for it all. It's called European linguistics. So I guess interlinking takes this task too far. Then we would have to interlink much more, and it grows into something gigantic which I think we are not yet ready even to approach.
5) Please make a repository for the valuable sanhw1.zip file, or upload it to an existing one. Dropbox is for experiments. sanhw1.zip is no longer one; it's a huge step for mankind. Let it have its legal place.
re 5) 'repository for sanhw1' - Currently, the right spot for sanhw1.txt seems to be in drdhaval2785/SanskritSpellCheck/ repository.
Question: Is there a way for me to upload just sanhw1.txt to this repository? I think I would have to clone the entire repository in the GitHub client, and then worry about syncing, etc. To me, it seems simpler to simply send a dropbox link to Dhaval, and let him worry about the syncing.
Another possibility would be to put sanhw1 in the Corrections repository. Since I am currently responsible for maintaining Corrections, this would be straightforward. But then Dhaval would still need to get it over to where it is needed in SanskritSpellCheck.
So, I'm not sure of the best way to go.
re 3) S in SCC : Here, the 'S' is not a Sanskrit letter, but means 'at the start of headword'. It is the regex '^' at the beginning of a pattern.
Similarly, 'E' in those pattern codes means 'at the end of headword', the regex '$' at end of a pattern.
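To make the role of 'S' and 'E' concrete, here is a small Python sketch (not the PHP in faultfinder3a_utils.php) that turns a pattern code into a regex. The SLP1 letter classes, including the choice to count M and H as consonants, and the function name `code_to_regex` are assumptions for illustration; the structure the real program uses may differ.

```python
import re

# Assumed SLP1 letter classes; the sets faultfinder3a really uses may differ.
VOWELS = "aAiIuUfFxXeEoO"
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshMH"


def code_to_regex(code):
    """Translate a pattern code such as 'SCC' or 'CCE' into a regex string.

    A leading 'S' becomes '^' and a trailing 'E' becomes '$'; every
    remaining 'C' or 'V' becomes the matching character class.
    """
    prefix = suffix = ""
    if code.startswith("S"):
        prefix, code = "^", code[1:]
    if code.endswith("E"):
        suffix, code = "$", code[:-1]
    classes = {"C": f"[{CONSONANTS}]", "V": f"[{VOWELS}]"}
    return prefix + "".join(classes[ch] for ch in code) + suffix


# 'SCC': two consonants at the start of the headword.
print(bool(re.search(code_to_regex("SCC"), "kzaRa")))
# 'CCE': two consonants at the end of the headword.
print(bool(re.search(code_to_regex("CCE"), "aMSumant")))
```

Read this way, the 'S' and 'E' of the codes never collide with the SLP1 letters S and E, because they occur only at the edges of a code.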
re 2) sOvOryya sauvauryya VCV OvO SHS.
There are two ways to filter for the 'rxx' words in AllvsMW.txt.
1. "=[^:]*r(.)\1" This finds an rxx in the pattern value. There are 4613 such currently.
2. "r(.)\1" This finds an rxx anywhere (e.g., anywhere in a headword). There are 4666 of these.
So, there are only 53 cases like sOvOryya (4666 - 4613). I guess it would be better to get rid of the 4613, so that we explicitly keep cases like sOvOryya in the list of suspects.
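The two regexes above can be combined into a three-way split, sketched here in Python rather than PHP. The two patterns are copied from the comment; the sample lines and their "headword:TYPE=pattern:dictionaries" layout are hypothetical stand-ins, since the real AllvsMW.txt layout is not reproduced in this thread.

```python
import re

RXX_IN_PATTERN = re.compile(r"=[^:]*r(.)\1")  # regex 1: rxx in the pattern value
RXX_ANYWHERE = re.compile(r"r(.)\1")          # regex 2: rxx anywhere in the line


def split_rxx(lines):
    """Partition lines into (rxx in pattern value, rxx elsewhere only, no rxx)."""
    in_pattern, elsewhere, neither = [], [], []
    for line in lines:
        if RXX_IN_PATTERN.search(line):
            in_pattern.append(line)
        elif RXX_ANYWHERE.search(line):
            elsewhere.append(line)
        else:
            neither.append(line)
    return in_pattern, elsewhere, neither


# Hypothetical lines in an assumed layout; only the headword sOvOryya and its
# OvO pattern come from this thread.
SAMPLE = [
    "karmma:CCC=rmm:WIL",
    "sOvOryya:VCV=OvO:SHS",
    "taCeka:VCV=aCe:CAE",
]

in_pattern, elsewhere, neither = split_rxx(SAMPLE)
print(len(in_pattern), len(elsewhere), len(neither))
```

Dropping the first bucket and keeping the second corresponds to removing the 4613 lines while retaining the 53 cases like sOvOryya.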
@drdhaval2785 re Where are the ants in AllvsMW-new.html?
Since AllvsMW-new.html is generated from AllvsMW.txt, the question reverts to : Where are the ants in AllvsMW.txt?
Looking at sanhw1.txt, there are 23 words in MW that end in 'nt' (two of these end in 'ant').
Since pattern 'nt at end of word' appears in the reference dictionary (MW), no word in any other dictionary will be put in the suspect list (AllvsMW.txt) on the basis of the word ending in 'nt'.
(I see 18 words ending in 'nt' in AllvsMW.txt, but they are present because of other patterns.)
So, in conclusion I see no problem with AllvsMW.txt, unless there is some problem of the logic of faultfinder that I'm misunderstanding.
In fact, I am puzzled that (in AllvsMW.html from faultfinder3.php) himavant (for example) is viewed as suspect.
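The reasoning above can be illustrated with a toy Python sketch. It assumes, as a simplification, that a "pattern" is just the last two letters of a headword; the real faultfinder patterns are richer. The reference list uses a made-up token ('Xant') to stand for any MW headword ending in 'nt', and 'karmm' is likewise invented, so none of the data below is actual dictionary content apart from himavant and akurvant, which are quoted in this thread.

```python
def ending(word, n=2):
    """Use the last n letters as a toy stand-in for an end-of-word pattern."""
    return word[-n:]


def suspects(reference_words, other_words, n=2):
    """Flag words whose ending never occurs among the reference headwords."""
    known = {ending(w, n) for w in reference_words}
    return [w for w in other_words if ending(w, n) not in known]


# Toy lists: 'Xant' and 'karmm' are invented placeholders, not real entries.
REFERENCE = ["Xant", "aMSumat"]
OTHERS = ["himavant", "akurvant", "karmm"]

print(suspects(REFERENCE, OTHERS))
```

Because the 'nt' ending occurs in the reference list, himavant and akurvant are not flagged on that basis; only the word with an ending unknown to the reference list is reported.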
I cloned the repository and ran faultfinder3.php . I changed only one line in the program, line 228
fputs($outfile,givelink($dictdata[$j],$worddata[$j])."</br>\n");
adding the '\n' so the output file would have separate lines. In this rerun, only 16 lines have an 'ant ', in close agreement with AllvsMW.txt. This run uses the new version of sanhw1 (which has 2000+ 'ant' words), and the same MWslp.txt.
Probably there was some oddity in the run that generated the ants in AllvsMW.html, and it's not worth spending the time to try to understand that oddity by doing a git rollback.
@funderburkjim
re Where are the ants in AllvsMW-new.html?
I wanted to trace the problem in order to silence it. It is silent now anyway, so we need not bother.
@funderburkjim and @gasyoun
2) "look for words which occur ONLY in one dictionary"
I guess right now we should remove these r-c-c patterns. This would substantially reduce our file size with nothing to lose.
@gasyoun
3) I can read VCCCCCV pattern
In most dictionaries the longest run of consecutive consonants is probably the one in kArtsnya: C(VCCCCCV).
re 'remove r-c-c' patterns for now. Agreed.
re 5) 'repository for sanhw1' - Currently, the right spot for sanhw1.txt seems to be in drdhaval2785/SanskritSpellCheck/ repository.
Let me tell you @funderburkjim that I may not be the only one using this file. In my view the right place is the CORRECTIONS repository. I will copy and paste from the raw data on GitHub; that is not a problem. One more suggestion regarding that file: can you use \r\n instead of only \n? Windows applications like Notepad expect \r\n as the end-of-line sequence, so I see everything on a single line, which is difficult to read.
Dhaval, Notepad is not intended for that job. Use Notepad++, EditPlus or EmEditor - do not use Notepad, please. sanhw1 is not only for corrections, but let it live there if Jim agrees. It's not there yet.
sanhw1 added to CORRECTIONS repository.
Using notepad++ now. No issue of \r\n survives now. That is closed from my side.
As discussed in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/42#issuecomment-64976242 point 2
2 separate list of words occurring in single dictionary.
faultfinder3a-html.php has been modified to give a list of unique suspicious words by default.
Now we can do
php faultfinder3a-html.php AllvsMW.txt AllvsMW-norepeat.html
to get this. On that file we can do
php dictwisesorter.php AllvsMW-norepeat.html dictwiseerrors2.html
to get this. The data has come down to 888 KB from the earlier 19 MB.
N.B. If you want the list with words repeated across dictionaries, pass 1 as the third argument.
php faultfinder3a-html.php AllvsMW.txt AllvsMW-repeat.html 1
Point 1, ignoring the rCC pattern, is also the default now.
1 ignore rXX
It is now the default in faultfinder3a-html.php, committed via c78e0a52134da58f42ca356af100423e7bce2e08.
This will give you this result.
Regarding @gasyoun 's comment
1) "nearly half of the 9399 lines of AllvsMW.txt are of the rxx type" there is no easy way to sort out the "rxx type" now.
To be precise, AllvsMW.txt has 9398 entries, and AllvsMW-norepeat.html has around 4400 entries.
N.B. If you want to keep the words with the rCC pattern as well, pass 2 as the third argument.
php faultfinder3a-html.php AllvsMW.txt AllvsMW-repeat.html 2
Just to note - I have reverted faultfinder3a-html.php to use direct href links rather than buttons created via javascript. Reasons: 1. I want to open at least 5-6 tabs before I start exploring. In the present system, a button opens the new word in the same tab, which is not convenient to handle. 2. The file size is now around 900 KB, so direct links are much more manageable. The browser is also happy now.
As there is not much to add on this topic, let's close this.
Well done. Was not the part of the word in discussion formatted bold before?
@gasyoun No. It was not. But now it is. http://drdhaval2785.github.io/dictwiseerrors3.html
Bold looks better. I would love to see IAST in addition, guess I'm the only one. And table form, not just a list.
IAST added here. But tables are beyond my HTML capacities.
A modification of dictwisesorter was made so that the output has tables.
The output is here
Here is the program sequence:
php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php faultfinder3a-html.php AllvsMW.txt AllvsMW-norepeat.html
php dictwisesorter-v3.php AllvsMW-norepeat.html dictwiseerrors3-table.html
Comments:
Thanks for using the github.io (github project page). I had not known it existed. Ditto for the markup [here](url).
I would have factored things a bit differently, but yours works, so it's mostly a difference in styles:
php faultfinder3a.php MW sanhw1.txt AllvsMW.txt
php dictwisesorter-vx.php AllvsMW.txt AllvsMW-norepeat.txt <-- same format as AllvsMW.txt -->
php faultfinder3a-html.php AllvsMW-norepeat.txt dictwiseerrors3-table.html 1
add option to faultfinder3a-html to output in tabular form. Do the IAST and pattern-highlighting here
You could leave the javascript, and still get the effect you prefer:
Instead of
window.open(href,\"dictionary\");
Use
window.open(href,\"_blank\");
The w3schools site has excellent information on how to do things with html, css, javascript, etc. including Html tables.
@funderburkjim great as usual. I am planning to learn these languages, but my official duties don't leave much room for it. In my spare time, yes.
Now the location for the suspect list is this. So, you can remove your list @funderburkjim to keep your github.io clutter free. Same with dropbox file of dictwisesorter-v3.php. This is also in spellcheck repository now.
N.B. - The list has reached the perfection required for a 'wrong entry' file. So no further enhancements please @gasyoun .
I keep silent, keep silent and pray you two do not stop. I'll add my two cents in the missing script section. We do not want the words alphabetically sorted (inside each table), do we? "there are 375 such for Wilson, 230 such for MW72, 456 for CCS, only 23 remaining in CAE, etc." - can we have those statistics as well?
This is a continuation of the discussion beginning 3 days ago in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/39.
There are two dropbox links: a. faultfinder3a at https://dl.dropboxusercontent.com/u/29859999/AllvsMW.zip
b. sanhw1 at https://dl.dropboxusercontent.com/u/29859999/sanhw1.zip
refactoring faultfinder3
faultfinder3a.php now is a command-line program which creates a text file. Then, faultfinder3a-html.php creates an html file based on the text file. Here's the invocation used:
The text output AllvsMW.txt contains lines like:
This shows the suspect headword and the dictionaries it is in. Also, the middle field shows the pattern-abbreviation-type (VCV) = actual-pattern-in-word (aCe). It should now be easy to filter this text file in various ways to generate more manageable chunks. For example, I think most of the rxx words are not errors, but occur simply due to a spelling convention that MW does not follow (MW being the reference dictionary here); nearly half of the 9399 lines of AllvsMW.txt are of the rxx type, so this seems a useful reduction of the problem. This is just one example of how the huge problem can be reduced by this refactoring.
Another filter would be to look for words which occur ONLY in one dictionary; e.g., there are 375 such for Wilson, 230 such for MW72, 456 for CCS, only 23 remaining in CAE, etc.
We can use this issue to develop a list of useful filtered lists, and then begin working with the .html files for the various lists.
Smaller refactorings: a. The only data source is sanhw1.txt. The list of reference headwords (for MW in the example) is obtained from sanhw1.txt, rather than from a separate MWslp.txt file. b. The pattern and hrefyear data have been moved to a utility file faultfinder3a_utils.php, which is included in the two main php programs. The pattern data has also been put into a structure which should be easier to modify, and is easier to understand. c. In faultfinder3a-html.php, the links in the output are generated via Javascript as buttons. This also makes the html file size smaller, and helps localize the programmatic construction of the links.
change sort order in sanhw1
In response to a request elsewhere, I have changed the sort order appearing in sanhw1.txt. Now, M+Consonant sorts as if it were N+Consonant, where N is the nasal for the varga of the Consonant; (this does not apply to consonants yrlvSzsh; M before these consonants is unchanged for sorting).
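The sort rule can be sketched in Python as a sort key (the actual change was made in whatever generates sanhw1.txt, which is not shown here). The base SLP1 alphabetical order, the decision to count each class nasal as a member of its own varga, and the names `normalize` and `sort_key` are assumptions for illustration.

```python
# Assumed SLP1 alphabetical order; sanhw1.txt may use a different base order.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}

# (members of the varga, its class nasal); the nasal is counted as a member.
VARGAS = [("kKgGN", "N"), ("cCjJY", "Y"), ("wWqQR", "R"),
          ("tTdDn", "n"), ("pPbBm", "m")]
NASAL_FOR = {c: nasal for members, nasal in VARGAS for c in members}


def normalize(word):
    """Replace M by the class nasal when a varga consonant follows it."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        # M before y r l v S z s h, before a vowel, or at the end is kept.
        out.append(NASAL_FOR.get(nxt, ch) if ch == "M" else ch)
    return "".join(out)


def sort_key(word):
    """Rank letters of the normalized word; unknown letters sort last."""
    return [RANK.get(ch, len(SLP1_ORDER)) for ch in normalize(word)]


print(normalize("aMka"), normalize("aMSa"))
print(sorted(["aMga", "aNka"], key=sort_key))
```

Only the sort key is normalized; the headword spellings themselves are left untouched, which matches the statement that M "sorts as if" it were the class nasal.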
Also, after redoing sanhw1, the list is several hundred entries shorter than a week ago, due to the many CAE and other headword corrections.