Open drdhaval2785 opened 8 years ago
hw1minushw2.txt - hw1.txt entries not found in hw2.txt
- what do you expect it to give? Please illustrate, I can't grasp it. Brain too weak. I wonder how many hw4.txt words were added after you killed terminal 'm' and 'H'. Maybe a list with links (with terminal 'm' and 'H' added again for the links to work) would help check the original thesis?
@gasyoun These minus files are meant for checking that no word is removed in the process which does not deserve to be removed. I don't see much of an issue in hw1minushw2 and hw2minushw3. But 'M' and 'H' removal is a bit tricky. Therefore that list (hw3minushw4.txt) needs to be checked properly.
But 'M' and 'H' removal is a bit tricky. - remember some tricky example?
https://github.com/sanskrit-lexicon/hwnorm1/blob/master/normalization/examine4.txt
Full of tricky examples.
Right now I am keeping them in the examine.txt file. Whatever is found OK goes to the next step. Otherwise it stops here.
6k lines is too much. Even a list of 100 feminine words wrongly tagged in MW would take me a month to work through. Where do you state the algo? Why is `yogyatAjYAnasyaSabdaMpratikAraRatAvicAraH` fishy? Because of `aH`? Are these words included? Or should they not be? Please add details.
Where do you state the algo?
The list holds the words which are not found in sanhw1.txt after removal of terminal 'm' and 'H'.
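A minimal sketch of that check, as one possible reading of the sentence above and not the actual code in hwnorm1.py. The function names `strip_terminal` and `split_candidates` are hypothetical, and the endings are passed as a parameter because the thread writes both 'm' and 'M'. It assumes the known headwords are available as a plain set of strings.

```python
def strip_terminal(word, endings=("m", "H")):
    """Drop a single trailing ending from word, if one is present."""
    for ending in endings:
        # Never strip a word down to the empty string.
        if word.endswith(ending) and len(word) > len(ending):
            return word[:-len(ending)]
    return word


def split_candidates(words, known_headwords, endings=("m", "H")):
    """Separate words whose stripped form is known from those needing review.

    Returns (accepted, examine):
      accepted - (word, stripped) pairs whose stripped form is a known headword
      examine  - words whose stripped form is not found, for manual checking
    """
    accepted, examine = [], []
    for word in words:
        stripped = strip_terminal(word, endings)
        if stripped == word:
            continue  # no terminal ending, so not a candidate at all
        if stripped in known_headwords:
            accepted.append((word, stripped))
        else:
            examine.append(word)
    return accepted, examine


# Toy data, invented for illustration only.
known = {"rAma", "deva"}
accepted, examine = split_candidates(["rAmaH", "devam", "vicAraH", "guru"], known)
```

With this toy input, `vicAraH` would land in the list to examine, because `vicAra` is not among the known headwords.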
6k lines is too much.
Now it is 2456 lines. I do not expect much further improvement from a computer algorithm; the rest is manual only. We are not in a hurry. At least we should not normalize words which don't deserve to be normalized.
So these are the possible additions to the European style dictionaries, understood. 2.5k is far better. Only a few years away. Have you found at least 1 wrong this way? I guess it's one of the hardest methods.
Right now the output is placed in the normalization subdirectory.
The responsible code is the function `countlen()` in `hwnorm1.py`. Let me document the steps.
There are four difference files generated in the process.
I hope someone will cursorily examine the files and decide whether we are on the right track or not.
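For anyone examining the files, here is a rough sketch of how a difference file such as hw1minushw2.txt (hw1.txt entries not found in hw2.txt) could be produced. It is an illustration only, not the code in hwnorm1.py: the helper names `minus` and `write_minus_file` are hypothetical, and it assumes each file holds one headword per line, which may not match the real file layout.

```python
def minus(before, after):
    """Return entries of `before` that do not occur in `after`, in original order."""
    after_set = set(after)  # set lookup keeps this fast for large headword lists
    return [word for word in before if word not in after_set]


def read_words(path):
    """Read one entry per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_minus_file(before_path, after_path, out_path):
    """Write the entries of before_path that are missing from after_path."""
    dropped = minus(read_words(before_path), read_words(after_path))
    with open(out_path, "w", encoding="utf-8") as handle:
        for word in dropped:
            handle.write(word + "\n")
    return len(dropped)
```

Under those assumptions, a call like `write_minus_file("hw1.txt", "hw2.txt", "hw1minushw2.txt")` would list every entry dropped between the two steps, and the returned count gives a quick sense of how much manual checking each file needs.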