Normalization steps - Githubissues

drdhaval2785 commented 8 years ago

Right now the output is placed in normalization subdirectory.

Responsible code is function countlen() in hwnorm1.py.

Let me document the steps.

hw1.txt - headwords of sanhw1.txt sorted alphabetically (python order. Not Sanskrit order).
hw2.txt - hw1.txt after normalization of anusvAra ([NYRnm][consonant] -> M[consonant]. Also terminal 'M' converted to 'm')
hw3.txt - hw2.txt after normalization of duplication ( r[consonant][consonant] -> r[consonant] conversion).
hw4.txt - hw3.txt after normalization for 'ant' at end.
hw5.txt - hw4.txt after normalization of terminal 'm' and 'H' ( [aA][mH]$ -> [aA]$ )
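
To make the rules above concrete, here is a minimal sketch of steps 2, 3 and 5 as regular-expression substitutions. It is not the code of countlen() in hwnorm1.py. The function names, the SLP1 consonant inventory in `CONSONANTS`, and the reading of "duplication" as the same consonant written twice after `r` are all assumptions made for illustration. Step 4 ('ant' at end) is left out because the thread does not say what the replacement is.

```python
import re

# Assumed SLP1 consonant inventory; the thread only says "[consonant]".
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def normalize_anusvara(word):
    """Step 2: nasal before a consonant -> 'M'; terminal 'M' -> 'm'."""
    # The lookahead keeps the following consonant in place.
    word = re.sub(r"[NYRnm](?=[" + CONSONANTS + r"])", "M", word)
    return re.sub(r"M$", "m", word)


def normalize_duplication(word):
    """Step 3: r + doubled consonant -> r + single consonant (assumed reading)."""
    return re.sub(r"r([" + CONSONANTS + r"])\1", r"r\1", word)


def strip_terminal_m_h(word):
    """Step 5: drop a terminal 'm' or 'H' that follows 'a' or 'A'."""
    return re.sub(r"([aA])[mH]$", r"\1", word)
```

Applied literally, step 2 also rewrites clusters such as `nm` or `ny`, because the rule as stated does not restrict which consonant follows the nasal.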

There are four difference files generated in the process.

hw1minushw2.txt - hw1.txt entries not found in hw2.txt
hw2minushw3.txt - hw2.txt entries not found in hw3.txt
hw3minushw4.txt - hw3.txt entries not found in hw4.txt
hw4minushw5.txt - hw4.txt entries not found in hw5.txt
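
A difference file of this kind is a plain set difference between two headword lists. The sketch below shows one way to produce it. The helper names and the one-headword-per-line file layout are assumptions, not taken from hwnorm1.py.

```python
def entries_missing_from(before, after):
    """Return the entries of `before` that do not occur in `after`, in Python sort order."""
    after_set = set(after)
    return sorted(w for w in set(before) if w not in after_set)


def write_difference(before_path, after_path, out_path):
    """Write e.g. hw1minushw2.txt from hw1.txt and hw2.txt (one headword per line assumed)."""
    with open(before_path, encoding="utf-8") as f:
        before = [line.strip() for line in f if line.strip()]
    with open(after_path, encoding="utf-8") as f:
        after = [line.strip() for line in f if line.strip()]
    with open(out_path, "w", encoding="utf-8") as f:
        for word in entries_missing_from(before, after):
            f.write(word + "\n")
```

Every entry in such a file is a spelling that a normalization step changed, so reading it shows exactly which headwords the step touched.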

I hope someone will skim the files and decide whether we are on the right track or not.

gasyoun commented 8 years ago

hw1minushw2.txt - hw1.txt entries not found in hw2.txt - what do you expect it to give? Please illustrate, can't grasp. Brain too weak. I wonder how many hw4.txt words were added after you killed terminal 'm' and 'H', maybe a list with links (with terminal 'm' and 'H' added again for the links to work) would help check the original thesis?

drdhaval2785 commented 8 years ago

@gasyoun These minus files are meant for checking that no entry was normalized away without deserving it. I don't see much issue in hw1minushw2 and hw2minushw3. But 'M' and 'H' removal is a bit tricky. Therefore that list (hw3minushw4.txt) needs to be checked properly.

gasyoun commented 8 years ago

But 'M' and 'H' removal is a bit tricky. - remember some tricky example?

drdhaval2785 commented 8 years ago

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/normalization/examine4.txt

Full of tricky examples.

Right now keeping them in examine.txt file. Whatever found OK - goes to the next step. Otherwise dies here.

gasyoun commented 8 years ago

6k lines is too much. Even a list of 100 feminine words wrongly tagged in MW would take me a month. Where do you state the algo? Why is yogyatAjYAnasyaSabdaMpratikAraRatAvicAraH fishy? Because of aH? Are these words included, or should they not be? Please add details.

drdhaval2785 commented 8 years ago

> Where do you state the algo?

The list holds the words that are not found in sanhw1.txt once their terminal 'm' or 'H' is removed.
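
One possible reading of that rule, sketched in Python: a headword is flagged when it ends in `[aA][mH]` and its stripped form does not occur among the headwords. The function name and the in-memory headword list are hypothetical, and the real selection in hwnorm1.py may differ.

```python
import re


def needs_examination(headwords):
    """Return headwords whose form without terminal 'm'/'H' is not itself a headword."""
    known = set(headwords)
    flagged = []
    for word in sorted(known):
        stripped = re.sub(r"([aA])[mH]$", r"\1", word)
        # Skip words the rule does not touch, and words whose stripped form exists.
        if stripped != word and stripped not in known:
            flagged.append(word)
    return flagged
```

For example, with headwords `rAma`, `rAmaH` and `Palam`, only `Palam` would be flagged, because `Pala` is not in the list while `rAma` is.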

> 6k lines is too much.

Now it is 2456. I don't expect much further improvement from a computer algorithm; the rest is manual work. We are not in a hurry. At the very least, we should not normalize words that don't deserve to be normalized.

gasyoun commented 8 years ago

So these are the possible additions to the European style dictionaries, understood. 2.5k is far better. Only a few years away. Have you found at least 1 wrong this way? I guess it's one of the hardest methods.

sanskrit-lexicon / hwnorm1

Normalization steps #3