Remove Tamil from sanhw1 / sanhw2

drdhaval2785 commented 8 years ago

IEG dictionary has a lot of Tamil headwords. It also has an identifiable pattern for Tamil headwords.

r' Tamil[;]'

Total 375 such entries.
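A minimal sketch of how a count like this could be reproduced with the pattern above. The `(headword, text)` pair layout, the function name `find_tamil_headwords`, and the sample entries are assumptions made for illustration; they are not taken from the actual IEG source files.

```python
import re

# The pattern quoted in the issue. Note the leading space before "Tamil".
TAMIL_MARKER = re.compile(r' Tamil[;]')


def find_tamil_headwords(entries):
    """Return the headwords whose entry text contains the Tamil marker.

    `entries` is an iterable of (headword, text) pairs. How those pairs
    are read from the IEG digitization is deliberately left out, since
    the file layout is not described in this thread.
    """
    return [headword for headword, text in entries if TAMIL_MARKER.search(text)]


# Invented sample entries, only to show the matching behaviour.
sample = [
    ("headword_a", "(EI 12), Tamil; a kind of tax."),
    ("headword_b", "(IE 8-3), a village officer."),
    ("headword_c", "(SITI), Tamil; land measure."),
]

tamil = find_tamil_headwords(sample)
print(len(tamil), tamil)
```

Run against the real IEG entries, the length of the returned list would be the figure to compare with the 375 quoted above.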

If they are removed from sanhw1.txt / sanhw2.txt, nothing would be lost, and the lists would actually be a lot cleaner. After all, they are sanskritheadword1 and sanskritheadword2.

gasyoun commented 8 years ago

I would add a tag for them, so they could be cleaned after, but not at sanhw1.txt level. Dhaval, if we kill Tamil, we will have to kill Arabic, English, Greek and Prakrit words - known and unknown. It's a fact that Sanskrit dictionaries have always included Prakrit words. This itself is a major field of research and one I would not get into. I propose to mark such words, to have them killed in some sanhw3.txt, but not in the general headword index. It's a matter of principle. I agree we should try to ease the batch error identification process, but only as a byproduct, not changing the initial word list. @funderburkjim agree?

drdhaval2785 commented 8 years ago

There is no need to kill Arabic, English etc. I am NOT asking to kill foreign entries from ALL dictionaries. I AM asking to kill them from IEG headwords incorporated in sanhw1.txt and sanhw2.txt.

I am suggesting this because they will generate all sorts of wrong secondary data. Of course I can code to ignore IEG for all practical purposes, but it doesn't make sense to add tons of foreign words to the sanhw1.txt and sanhw2.txt lists only because IEG has them. If any other dictionary has them, keep them.

At sanhw1.txt or sanhw2.txt level, I don't have any mechanism to identify whether the word is marked Tamil in IEG or not.

gasyoun commented 8 years ago

Let's introduce additional list files of foreign words. When working with sanhw1.txt, let's parse them too as a preliminary step.

funderburkjim commented 8 years ago

It seems that a practical solution to the IEG problem is for Dhaval to maintain a list of words that he identifies as needing to be excluded from his analysis. Then, using this list and the existing sanhw1.txt, he can easily construct a private version of sanhw1.txt in which these exclusions have been made; by private I mean that it will NOT be part of the CORRECTIONS repository, but a file on Dhaval's local computer. Then, in one of the selection programs he develops, he can use this private version of sanhw1 for filtering.
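The approach described above might be sketched as follows. The `headword:DICT1,DICT2` line layout, the function name `exclude_from_dictionary`, and the sample headwords are assumptions for illustration only; the real sanhw1.txt layout should be checked before adapting this. In line with Dhaval's request, an excluded headword loses only its IEG attribution and is dropped entirely only when no other dictionary attests it.

```python
def exclude_from_dictionary(lines, excluded, dict_code="IEG"):
    """Build a filtered copy of a headword list.

    Assumes each line looks like 'headword:DICT1,DICT2' (hypothetical
    layout). For headwords in `excluded`, `dict_code` is removed from
    the dictionary list; the line is dropped if nothing else remains.
    """
    kept = []
    for raw in lines:
        line = raw.rstrip("\n")
        if ":" not in line:
            # Unknown layout: pass the line through untouched.
            kept.append(line)
            continue
        headword, _, codes = line.partition(":")
        dicts = [code for code in codes.split(",") if code]
        if headword in excluded and dict_code in dicts:
            dicts.remove(dict_code)
            if not dicts:
                # Only the excluded dictionary attested this headword.
                continue
        kept.append(f"{headword}:{','.join(dicts)}")
    return kept


# Invented sample data, not real sanhw1.txt content.
sample_lines = ["word_a:IEG,MW", "word_b:IEG", "word_c:MW,AP"]
private_list = exclude_from_dictionary(sample_lines, {"word_a", "word_b"})
print(private_list)
```

The result would be written to a separate file, leaving the shared sanhw1.txt unchanged.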

@drdhaval2785 Would this approach solve the problem?

gasyoun commented 8 years ago

@funderburkjim it can be part of CORRECTIONS under a different name. We created this place, because in 2014 I had too many of those local files. Too many, Jim :hamster:

funderburkjim commented 8 years ago

@drdhaval2785 OK. So you don't want another local file. And I don't want to clutter up CORRECTIONS repository. Hmm.

What if we made an IEG repository, and put a modified sanhw1-whatever in there? Would that work for you?

funderburkjim commented 8 years ago

@drdhaval2785 If that idea doesn't suit, I'm inclined to go with your original suggestion, and to remove the IEG-tamil headwords from sanhw1/2, at least for the duration of your headword studies.

gasyoun commented 8 years ago

Please do not remove them. The studies will last longer than expected.

drdhaval2785 commented 8 years ago

OK. Considering the issues raised by Jim and Gasyoun, I have excluded IEG from the list of tested dictionaries for now. So Tamil or no Tamil would not make a difference to me. Treat the issue as closed.

funderburkjim commented 8 years ago

Thanks, Dhaval!

gasyoun commented 8 years ago

@drdhaval2785 not to say I agree, but still.

sanskrit-lexicon / CORRECTIONS

Remove Tamil from sanhw1 / sanhw2 #234