Closed drdhaval2785 closed 8 years ago
I would add a tag for them, so they could be cleaned after, but not at sanhw1.txt level. Dhaval, if we kill Tamil, we will have to kill Arabic, English, Greek and Prakrit words - known and unknown. It's a fact that Sanskrit dictionaries have always included Prakrit words. This itself is a major field of research and one I would not get into. I propose to mark such words, to have them killed in some sanhw3.txt, but not in the general headword index. It's a matter of principle. I agree we should try to ease the batch error identification process, but only as a byproduct, not changing the initial word list. @funderburkjim agree?
There is no need to kill Arabic, English etc. I am NOT asking to kill foreign entries from ALL dictionaries. I AM asking to kill them from IEG headwords incorporated in sanhw1.txt and sanhw2.txt.
I am suggesting this because they will generate all sorts of wrong secondary data. Of course I can code to ignore IEG for all practical purposes, but it doesn't make sense to add tons of foreign words to the sanhw1.txt and sanhw2.txt lists only because IEG has them. If any other dictionary has them, keep them.
At sanhw1.txt or sanhw2.txt level, I don't have any mechanism to identify whether the word is marked Tamil in IEG or not.
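The rule described above (drop a headword only when IEG is its sole source, keep it if any other dictionary has it) could be sketched roughly as follows. This is only an illustration under assumptions: it supposes each line of sanhw1.txt looks like `headword:DICT1,DICT2`, which should be checked against the real file, and it supposes a set of Tamil-marked IEG headwords has already been extracted from IEG itself. The function name `drop_ieg_only` and the sample headwords are hypothetical.

```python
def drop_ieg_only(line, tamil_headwords):
    """Apply the 'IEG-only' rule to one headword line.

    ASSUMPTION: a line has the form 'headword:DICT1,DICT2,...'.
    Returns the (possibly rewritten) line, or None if the headword
    should disappear because IEG was its only source.
    """
    headword, _, dicts = line.strip().partition(":")
    if headword not in tamil_headwords:
        # Not a Tamil-marked IEG headword: leave it untouched.
        return line.strip()
    # Remove IEG from the list of dictionaries attesting this headword.
    remaining = [d for d in dicts.split(",") if d and d != "IEG"]
    if not remaining:
        return None
    return headword + ":" + ",".join(remaining)


# Made-up sample data, not taken from the real files.
tamil = {"hw1", "hw2"}
print(drop_ieg_only("hw1:IEG", tamil))
print(drop_ieg_only("hw2:IEG,MW", tamil))
print(drop_ieg_only("hw3:IEG", tamil))
```

With this rule a headword attested elsewhere survives with its other dictionaries, so only entries that exist purely because of IEG are dropped.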
Let's introduce additional list files with foreign words. When working with sanhw1.txt, let's parse them as well in a preliminary step.
It seems that a practical solution to the IEG problem is for Dhaval to maintain a list of words that he identifies as needing to be excluded from his analysis. Then, using this list and the existing sanhw1.txt, he can easily construct a private version of sanhw1.txt in which these exclusions have been made; by private I mean that it will NOT be a part of the CORRECTIONS repository, but a file on Dhaval's local computer. Then, in one of the selection programs he develops, he can use this private version of sanhw1 for filtering.
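A minimal sketch of that workflow, assuming the exclusion list is a plain text file with one headword per line and that each sanhw1.txt line starts with the headword followed by `:` (both are assumptions, not confirmed here). The function names and any file names passed to them are hypothetical.

```python
def load_exclusions(path):
    """Read the exclusion list: one headword per line (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def filter_headwords(lines, excluded):
    """Yield only the lines whose headword is not excluded.

    ASSUMPTION: the headword is the text before the first ':' on a line.
    """
    for line in lines:
        headword = line.split(":", 1)[0].strip()
        if headword not in excluded:
            yield line


def write_filtered_copy(src_path, exclusions_path, dst_path):
    """Write a filtered copy of the headword list; the source is not modified."""
    excluded = load_exclusions(exclusions_path)
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_headwords(src, excluded))
```

Because the original file is only read, the general headword index stays intact and the filtered copy can live wherever is convenient.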
@drdhaval2785 Would this approach solve the problem?
@funderburkjim it can be part of CORRECTIONS under a different name. We created this place, because in 2014 I had too many of those local files. Too many, Jim :hamster:
@drdhaval2785 OK. So you don't want another local file. And I don't want to clutter up CORRECTIONS repository. Hmm.
What if we made an IEG repository, and put a modified sanhw1-whatever in there? Would that work for you?
@drdhaval2785 If that idea doesn't suit, I'm inclined to go with your original suggestion, and to remove the IEG-tamil headwords from sanhw1/2, at least for the duration of your headword studies.
Please do not remove them. The studies will last longer than expected.
OK. Considering the issues raised by Jim and Gasyoun, I have dropped IEG from the list of tested dictionaries for now. So Tamil or no Tamil would not make a difference to me. Treat the issue as closed.
Thanks, Dhaval!
@drdhaval2785 not to say I agree, but still.
IEG dictionary has a lot of Tamil headwords. It also has an identifiable pattern for Tamil headwords.
r' Tamil[;]'
There are 375 such entries in total.
If they are removed from sanhw1.txt / sanhw2.txt, nothing would be lost and they would be a lot cleaner actually. After all they are sanskritheadword1 and sanskritheadword2.
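As a rough illustration of how the pattern quoted above could be used to collect the Tamil headwords, here is a short sketch. It assumes the IEG entries are available as (headword, entry text) pairs; that layout, the function name `find_tamil_headwords`, and the sample entries are all made up for the example and are not taken from the real IEG data.

```python
import re

# The pattern quoted in the post: " Tamil" immediately followed by ";".
TAMIL_PATTERN = re.compile(r' Tamil[;]')


def find_tamil_headwords(entries):
    """Return the headwords whose entry text matches the Tamil pattern.

    ASSUMPTION: `entries` is an iterable of (headword, text) pairs.
    """
    return [hw for hw, text in entries if TAMIL_PATTERN.search(text)]


# Made-up sample entries, only to show the matching behaviour.
sample = [
    ("hw1", "(X), Tamil; made-up gloss."),
    ("hw2", "(X), Sanskrit; made-up gloss."),
    ("hw3", "made-up gloss, Tamil; another made-up gloss."),
    ("hw4", "said to be Tamil."),
]
print(find_tamil_headwords(sample))
```

Note that the pattern only matches "Tamil" followed directly by a semicolon, so an entry that mentions Tamil in some other way (like `hw4` in the sample) would not be picked up.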