Closed drdhaval2785 closed 7 years ago
Yes, I agree. More than that - we have already compiled such a list for VCP and SKD. No final approval, but there is an .xls file made half a year ago. It's a big task to finalize it and no one is around to do it. The draft is mine and @Shalu411 had a look at it, but it's not a product we can use now.
OK. Share whatever you have. @funderburkjim can easily list ( ) occurrences in headwords.
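A minimal sketch of what listing "( )" occurrences in headwords could look like. It assumes the headwords are already available as plain strings; the function name `headwords_with_brackets` and the sample list are hypothetical, not part of any Cologne tooling.

```python
import re

# Matches one parenthesised group, including an empty "()".
PAREN = re.compile(r"\([^()]*\)")

def headwords_with_brackets(headwords):
    """Return only the headwords that contain a parenthesised part."""
    return [hw for hw in headwords if PAREN.search(hw)]

# Illustrative sample only; real input would come from a dictionary file.
sample = ["tfnPa(mPa)", "ajEkapAda", "rAma"]
print(headwords_with_brackets(sample))
```

The output of such a scan is just a worklist; deciding what each bracket means is a separate step.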
Sharing what we did with @Shalu411 half a year ago.
Source file: https://github.com/sanskrit-lexicon/VCP/blob/master/Vachaspatyam_b6_proof_1673-06-01-14.xlsx
List of the words resolved in the .xlsx file: https://github.com/sanskrit-lexicon/VCP/blob/master/Vachaspatyam_b6_proof_1673-06-01-14-headwords-with-brackets-only.txt
Some cases are still fishy. Nobody was able to approve or kill them.
Here's my two cents worth.
So, kept in deep freeze till we are OK with headwords corrections.
@drdhaval2785 re '...deep freeze...' That seems right to me
@funderburkjim and @gasyoun Now I see that correction submissions are relatively few: 100s instead of the 1000s earlier. Is it the proper time to handle this extremely important item?
Renaming this to 'Alternative headwords should get headword status' - Dropping VCP. It should be done in all dictionaries.
@drdhaval2785 it might be the time, but we can't be sure Jim agrees. In that case we would want the dhatus from PWK and PWG with upasarga combinations as ghost headwords as well. That's a hell of a huge topic. Do you really think we will be able to finish it? I do not think so. Because even in the Sanskrit-Sanskrit dictionaries... there have been words left that even @Shalu411 could not decide how to handle.
@drdhaval2785 From a first glance at this, I am not clear on the objective.
Could you elaborate on
That will help us (or me, at least) to understand whether the problem is now feasible.
Let me use the example of tfnPa / tfmPa mentioned in the first post of this topic to clarify the questions raised by @funderburkjim.
what you view the problem to be
The user who enters tfmPa as a query will not be able to land on the desired entry or page. He should.
why the problem is 'extremely important'
Data accessibility for the user. Data in a dictionary which cannot be retrieved by a user is as good as nonexistent data.
what a solution might be like?
In the present case, some programmatic logic can be applied which says that in tfnPa(mPa), the alternate headword is tfmPa. I know that it can be tricky with parentheses in the middle of a headword. But still, in 85% of cases, morphological similarities would make it amenable to programmatic handling.
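To make the tfnPa(mPa) case concrete, here is a rough sketch of one possible rule, not the logic actually used in the project. It assumes the bracketed part stands at the end of the headword and replaces an equally long tail of the base form (SLP1 uses one character per sound, so counting characters is at least plausible). The name `alternate_headword` is hypothetical, and anything that does not fit the pattern is returned as `None` for manual review.

```python
import re

# Base form followed by exactly one trailing parenthesised variant.
TRAILING = re.compile(r"^([^()]+)\(([^()]+)\)$")

def alternate_headword(entry):
    """Guess the alternate headword for entries like 'tfnPa(mPa)'.

    Assumption: the variant replaces the last len(variant) characters
    of the base. Returns None when the entry does not fit this pattern.
    """
    m = TRAILING.match(entry)
    if m is None:
        return None
    base, variant = m.groups()
    if len(variant) > len(base):
        return None
    return base[: len(base) - len(variant)] + variant

print(alternate_headword("tfnPa(mPa)"))  # tfmPa
```

Entries with parentheses in the middle of the headword fall through to `None` here, which matches the remark that those are the tricky cases.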
@funderburkjim the importance depends. There are about 5000 of such ghost-words that should be introduced.
85% cases, morphologic similarities would make it amenable to programmatic handling.
I agree with Dhaval. The work on Vacaspatyam (screenshot above), Sabdakalpadruma and Apte even has options verified by @Shalu411, so it's a question of how to submit/integrate. You should tell us which way to go.
I think it is feasible to work on this task now.
It might be useful to think of the task as having two parts.
alternates are ajEkapAda and ajEkapAd
, hhmm, @drdhaval2785 what's your take?
Should this even be thought of as a task within the scope of the Cologne Sanskrit Lexicon ?
- why not? We do not create new content. We extract what's already there inside.
finding a way to get rid of the x.txt-x.xml duality which currently exists for the dictionaries.
- does it bug you? It sure does not worry me.
This normalization is conceptually hard, because of the current intrinsic differences among the underlying dictionaries and their digitized form.
- the word hard is too soft. How about impossible?
applying current search engine technology to the Sanskrit Lexicon
- that's added value. What Dhaval speaks about is that we still have not reached a copy of where we were in 1850. We are not ready for full search, it will bring even more issues and because of that will not have much practical value. http://spokensanskrit.de/ is far more popular and has idiotic search, so it's not about the search. What would really matter would be a way to enter dhatus in different ways, orthographical peculiarities ignored and alternative forms presented - that's not quite Google, but will make more sense in my humble opinion.
I have already started working on the problem and program is improving. Should be online in github repository tomorrow. Using similar orthography, edit distance, known solutions etc for suggestion of alternate headwords. Also keeping ngrams as cross validation.
Results seem promising.
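As an illustration of the edit-distance idea only (the actual program in the repository may work quite differently), the sketch below computes plain Levenshtein distance and keeps candidates within a small threshold. The names `edit_distance` and `closest` and the `max_distance` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest(headword, candidates, max_distance=1):
    """Return (distance, candidate) pairs within max_distance, nearest first."""
    scored = [(edit_distance(headword, c), c) for c in candidates]
    return sorted((d, c) for d, c in scored if d <= max_distance)

print(closest("tfnPa", ["tfmPa", "rAma"]))
```

Both pairs mentioned in this thread, tfnPa / tfmPa and ajEkapAda / ajEkapAd, are at distance 1, so a low threshold would already surface them as candidates for review.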
@funderburkjim and @gasyoun https://github.com/sanskrit-lexicon/alternateheadwords is the repository dedicated to this stuff. Noting it here, for sake of record.
@drdhaval2785 Making a separate alternateheadwords repository is a good idea.
Now this documentation item has served its purpose. New repository will flourish dictionarywise as and when we upgrade alternate headwords or embedded headwords to headword / subheadword status. Closing this.
e.g. http://www.sanskrit-lexicon.uni-koeln.de/scans/VCPScan/2013/web/webtc/indexcaller.php?key=tfnPa&input=slp1&output=SktDevaUnicode
This is the snapshot.
It is obvious that tfmPa would have the same meaning. But there is no headword tfmPa, so the user will never land on this data if he enters 'tfmPa'. For such alternative readings - depicted by a bracket () - we should create another headword.
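One way to picture "creating another headword" is a lookup table that sends each alternate spelling to the entry it belongs to. This is only a sketch under the assumption that alternates are available as a mapping from primary headword to a list of alternates; `build_lookup` is a hypothetical name and says nothing about how the Cologne displays are actually implemented.

```python
def build_lookup(alternates):
    """Map every headword and alternate spelling to its primary headword.

    alternates: dict of primary headword -> list of alternate spellings.
    Real headwords are registered first so an alternate never hides one.
    """
    lookup = {primary: primary for primary in alternates}
    for primary, alts in alternates.items():
        for alt in alts:
            lookup.setdefault(alt, primary)
    return lookup

lookup = build_lookup({"tfnPa": ["tfmPa"]})
print(lookup.get("tfmPa"))  # tfnPa
```

With such a table, a query for tfmPa resolves to the tfnPa entry instead of returning nothing.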