sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

Alternative readings should get headword status #35

Closed by drdhaval2785 7 years ago

drdhaval2785 commented 9 years ago

e.g. http://www.sanskrit-lexicon.uni-koeln.de/scans/VCPScan/2013/web/webtc/indexcaller.php?key=tfnPa&input=slp1&output=SktDevaUnicode

[Screenshot: capture] This is the snapshot.

It is obvious that tfmPa would have the same meaning. But there is no headword tfmPa, so a user who enters 'tfmPa' will never land on this data. For such alternative readings, shown in parentheses ( ), we should create another headword.

gasyoun commented 9 years ago

Yes, I agree. More than that: we have already compiled such a list for VCP and SKD. It has no final approval, but there is an .xls file made half a year ago. Finalizing it is a big task and there is no one around to do it. The draft is mine and @Shalu411 had a look at it, but it's not a product we can use now.

drdhaval2785 commented 9 years ago

OK. Share whatever you have. @funderburkjim can easily list the ( ) occurrences in headwords.
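A minimal sketch of what such a listing could look like, assuming the headwords are already available as plain strings. The function name and the sample list are hypothetical; only tfnPa(mPa) comes from this thread.

```python
import re

# Matches one parenthesised group inside a headword, e.g. "(mPa)" in "tfnPa(mPa)".
PAREN = re.compile(r"\([^()]*\)")


def headwords_with_parens(headwords):
    """Return the headwords that contain at least one ( ) group."""
    return [hw for hw in headwords if PAREN.search(hw)]


# Hypothetical sample; a real run would read the headwords from the digitization.
sample = ["tfnPa(mPa)", "rAma", "deva"]
print(headwords_with_parens(sample))  # ['tfnPa(mPa)']
```

This only finds candidates; it does not try to interpret what the parenthesised text stands for.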

gasyoun commented 9 years ago

Sharing what we did with @Shalu411 half a year ago.

Source file: https://github.com/sanskrit-lexicon/VCP/blob/master/Vachaspatyam_b6_proof_1673-06-01-14.xlsx

List of the words resolved in the .xlsx file: https://github.com/sanskrit-lexicon/VCP/blob/master/Vachaspatyam_b6_proof_1673-06-01-14-headwords-with-brackets-only.txt

Some cases are still fishy. Nobody was able to approve or kill them.

funderburkjim commented 9 years ago

Here's my two cents worth.

  1. The parentheses appear in what I call the 'key2' form of the headword. In all the dictionaries but MW, the program hw0.py is responsible for identifying (in the digitization X.txt) the key2 form of the headword. Then, the program hw1.py is responsible for analyzing key2 and deducing 'key1'. Thus, it is hw1.py which is currently throwing away parenthetical alternates. And it seems plausible that making use of these alternates would be the responsibility of an enhanced hw1. As Marcis has discovered, interpreting alternate spellings can be tricky - that's mainly why I punted in the current hw1 and did not attempt the task. Another reason is that I am not clear regarding the appropriate data structures to represent alternate spellings; the solution that was used in MW was less than ideal, but it may be easier in the other dictionaries.
  2. In terms of priorities, it seems premature to tackle the task of alternate headword spellings until we are fairly sure that typographical errors in headwords are few.
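To make the key2 / key1 distinction concrete, here is a small sketch. It is not the actual hw1.py code: the `Headword` class, the `parse_key2` function and the `fragments` field are hypothetical names. It shows one possible way to keep the parenthetical text alongside key1 instead of throwing it away.

```python
import re
from dataclasses import dataclass, field

# One parenthesised group; the capture is the text inside the parentheses.
PAREN = re.compile(r"\(([^()]*)\)")


@dataclass
class Headword:
    key2: str  # headword as identified in the digitization, e.g. "tfnPa(mPa)"
    key1: str  # lookup key with the parenthetical parts removed, e.g. "tfnPa"
    fragments: list = field(default_factory=list)  # raw text found inside ( )


def parse_key2(key2):
    """Derive key1 from key2, keeping the parenthetical fragments for later use."""
    fragments = PAREN.findall(key2)
    key1 = PAREN.sub("", key2).strip()
    return Headword(key2=key2, key1=key1, fragments=fragments)


hw = parse_key2("tfnPa(mPa)")
print(hw.key1, hw.fragments)  # tfnPa ['mPa']
```

Storing the raw fragments leaves the tricky step, turning a fragment into a full alternate spelling, as a separate and later decision.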

drdhaval2785 commented 9 years ago

So, this is kept in deep freeze until we are OK with headword corrections.

funderburkjim commented 9 years ago

@drdhaval2785 re '...deep freeze...': That seems right to me.

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun Now I see that correction submissions are relatively few: hundreds instead of the thousands we had earlier. Is it the proper time to handle this extremely important item?

Renaming this to 'Alternative headwords should get headword status' - Dropping VCP. It should be done in all dictionaries.

gasyoun commented 8 years ago

@drdhaval2785 it might be the time, but we can't be sure Jim agrees. In that case we would want the dhatus from PWK and PWG with upasarga combinations as ghost headwords as well. That's a hugely big topic. Do you really think we will be able to finish it? I do not think so. Even in the Sanskrit-Sanskrit dictionaries there have been words left that even @Shalu411 could not decide how to handle.

funderburkjim commented 8 years ago

@drdhaval2785 From a first glance at this, I am not clear on the objective.
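Could you elaborate on

  1. what you view the problem to be
  2. why the problem is 'extremely important'
  3. what a solution might be like?

That will help us (or me, at least) to understand whether the problem is now feasible.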

drdhaval2785 commented 8 years ago

Let me use the example of tfnPa / tfmPa mentioned in the first post of this topic to clarify the questions raised by @funderburkjim.

what you view the problem to be

The user who enters tfmPa as a query will not be able to land on the desired entry or page. He should.

why the problem is 'extremely important'

Data accessibility for the user. Data in a dictionary which cannot be retrieved by a user is as good as nonexistent data.

what a solution might be like?

In the present case, some programmatic logic can be applied which says that in tfnPa(mPa), the alternate headword is tfmPa. I know that it can be tricky when the parentheses occur in the middle of a headword. But still, in 85% of cases, morphological similarities would make it amenable to programmatic handling.
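A rough sketch of the simplest such rule, under the assumption that a trailing parenthetical fragment replaces an equally long tail of the primary spelling. The function name `expand_alternate` is hypothetical, and the rule is only a first guess, not a verified treatment of every entry. Because the headwords are in SLP1, where each sound is written with one character, counting characters is the same as counting sounds.

```python
import re

# Only handles the easy shape: primary spelling followed by one trailing ( ) group.
TRAILING = re.compile(r"^([^()]+)\(([^()]+)\)$")


def expand_alternate(key2):
    """Guess the alternate headword for 'primary(fragment)', or return None.

    Assumption: the fragment replaces the last len(fragment) characters
    of the primary spelling.
    """
    match = TRAILING.match(key2)
    if match is None:
        return None  # parentheses in the middle, several groups, or none at all
    primary, fragment = match.groups()
    if len(fragment) > len(primary):
        return None  # fragment cannot be a replacement tail
    return primary[: len(primary) - len(fragment)] + fragment


print(expand_alternate("tfnPa(mPa)"))  # tfmPa
```

Anything the rule declines (a `None` result) would go to a list for manual review rather than being guessed at.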

gasyoun commented 8 years ago

@funderburkjim the importance depends. There are about 5000 such ghost-words that should be introduced.

[Screenshot: vach]

"In 85% of cases, morphological similarities would make it amenable to programmatic handling." I agree with Dhaval. For Vacaspatyam (screenshot above), Sabdakalpadruma and Apte, the options have even been verified by @Shalu411, so it's a question of how to submit/integrate them. You should tell us which way to go.

funderburkjim commented 8 years ago

I think it is feasible to work on this task now.

It might be useful to think of the task as having two parts.

gasyoun commented 8 years ago

drdhaval2785 commented 8 years ago

I have already started working on the problem and the program is improving. It should be online in a GitHub repository tomorrow. It uses similar orthography, edit distance, known solutions, etc. to suggest alternate headwords, and keeps n-grams as cross-validation.

Results seem promising.
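For illustration only, here is a sketch of two of the checks mentioned above: a plain Levenshtein edit distance between the primary and the suggested spelling, and a character-bigram check against known headwords. This is not the code from the repository; the function names and the sample headword list are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigrams(word):
    """Set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def unseen_bigrams(candidate, known_headwords):
    """Bigrams of the candidate that occur in no known headword."""
    seen = set()
    for hw in known_headwords:
        seen |= bigrams(hw)
    return bigrams(candidate) - seen


print(edit_distance("tfnPa", "tfmPa"))  # 1
# Hypothetical known-headword list; a real run would use the full headword file.
print(sorted(unseen_bigrams("tfmPa", ["tfnPa", "kampa"])))  # ['fm', 'mP']
```

A small edit distance supports a suggested alternate, while bigrams never seen in any known headword mark a suggestion for a closer look.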

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun https://github.com/sanskrit-lexicon/alternateheadwords is the repository dedicated to this work. Noting it here for the record.

funderburkjim commented 8 years ago

@drdhaval2785 Making a separate alternateheadwords repository is a good idea.

drdhaval2785 commented 7 years ago

This documentation item has now served its purpose. The new repository will grow dictionary by dictionary as and when we upgrade alternate headwords or embedded headwords to headword / subheadword status. Closing this.