Alternative readings should get headword status

drdhaval2785 commented 9 years ago

e.g. http://www.sanskrit-lexicon.uni-koeln.de/scans/VCPScan/2013/web/webtc/indexcaller.php?key=tfnPa&input=slp1&output=SktDevaUnicode

[image: capture] This is the snapshot.

It is obvious that tfmPa would have the same meaning. But there is no headword tfmPa, so a user who enters 'tfmPa' will never land on this data. For such alternative readings, marked by parentheses ( ), we should create another headword.

gasyoun commented 9 years ago

Yes, I agree. More than that - we have already compiled such a list for VCP and SKD. No final approval, but there is an .xls file made half a year ago. It's a big task to finalize it and there is no one around to do it. The draft is mine and @Shalu411 had a look at it, but it's not a product we can use now.

drdhaval2785 commented 9 years ago

OK. Share whatever you have. @funderburkjim can easily list the ( ) occurrences in headwords.

gasyoun commented 9 years ago

Sharing what we did with @Shalu411 half a year ago.
Source file: https://github.com/sanskrit-lexicon/VCP/blob/master/Vachaspatyam_b6_proof_1673-06-01-14.xlsx
List of the words resolved in the .xlsx file: https://github.com/sanskrit-lexicon/VCP/blob/master/Vachaspatyam_b6_proof_1673-06-01-14-headwords-with-brackets-only.txt
Some cases are still fishy. Nobody was able to approve or kill them.

funderburkjim commented 9 years ago

Here's my two cents worth.

The parentheses appear in what I call the 'key2' form of the headword. In all the dictionaries but MW, the program hw0.py is responsible for identifying (in the digitization X.txt) the key2 form of the headword. Then, the program hw1.py is responsible for analyzing key2 and deducing 'key1'. Thus, it is hw1.py which is currently throwing away parenthetical alternates. And it seems plausible that making use of these alternates would be the responsibility of an enhanced hw1. As Marcis has discovered, interpreting alternate spellings can be tricky - that's mainly why I punted in the current hw1 and did not attempt the task. Another reason is that I am not clear regarding the appropriate data structures to represent alternate spellings; the solution that was used in MW was less than ideal, but it may be easier in the other dictionaries.
In terms of priorities, it seems premature to tackle the task of alternate headword spellings until we are fairly sure that typographical errors in headwords are few.
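To make the key2/key1 relationship concrete, here is a minimal sketch of the step that currently discards the parenthetical. It is not the actual hw1.py code; the function name `split_key2` and the regular expression are assumptions made only for illustration. It shows that the alternate material is still available at the moment key1 is derived, so an enhanced hw1 could keep it instead of dropping it.

```python
import re

# Hypothetical helper, NOT the real hw1.py logic: split a key2-style
# headword into a plain key1 and the parenthetical pieces it carries.
PAREN = re.compile(r"\(([^()]*)\)")


def split_key2(key2):
    """Return (key1, parentheticals) for a key2 such as 'tfnPa(mPa)'."""
    parentheticals = PAREN.findall(key2)   # text inside each ( )
    key1 = PAREN.sub("", key2).strip()     # headword with ( ) removed
    return key1, parentheticals


print(split_key2("tfnPa(mPa)"))  # ('tfnPa', ['mPa'])
```

How each parenthetical should be turned into a full alternate spelling is the tricky part and is deliberately left out of this sketch.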

drdhaval2785 commented 9 years ago

So, kept in deep freeze till we are OK with headword corrections.

funderburkjim commented 9 years ago

@drdhaval2785 re '...deep freeze...' That seems right to me

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun Now I see that correction submissions are relatively few: 100s instead of the 1000s earlier. Is this the proper time to handle this extremely important item?

Renaming this to 'Alternative headwords should get headword status' - Dropping VCP. It should be done in all dictionaries.

gasyoun commented 8 years ago

@drdhaval2785 it might be the time, but we can't be sure Jim agrees. In that case we would want the dhatus from PWK and PWG with upasarga combinations as ghost headwords as well. That's a hell of a huge topic. Do you really think we will be able to finish it? I do not think so. Because even in the Sanskrit-Sanskrit dictionaries... there have been words left where even @Shalu411 could not decide how to go.

funderburkjim commented 8 years ago

@drdhaval2785 From a first glance at this, I am not clear on the objective.
Could you elaborate on

what you view the problem to be,
why the problem is 'extremely important', and
what a solution might be like?

That will help us (or me, at least) to understand whether the problem is now feasible.

drdhaval2785 commented 8 years ago

Let me use the example of tfnPa / tfmPa mentioned in the first post of this topic to clarify the questions raised by @funderburkjim.

what you view the problem to be

The user who enters tfmPa as a query will not be able to land on the desired entry or page. He should.

why the problem is 'extremely important'

Data accessibility for the user. Data in a dictionary which cannot be retrieved by a user is as good as non-existent data.

what a solution might be like?

In the present case, some programmatic logic can be applied which says that in tfnPa(mPa), the alternate headword is tfmPa. I know that it can be tricky when the parentheses sit in the middle of a headword. But still, in 85% of cases, morphologic similarities would make it amenable to programmatic handling.
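A minimal sketch of the kind of logic meant for the tfnPa(mPa) case follows. The rule is an assumption for illustration only: a trailing parenthetical replaces an equally long tail of the headword, and the guess is accepted only when the two tails end in the same letter. The name `tail_alternate` is hypothetical. Anything that does not fit the rule returns None and is left for manual review.

```python
import re

# Matches only a headword followed by ONE trailing parenthetical.
KEY2 = re.compile(r"^([^()]+)\(([^()]+)\)$")


def tail_alternate(key2):
    """Guess the alternate headword for 'head(alt)', or return None.

    Assumed rule: 'alt' replaces the last len(alt) letters of 'head',
    and is trusted only if both tails share their final letter.
    """
    m = KEY2.match(key2)
    if not m:
        return None                      # parentheses mid-word, or none at all
    head, alt = m.groups()
    if len(alt) >= len(head):
        return None                      # parenthetical is not a partial tail
    tail = head[-len(alt):]
    if tail == alt or tail[-1] != alt[-1]:
        return None                      # too ambiguous for this simple rule
    return head[:-len(alt)] + alt


print(tail_alternate("tfnPa(mPa)"))  # tfmPa
```

Being conservative here is a design choice: a wrong ghost headword is worse than a missing one, so the sketch prefers to skip doubtful cases.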

gasyoun commented 8 years ago

@funderburkjim the importance depends. There are about 5000 such ghost words that should be introduced.

[image: vach]

'85% cases, morphologic similarities would make it amenable to programmatic handling' - I agree with Dhaval. The work on Vacaspatyam (screenshot above), Sabdakalpadruma and Apte even has options verified by @Shalu411, so it's a question of how to submit/integrate them. You should tell us which way to go.

funderburkjim commented 8 years ago

I think it is feasible to work on this task now.

It might be useful to think of the task as having two parts.

STEP 1. Extracting (text mining) the alternate spellings of headwords from individual dictionaries.
I see the work exemplified in the Vacaspatyam being a part of this work. Although at first glance numerous questions (1) come to mind, nevertheless, it looks like a reasonable starting point for the extraction task for VCP .
- (1) sample question: for ajEkapAda(d), I would think the alternates are ajEkapAda and ajEkapAd
STEP 2. Integrating the results of step 1 into the datasets and displays of the Cologne lexicon. The details of this step are unclear to me at this time. Thus, I view this as a conceptually large step. Here are some thoughts that come to mind:
- Should this even be thought of as a task within the scope of the Cologne Sanskrit Lexicon ? Or should this be thought of as a first step in the development of a downstream project which depends on the Sanskrit Lexicon ?
- In either case, we need to think about normalizing the current Sanskrit Lexicon. What does normalizing mean here?
  - aiming toward formal similarity in the various digitization text files (pw.txt, skd.txt, etc.)
  - similarly for the various xml derivatives (pw.xml, skd.xml, mw.xml)
  - maybe finding a way to get rid of the x.txt-x.xml duality which currently exists for the dictionaries.
- If all the dictionaries had the same xml structure (with a common dtd), then we could write programs that would be applicable to any dictionary, rather than having to tailor programs to the quirks of individual dictionaries.
- This normalization is conceptually hard, because of the current intrinsic differences among the underlying dictionaries and their digitized form.
- I am intrigued by the possibility of applying current search engine technology to the Sanskrit Lexicon. The most accessible of these seems to be Elasticsearch (based on Lucene). I suspect that an Elasticsearch instance for the Sanskrit Lexicon would NOT be hosted at Cologne, because of its dependence on a Java-based server. But this is not an insurmountable obstacle.
- The main reason for bringing search engine technology into the picture is that there is a huge software base that could be brought to bear on our general interest in making the digitized Sanskrit dictionaries accessible. This software is much more sophisticated than the sqlite relational database software that we currently have at Cologne Sanskrit Lexicon. Such software already has the ability to handle multiple dictionaries, multiple search terms, and full text search. There is a well-developed query language. In short, there is a lot of potential that would be, I think, more difficult to develop 'in-house'. To make use of this, our task would be to develop software that constructs documents from our existing digitized corpus, and software that generates displays from the search indexes generated by those documents.
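As a rough illustration of what 'constructing documents' could mean for alternate headwords, here is a small sketch. It does not use Elasticsearch or any real search engine API; the record layout, the field names (`dict`, `key1`, `keys`) and the in-memory index are all assumptions. The ajEkapAd alternate follows the reading proposed in sample question (1) and is not a confirmed result.

```python
from collections import defaultdict

# Hypothetical records: (dictionary code, key1, alternate spellings).
RECORDS = [
    ("vcp", "tfnPa", ["tfmPa"]),
    ("vcp", "ajEkapAda", ["ajEkapAd"]),   # unconfirmed reading of ajEkapAda(d)
]


def build_documents(records):
    """One document per entry; 'keys' holds key1 plus every alternate."""
    docs = []
    for doc_id, (dict_code, key1, alternates) in enumerate(records):
        docs.append({
            "id": doc_id,
            "dict": dict_code,
            "key1": key1,
            "keys": [key1] + list(alternates),
        })
    return docs


def build_index(docs):
    """Map every searchable key to the ids of the documents carrying it."""
    index = defaultdict(list)
    for doc in docs:
        for key in doc["keys"]:
            index[key].append(doc["id"])
    return index


def lookup(index, docs, query):
    """Return the key1 of every entry reachable from the query."""
    return [docs[i]["key1"] for i in index.get(query, [])]


docs = build_documents(RECORDS)
index = build_index(docs)
print(lookup(index, docs, "tfmPa"))  # ['tfnPa']
```

The point of the `keys` field is that the alternate never has to become a separate entry: the query tfmPa simply resolves to the existing tfnPa record.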

gasyoun commented 8 years ago

'alternates are ajEkapAda and ajEkapAd' - hhmm, @drdhaval2785 what's your take?
'Should this even be thought of as a task within the scope of the Cologne Sanskrit Lexicon ?' - why not? We do not create new content. We extract what's already there inside.
'finding a way to get rid of the x.txt-x.xml duality which currently exists for the dictionaries.' - does it bug you? It sure does not worry me.
'This normalization is conceptually hard, because of the current intrinsic differences among the underlying dictionaries and their digitized form.' - the word hard is too soft. How about impossible?
'applying current search engine technology to the Sanskrit Lexicon' - that's added value. What Dhaval speaks about is that we still have not reached a copy of where we were in 1850. We are not ready for full search; it will bring even more issues and because of that will not have much practical value. http://spokensanskrit.de/ is far more popular and has idiotic search, so it's not about the search. What would really matter would be a way to enter dhatus in different ways, with orthographical peculiarities ignored and alternative forms presented - that's not quite Google, but it will make more sense in my humble opinion.

drdhaval2785 commented 8 years ago

I have already started working on the problem and the program is improving. It should be online in a github repository tomorrow. It uses similar orthography, edit distance, known solutions etc. to suggest alternate headwords, and keeps ngrams as cross validation.
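To give an idea of the edit distance part only, here is a minimal sketch. It is not the code from the repository; the function names and the `max_distance` threshold are assumptions. It ranks existing headwords by plain Levenshtein distance from a candidate spelling, so a guessed alternate can be checked against what is already in the headword list.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(query, headwords, max_distance=1):
    """Return (headword, distance) pairs within max_distance, closest first."""
    scored = [(hw, levenshtein(query, hw)) for hw in headwords]
    close = [(hw, d) for hw, d in scored if d <= max_distance]
    return sorted(close, key=lambda pair: (pair[1], pair[0]))


print(suggest("tfmPa", ["tfnPa", "tfRa", "ajEkapAda"]))  # [('tfnPa', 1)]
```

Edit distance alone will over-suggest on short words, which is why it is only one signal among several.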

Results seem promising.

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun https://github.com/sanskrit-lexicon/alternateheadwords is the repository dedicated to this stuff. Noting it here, for the sake of record.

funderburkjim commented 8 years ago

@drdhaval2785 Making a separate alternateheadwords repository is a good idea.

drdhaval2785 commented 7 years ago

Now this documentation item has served its purpose. New repository will flourish dictionarywise as and when we upgrade alternate headwords or embedded headwords to headword / subheadword status. Closing this.

sanskrit-lexicon / CORRECTIONS

Alternative readings should get headword status #35