Closed gasyoun closed 3 years ago
@gasyoun Often, I don't understand your questions/suggestions. But, maybe, in this case I have a glimmer of what's on your mind.
When you mention 'reverse Sanskrit dictionary', I understand you to mean something similar to what the Lucene search engine documentation calls an 'inverted index'. Let me expand on this a bit.
In the context of the Cologne Sanskrit-Lexicon corpus, we can think of a Lucene 'document' as the headword entry for some headword in some dictionary. To be concrete, a document consists of the text that is displayed by the Basic Display for a given headword HW in dictionary D.
Now, we could further think of one of these so-called 'documents' as consisting of a sequence of words and abbreviations. Let's forget about the abbreviations for the moment, and just consider the words. Then, according to my understanding of your term 'reverse dictionary', we could say that a reverse dictionary for the Cologne corpus would consist of a list of the form:
This would be very closely related to the 'inverted indices' of Lucene (and other information retrieval systems).
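To make the comparison concrete, here is a minimal sketch of a word-level inverted index over such 'documents'. It is not Lucene and not the Cologne code; the sample texts and the names `documents`, `build_inverted_index` and `format_index` are hypothetical, and the tokenizer is a plain whitespace split.

```python
from collections import defaultdict

# Hypothetical sample data: (dictionary, headword) -> entry text.
# The texts are invented for illustration, not taken from the Cologne corpus.
documents = {
    ("D1", "HW1"): "alpha beta",
    ("D2", "HW2"): "beta gamma",
}


def build_inverted_index(docs):
    """Map each word to the sorted (headword, dictionary) pairs whose entry contains it."""
    index = defaultdict(set)
    for (dictionary, headword), text in docs.items():
        for word in text.split():  # naive whitespace tokenizer (assumption)
            index[word].add((headword, dictionary))
    return {word: sorted(refs) for word, refs in index.items()}


def format_index(index):
    """Render the index as lines of the form 'X, see HW in D'."""
    lines = []
    for word in sorted(index):
        for headword, dictionary in index[word]:
            lines.append(f"{word}, see {headword} in {dictionary}")
    return lines


inverted = build_inverted_index(documents)
index_lines = format_index(inverted)
for line in index_lines:
    print(line)
```

A real run over the corpus would need a tokenizer that skips abbreviations and markup, which this sketch does not attempt.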
Does this abstract description sound like we're thinking along the same lines?
I second Jim's observation: I often don't understand Marcis's observations either.... :)
@funderburkjim and @drdhaval2785 I must apologize, but believe me, I try to be as clear as I can.
As per https://en.wikipedia.org/wiki/Inverted_index, what I meant was neither a "word level inverted index" nor even a "record level inverted index", although the supplements, one of which was described above (Greek words in Sanskrit dictionaries), are equivalent to a "record level inverted index". As stated on Wikipedia: "In pre-computer times, concordances to important books were manually assembled. These were effectively inverted indexes with a small amount of accompanying commentary that required a tremendous amount of effort to produce."
The 1200+ page reference book (draft file https://yadi.sk/i/UZh1aIJMWd22K) that I have been working on for the last 2 years can be understood by reading the Preface at http://yadi.sk/d/nAeIdM6NFTcLY. If my English were better there would be a German and an English foreword, but as it's not, I'm not so sure.
What you describe "according to my understanding of your term 'reverse dictionary'" is true regarding the supplements: the Greek, Hebrew, Arabic words in Sanskrit dictionaries.
"Does this abstract description sound like we're thinking along the same lines?" - it does for the supplement part, and in this topic I wanted to speak exactly and only about supplements. They are possible only if you like the idea and are willing to help me.
@gasyoun Your comment is a good beginning in explaining your idea. In order for me to decide whether I "like the idea and have the willingness to help", I'll have to understand the idea.
This will require several rounds of me asking you questions and you explaining.
It might be that this is better done in some other issue than this one -- your choice.
After a preliminary comment or two to set the context, below are my first batch of followup questions.
I downloaded both the pdfs you linked.
The pages in your reverse-250026-itrans.pdf have a form similar to page 908 of Schwarz-intro-dhatu-1974.pdf.
Let's start with column 1 of page 1, the first few lines of which are:
a
I ajimḥa
H acintia
P india
N utsaṅg3a
ā
It appears that each 'entry' has one of two forms:
Questions:
india
example. PWG is chosen as the default; everything else has a grammatical markup (a single capital letter before the word). If some word is only in MW and not in PWG, it has the additional letter. See http://pastebin.com/6kJu3BzcOk.
Good explanation. Here are a couple of details I learned by using the pastebin abbreviation key with the few examples above.
ARE THE ABOVE TWO POINTS ON TARGET WITH REGARD TO YOUR PROJECT?
Here are more questions:
That's enough questions for now.
I wish I could work with all words. But headwords are the cleanest, so I stay with them. As per "it appears that you are not interested in ALL the dictionaries which have a given headword." - indeed, if PWG has it, I do not care if MW has it as well, all the more since MW has copy-pasted so many entries from PWG.
I have a list of priorities. PWG is above all. MW is 2nd on the list. If MW has a word that PWG does not, I'll note it with an M tag. If PWK has it as well, I do not care, because PWK is 3rd in the priority list. The supplement list would deal with the text inside the articles, such as the Greek and Arabic words.
Both points are correct. I exclude English-Sanskrit dictionaries. Everything from Cologne that is not English-Sanskrit could be used. Any other source available as a file is used as well. Nothing hyper-scientific about the sources. Things like X, Y word types will be in the supplements. They will contain different kinds of stats and lists. The foreign language lists will be possible if we get Jim on our side.
So "if word 'W' does not appear in PWG, but appears in more than one other dictionary in your pastebin list, how does it appear in the draft of your book" is not related to PWG. Supplements will be made for every dictionary where non-Sanskrit etymologies are marked. So PWG or no PWG, it does not matter in the supplements. Hope I made myself clear, thanks.
Re "I have a list of priorities." Good - If you provided a list of dictionaries in order of priority, then it might be possible to generate your 'itrans' data via a program. This list might be a list of pairs, like [('PWG',''),('MW','M'),....] where (at least for Cologne dictionaries) :
I don't know whether such programmatic generation is of interest to you.
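As a rough illustration of what such programmatic generation might look like, here is a minimal sketch that assigns each headword the tag of the highest-priority dictionary containing it. Only the pairs `('PWG','')` and `('MW','M')` come from this discussion; the PWK tag `'K'`, the sample headwords and the function name `tag_headwords` are placeholders, and the real tags would have to follow the pastebin abbreviation key.

```python
# Priority list as (dictionary code, tag) pairs, highest priority first.
# Only the first two pairs appear in the discussion; the "K" tag for PWK
# is a placeholder assumption.
PRIORITIES = [("PWG", ""), ("MW", "M"), ("PWK", "K")]


def tag_headwords(headwords_by_dict, priorities=PRIORITIES):
    """Return {headword: tag}, using the first dictionary in priority order that has it."""
    tags = {}
    for code, tag in priorities:
        for headword in headwords_by_dict.get(code, ()):
            # setdefault keeps the tag from the higher-priority dictionary
            tags.setdefault(headword, tag)
    return tags


# Hypothetical headword sets, invented for illustration.
sample = {
    "PWG": {"hw_a", "hw_b"},
    "MW": {"hw_b", "hw_c"},
    "PWK": {"hw_c", "hw_d"},
}
tags = tag_headwords(sample)
for headword in sorted(tags):
    print(f"{tags[headword]} {headword}".strip())
```

A headword found in PWG gets the empty tag regardless of where else it occurs, which matches the stated rule that lower-priority dictionaries are ignored once a higher one has the word.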
Here are two more questions:
The generation of the main code is not the tricky part now. The supplements are. Markup of spelling differences is ignored. Both are represented without cross-linking. The logic is called reverse ordering. It's alphabetical, but from the end. Does that make any sense to you? Thanks for your questions.
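For readers unfamiliar with the term, 'alphabetical, but from the end' can be sketched as sorting by the reversed spelling of each word. This is only an illustration under simplifying assumptions: it compares plain Unicode code points rather than Sanskrit alphabetical order, and it treats every character as one letter, which holds only for a one-character-per-sound transliteration. The function names and the words `deva` and `veda` are added for the example.

```python
def reverse_key(word):
    """Sort key: the word spelled backwards, so comparison starts at the last letter."""
    return word[::-1]


def reverse_sort(words):
    """Order words alphabetically from the end (code-point order, not Sanskrit collation)."""
    return sorted(words, key=reverse_key)


words = ["india", "acintia", "deva", "veda"]
ordered = reverse_sort(words)
for word in ordered:
    # Right-align so that the shared endings line up, as in a printed reverse dictionary.
    print(word.rjust(10))
```

All four words end in 'a', so the order is decided by the second-to-last letter and onward: 'veda' (d), then 'india' and 'acintia' (both i, then d before t), then 'deva' (v).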
Regarding the 'reverse ordering'.
Was I not clear enough, @funderburkjim ?
There have been so many postings recently, that this one has fallen off my radar.
From your last question, I gather that you were expecting some action from me here. Please remind me.
@funderburkjim actually you have already formulated everything. I badly need your help with the task that you reformulated half a year ago at https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/17#issuecomment-13081687
X1, see HW1 in D1
X2, see HW2 in D2
etc.
where X1 is a word that appears under headword document HW1 in dictionary D1.
So that would give a list of all Arabic, Greek and Latin words in at least MW.
@gasyoun For Greek words, this Perseus Links document has what you are looking for, for MW.
"For Greek words" - indeed. @funderburkjim what would be required for Arabic, for example?
The required word list was generated. Arabic was handled in a separate repository. Closing.
@funderburkjim,
I'm asking as per your understanding. Is it possible at this stage to get Unicode lists of words from different languages marked with the tags (like https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/15#issuecomment-130130434)? For example, to list all Arabic words from PWG, or all Greek words from MW or MW72? @jlreeder and @jmigliori what is your opinion? The output I'm looking for is:
Greek in MW72
κωφήν, see kuBA
I'm thinking of adding such indexes at the end of the Reverse Sanskrit dictionary to be printed in November :fallen_leaf: in Moscow. The question is whether they will be available within 2 months for at least some dictionaries in some languages. I would like to take up Lithuanian for MW and MW72, but they are pre-Unicode now, and first I would ask you to get rid of the Anglicized Sanskrit. Need help? The character combinations with the numbers remain the same.