non-Sanskrit Word List

gasyoun commented 8 years ago

@funderburkjim,

I'm asking for your assessment. Is it possible at this stage to get Unicode lists of words from different languages that are marked with tags (like https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/15#issuecomment-130130434)? For example, to list all Arabic words from PWG, or all Greek words from MW or MW72? @jlreeder and @jmigliori, what is your opinion? The output I'm looking for is:

Greek in MW72

κοντός, see kunta
κυφός, see kubja
κωφήν, see kuBA

I'm thinking of adding such indexes at the end of the Reverse Sanskrit dictionary, to be printed in November :fallen_leaf: in Moscow. The question is whether they can be ready within 2 months for at least some dictionaries in some languages. I would like to take up Lithuanian for MW and MW72, but they are pre-Unicode for now, so first I would ask you to get rid of the Anglicized Sanskrit. Do you need help? The character combinations with the numbers remain the same.

funderburkjim commented 8 years ago

@gasyoun Often, I don't understand your questions/suggestions. But, maybe, in this case I have a glimmer of what's on your mind.

When you mention 'reverse Sanskrit dictionary', I understand you to mean something similar to what the Lucene search engine documentation calls an 'inverted index'. Let me expand on this a bit.

In the context of the Cologne Sanskrit-Lexicon corpus, we can think of a Lucene 'document' as the headword entry for some headword in some dictionary. To be concrete, a document is comprised of the text that is displayed by the Basic Display for a given headword HW in dictionary D.

Now, we could further think of one of these so-called 'documents' as consisting of a sequence of words and abbreviations. Let's forget about the abbreviations for the moment, and just consider the words. Then, according to my understanding of your term 'reverse dictionary', we could say that a reverse dictionary for the Cologne corpus would consist of a list of the form:

X1, see HW1 in D1
X2, see HW2 in D2
etc., where X1 is a word that appears under headword document HW1 in dictionary D1.

This would be very closely related to the 'inverted indices' of Lucene (and other information retrieval systems).
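To make the abstract description concrete, here is a minimal sketch of such a record-level inverted index in Python. It does not use Lucene, and the sample entry texts are invented for illustration (they are not Cologne data); only the 'X, see HW in D' output shape follows the description above.

```python
from collections import defaultdict

# Hypothetical "documents": (dictionary code, headword) -> entry text.
# The entry texts are invented placeholders, not actual dictionary content.
documents = {
    ("MW72", "kunta"): "a lance; cf. κοντός",
    ("MW72", "kubja"): "crooked; cf. κυφός",
    ("MW", "kunta"): "a lance",
}

def build_inverted_index(docs):
    """Map each word X to the sorted list of (HW, D) documents containing it."""
    index = defaultdict(set)
    for (dictionary, headword), text in docs.items():
        for raw in text.split():
            word = raw.strip(".,;:()").lower()
            if word:
                index[word].add((headword, dictionary))
    return {word: sorted(hits) for word, hits in index.items()}

index = build_inverted_index(documents)
for word in sorted(index):
    for headword, dictionary in index[word]:
        print(f"{word}, see {headword} in {dictionary}")
```

A real index over the corpus would also need a tokenizer that understands the dictionaries' markup and abbreviations; the whitespace split here is only a stand-in.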

Does this abstract description sound like we're thinking along the same lines?

drdhaval2785 commented 8 years ago

I second Jim's observation: I often don't understand Marcis's observations either.... :)

gasyoun commented 8 years ago

@funderburkjim and @drdhaval2785 I must apologize, but believe me, I try to be as clear as I can. As per https://en.wikipedia.org/wiki/Inverted_index, what I meant was not a word-level inverted index, nor even a record-level inverted index. However, the supplements, one of which was described above (Greek words in Sanskrit dictionaries), are equivalent to a record-level inverted index. As stated on W: "In pre-computer times, concordances to important books were manually assembled. These were effectively inverted indexes with a small amount of accompanying commentary that required a tremendous amount of effort to produce." The 1200+ page reference book (draft file https://yadi.sk/i/UZh1aIJMWd22K) that I have been working on for the last 2 years can be understood by reading the Preface (http://yadi.sk/d/nAeIdM6NFTcLY). If my English were better, there would be a German and English foreword, but as it's not, I'm not so sure. What you describe "according to my understanding of your term 'reverse dictionary'" is true regarding the supplements: the Greek, Hebrew, Arabic words in Sanskrit dictionaries. "Does this abstract description sound like we're thinking along the same lines?" It does for the supplement part, and in this topic I wanted to speak exactly and only about supplements. Whether they are possible depends only on whether you like the idea and are willing to help me.

funderburkjim commented 8 years ago

@gasyoun Your comment is a good beginning in explaining your idea. In order for me to decide whether I like the idea and have the willingness to help, I'll have to understand the idea.

This will require several rounds of me asking you questions and you explaining.

It might be that this is better done in some other issue than this one -- your choice.

After a preliminary comment or two to set the context, below is my first batch of follow-up questions.

I downloaded both the pdfs you linked.

The pages in your reverse-250026-itrans.pdf have a form similar to page 908 of the Schwarz-intro-dhatu-1974.pdf.

Let's start with column 1 of page 1, the first few lines of which are:

a
I ajimḥa
H acintia
P india
N utsaṅg3a
ā

It appears that each 'entry' has one of two forms:

one word
a Capital Letter + a word

Questions:

1. What is the source of the words? Most appear to be Sanskrit in IAST, but some (like 'india') are English. How are the words chosen?
2. In the case of the second form (X word), what is the meaning of the X (e.g., I, H, P, N)?
3. In the case of the first form (word, no X), why is there no X?

gasyoun commented 8 years ago

1. The source is Cologne. Some Sanskrit words are actually Prakrit, like the india example. PWG is chosen as the default; everything else has a grammatical markup (a single capital letter before the word). If a word is only in MW and not in PWG, it has the additional letter. See http://pastebin.com/6kJu3Bzc
2. It is the name of the dictionary, usually its first letter. So your MW is equal to my M.
3. It's because the word is in PWG, and PWG is the primary source. A similar principle is used in Schwarz. I was unaware of his principles and chose PWG on my own. I was surprised to find out that he had done the same.

funderburkjim commented 8 years ago

Ok. Good explanation. Here are a couple of details I learn by using the pastebin abbreviation key with the few examples above.

The words are headwords. This is a place where my understanding of your idea was off the mark. When I was discussing inverted indices, I was thinking in terms of an index of ALL WORDS in the dictionary entries. You're doing something JUST WITH HEADWORDS.
Also, it appears that you are not interested in ALL the dictionaries which have a given headword. For instance, the headword 'a' in the examples occurs in PWG (we know that because the absence of a dictionary letter in front of 'a' means exactly that the word is a PWG headword). Of course, many dictionaries besides PWG have 'a' as a headword, but the list of those additional dictionaries apparently does not matter to your investigation.

ARE THE ABOVE TWO POINTS ON TARGET WITH REGARD TO YOUR PROJECT?

Here are more questions:

Consider the example 'P india'. The 'P' tells us that 'india' occurs as a headword in the Puranic Encyclopedia. Question: However, 'india' is also a headword in MWE (English-Sanskrit). Are you excluding the English-Sanskrit dictionaries? (I don't see them in your pastebin list)
You have a few non-Cologne dictionaries in your pastebin list, and a few Cologne dictionaries are not in your pastebin list. How did you choose the list of dictionaries?
Are there any entries 'X,Y word' in your list? (i.e., 'word' occurs as a headword in dictionaries X and Y, but not in PWG?)
Relatedly, if word 'W' does not appear in PWG, but appears in more than one other dictionary in your pastebin list, how does it appear in the draft of your book?

That's enough questions for now.

gasyoun commented 8 years ago

I wish I could work with all words. But headwords are the cleanest, so I stick with them. As for "it appears that you are not interested in ALL the dictionaries which have a given headword": indeed, if PWG has it, I do not care whether MW has it as well, all the more so since MW has copy-pasted so many entries from PWG. I have a list of priorities. PWG is above all. MW is 2nd on the list. If MW has a word PWG does not, I'll note it with an M tag. If PWK has it as well, I do not care, because PWK is 3rd in the priority list. The supplement lists would deal with the text inside the articles, such as the Greek and Arabic words. Both points are correct.

I exclude English-Sanskrit dictionaries. Everything from Cologne that is not English-Sanskrit could be used. Any other source available as a file is used as well. Nothing hyper-scientific about the sources. 'X,Y word' types of things will be in the supplements, which will contain different kinds of stats and lists. The foreign language lists will be possible if we get Jim on our side. So the question "if word 'W' does not appear in PWG, but appears in more than one other dictionary in your pastebin list, how does it appear in the draft of your book" is not related to PWG. Supplements will be made for every dictionary where non-Sanskrit etymologies are marked. So PWG or no PWG does not matter in the supplements. Hope I made myself clear, thanks.

funderburkjim commented 8 years ago

Re "I have a list of priorities." Good - If you provided a list of dictionaries in order of priority, then it might be possible to generate your 'itrans' data via a program. This list might be a list of pairs, like [('PWG',''),('MW','M'),....] where (at least for Cologne dictionaries) :

For a pair (X,Y), X would be the Cologne dictionary code and Y would be your 1-letter dictionary code
The pairs would be listed in order of decreasing priority.
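As a rough illustration of what such programmatic generation might look like, here is a minimal Python sketch. The first two pairs come from the discussion above; the third pair's code and tag letter (`"PWK"`, `"K"`) and all headword sets are hypothetical placeholders, not real data.

```python
# Priority list of (dictionary code, one-letter tag) pairs, highest priority first.
# ("PWK", "K") is a hypothetical placeholder for the third dictionary.
PRIORITY = [("PWG", ""), ("MW", "M"), ("PWK", "K")]

# Hypothetical headword sets per dictionary (invented for illustration).
HEADWORDS = {
    "PWG": {"a", "guru"},
    "MW": {"a", "guru", "mahAguru"},
    "PWK": {"mahAguru", "kunta"},
}

def tag_headwords(priority, headwords):
    """Assign each headword the tag of the highest-priority dictionary that has it."""
    tagged = {}
    for code, tag in priority:
        for hw in headwords.get(code, ()):
            # setdefault keeps the tag from the earlier (higher-priority) dictionary.
            tagged.setdefault(hw, tag)
    return tagged

def format_entry(headword, tag):
    """Render 'word' for the default dictionary, 'X word' otherwise."""
    return f"{tag} {headword}" if tag else headword

tagged = tag_headwords(PRIORITY, HEADWORDS)
for hw in sorted(tagged):
    print(format_entry(hw, tagged[hw]))
```

Because each headword keeps only the first tag it receives, a word found in both MW and a lower-priority dictionary is tagged M, matching the "I do not care" rule described earlier.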

I don't know whether such programmatic generation is of interest to you.

Here are two more questions:

How have you handled alternate spellings? For instance, the 'rxx' spelling difference (and the others as indicated in the hwnorm1 exercise).
I tried to explore the alternate spelling question by searching for 'karmma' in the itrans pdf. This is a headword in SKD. I don't think I ever found it. But, in the process of trying to find it, I was baffled by the ordering of words in the pdf. Is there some logic to the ordering? Alphabetical by word would seem a logical ordering choice, and might make the listing a more useful reference.

gasyoun commented 8 years ago

Generating the main body is not the tricky part now. The supplements are. Spelling differences are not marked up: both spellings are listed, without cross-linking. The ordering logic is called reverse ordering. It's alphabetical, but from the end of the word. Does that make any sense to you? Thanks for your questions.
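A minimal Python sketch of reverse ordering, assuming each character of the transliteration stands for one sound (as in SLP1) and using plain code-point order rather than the Sanskrit alphabetical order; the word list is only an illustration.

```python
def reverse_sort(words):
    """Sort words alphabetically from the end, i.e. by their reversed spelling."""
    return sorted(words, key=lambda w: w[::-1])

words = ["guru", "mahAguru", "ka", "a", "kunta", "kubja"]
print(reverse_sort(words))
# ['a', 'kubja', 'ka', 'kunta', 'guru', 'mahAguru']
```

Words with the same ending land next to each other ('guru' next to 'mahAguru'), which is the point of the ordering. A printed dictionary would need a sort key based on the Sanskrit alphabet instead of code points.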

funderburkjim commented 8 years ago

Regarding the 'reverse ordering'.

1. As a general comment, I find it hard to navigate (find things) in this ordering. The only place I've come across reverse ordering is as a technical 'kluge' to permit substring searches in Lucene. For example, if you want Lucene to be able to search documents containing words ending in 'guru', then you need to add the reverse-spelled words in a field, (urug, urugAham, etc. for words 'guru, mahAguru', etc.)
2. I'm guessing that 'reverse ordering' means that the words are sorted when spelled backwards. Is that what you mean?
3. In reverse-250026-itrans.pdf, the first couple of pages seem not to follow the reverse-ordering pattern. However, starting with 'ka' in 2nd column of page 3, the pattern emerges. One artifact of the columnar presentation is that long words 'word-wrap' to the next line, so there are numerous apparent instances of 'a', 'aka', etc. which are really just continuations of long words from the prior line.
4. If you stick with the reverse ordering, you might consider using a fixed-width font. This would help the eye navigate the ordering, as the eye is already heavily burdened by having to read words backward when searching for a given word.
5. What do you mean by the 'supplements are the tricky part now'? What are the supplements?
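The reverse-spelling trick from the first point can be sketched without Lucene: store the reversed words in sorted order, and a suffix search becomes a prefix search. This is a plain-Python illustration of the idea, not Lucene's API, and the word list is invented.

```python
import bisect

def build_reversed_keys(words):
    """Return the reversed spellings of the words, sorted."""
    return sorted(w[::-1] for w in words)

def words_ending_in(reversed_keys, suffix):
    """Find all words ending in `suffix` via a prefix scan over reversed keys."""
    prefix = suffix[::-1]
    start = bisect.bisect_left(reversed_keys, prefix)
    matches = []
    for key in reversed_keys[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        matches.append(key[::-1])
    return matches

keys = build_reversed_keys(["guru", "mahAguru", "kunta", "kubja"])
print(words_ending_in(keys, "guru"))
# ['guru', 'mahAguru']
```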

gasyoun commented 8 years ago

1. Not that hard. Practice makes perfect.
2. Yes.
3. Yes, there is this 'word-wrap' to the next line issue for 3000 words, and there will be a gap to distinguish them.
4. Nice idea, but I'm not aware of any fixed-width font that is good for a book. It's not the web, after all.
5. The lists of Arabic and Greek words are, for example, possible supplements. Other supplements will include: https://github.com/sanskrit-lexicon/GreekInSanskrit/blob/master/CVCVCVCV.pdf https://github.com/sanskrit-lexicon/GreekInSanskrit/blob/master/accent-percent.pdf https://github.com/sanskrit-lexicon/GreekInSanskrit/blob/master/endings.pdf https://github.com/sanskrit-lexicon/GreekInSanskrit/blob/master/praefixoids.pdf

gasyoun commented 8 years ago

Was I not clear enough, @funderburkjim ?

funderburkjim commented 8 years ago

There have been so many postings recently, that this one has fallen off my radar.

From your last question, I gather that you were expecting some action from me here. Please remind me.

gasyoun commented 8 years ago

@funderburkjim actually you have already formulated everything. I badly need your help with the task that you formulated half a year ago at https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/17#issuecomment-13081687

X1, see HW1 in D1
X2, see HW2 in D2
etc where X1 is a word that appears under headword document HW1 in dictionary D1.

So that would give a list of all Arabic, Greek and Latin words in at least MW.
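For Greek and Arabic, one possible starting point is to detect words by Unicode script, assuming the entry text is already in Unicode. The sketch below is only an illustration: it does not rely on the dictionaries' actual language tags, the entry texts are invented, and only the headword/Greek pairs come from the example list at the top of this issue. Latin words cannot be found this way, since transliterated Sanskrit is in Latin script too.

```python
import unicodedata

# Hypothetical entries: (dictionary code, headword, entry text).
# Entry texts are invented placeholders, not actual dictionary content.
ENTRIES = [
    ("MW72", "kunta", "a lance (cf. κοντός)"),
    ("MW72", "kubja", "crooked (cf. κυφός)"),
    ("MW72", "guru", "heavy"),
]

def script_of(word):
    """Return the script name shared by all letters of `word`, else None."""
    scripts = {unicodedata.name(ch, "").split(" ")[0] for ch in word if ch.isalpha()}
    return scripts.pop() if len(scripts) == 1 else None

def foreign_words(entries, dictionary, script):
    """List 'X, see HW' lines for words of `script` found in `dictionary`."""
    lines = []
    for code, headword, text in entries:
        if code != dictionary:
            continue
        for raw in text.split():
            word = raw.strip(".,;:()[]")
            if word and script_of(word) == script:
                lines.append(f"{word}, see {headword}")
    return lines

print("Greek in MW72")
print("\n".join(foreign_words(ENTRIES, "MW72", "GREEK")))
```

Passing `"ARABIC"` instead of `"GREEK"` would collect Arabic-script words in the same way, provided they are stored in Arabic script and not in transliteration.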

funderburkjim commented 8 years ago

@gasyoun For Greek words, this Perseus Links document has what you are looking for, for MW.

gasyoun commented 4 years ago

For Greek words

Indeed. @funderburkjim what would be required for Arabic, for example?

drdhaval2785 commented 3 years ago

The required word list was generated. Arabic was handled in a separate repository. Closing.

sanskrit-lexicon / GreekInSanskrit

non-Sanskrit Word List #17