sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Word Weight vs. Alphabetical Order in Search Results #12

Closed Shalu411 closed 3 years ago

Shalu411 commented 10 years ago

Namaste. In the Advanced display parameters, "Maximum" 20, 50, 100, 200 etc. is for output options, right? (I suggest labelling it "Maximum Words/results in output", but that's a different issue.)

Here my point is that the exact word searched for (the word with that particular spelling) should be given first preference in the order of appearance; the others should follow only later. Otherwise, if I am searching for a very popular word that has thousands of appearances, I have to keep increasing my Maximum output until I find it. This happened to me recently when I was trying to explain to a friend how to use this page.

So this could be rectified by 1. giving the exact word first and 2. arranging the entries not in alphabetical order, but by the number of occurrences of the searched word in the text of each suggested entry. Thank you.

Shalu411 commented 10 years ago

Namaste. More light on the same issue, using http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2013/web/webtc2/index.php. I gave a query for "nI" (keeping the dhAtu in mind), but the output starts with 1 अग्रणी, 2 अग्रु, 3 अग्रेणी, 4 अङ्किन्, 5 अतस, 6 अतिथिग्व, 7 अतिशय, 8 अदर्शन, 9 अद्मसद्य, 10 अद्यतन, 11 अधिरुक्म, 12 अधोवर्चस्, 13 अन्, 14 अनीक, 15 अनीकवन्त्, 16 अनुनय, 17 अनुनय, 18 अनुष्ठान, 19 अन्तर्, 20 अन्तिक (when the Maximum output limit is set to 20). When, then, will the turn of "nI" as a separate entry come?

This works better on the older search page, http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/disp2/index.php, where I get the exact word in a second. So one can choose between these two displays.

[This could be put as an explanation under your FAQ question "Which is best dictionary for me?" :)]

Even on the Advanced Search page, this issue can be set right by showing the searched word (here "nI") independently first, as I suggested earlier, and by giving first preference to words containing "nI" in the headword. Here, these words could happily be sent to the back or kept out of that small output list: 2 अग्रु, 4 अङ्किन्, 5 अतस, 6 अतिथिग्व, 7 अतिशय, 8 अदर्शन, 9 अद्मसद्य, 10 अद्यतन, 11 अधिरुक्म, 12 अधोवर्चस्, 13 अन्, 16 अनुनय, 17 अनुनय, 18 अनुष्ठान, 19 अन्तर्, 20 अन्तिक. Thank you.

funderburkjim commented 10 years ago
  1. Regarding '"Maximum" 20, 50, 100, 200 etc.': this is the maximum number of records (approximately equal to the number of dictionary headwords) that are retrieved at one time. Take 'nI' (Sanskrit word, exact) with Maximum set at 20. Then the left-hand pane shows the headwords for 20 records (why 'agraNI' is there, I'll discuss in point 2). Notice that there is now a 'Next' button to the right of the 'Search' button. If you click the 'Next' button, you get the next batch of 20 records, starting in this case with apanaya. Thus, by repeatedly clicking 'Next' you get successive chunks of the Maximum number of records. So you don't need to fiddle with Maximum to get all the way through; you can do so with Next. However, for a search like 'nI', which may have many matches, you might want to take a bigger 'Maximum' chunk size. One enhancement I notice would be to add a 'Previous' button as well.
  2. For the above 'nI' search, why does the list start with 'agraNI'? The technical reason is that 'nI' appears in parentheses in the body of the definition of agraNI: (agra nī) SIDDH.K. zu P.8,4,14. Here's a fuller description of what the search does in the case of PWG. In PWG, the text is multilingual (usually German and Sanskrit). For instance, if you do an exact match for the Sanskrit word 'asmAkaM', agraNI is again the first headword shown, since the digitization has also marked asmAkaM as Sanskrit, and it appears in the text of headword agraNI. So a search for a 'Sanskrit Word' looks in two places: (a) the headwords and (b) the parts of the text definitions that are marked as Sanskrit. If you look for 'asmAkaM' as a 'Text Word', then there is no match; a 'Text Word' search in PWG looks only at the text NOT marked as Sanskrit. If you do a text search for anführend (with the umlaut), you also don't get any matches, but if you drop the umlaut and search for anfuhrend, then you get agraNI and 4 other headwords in whose text the non-Sanskrit word appears. Note that in PWG, literary citation references can also be searched for, say 'Spr' as a 'Text Word'.
  3. The older search page (http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/disp2/index.php) is not an advanced search page; it is functionally similar to the 'Basic Display' of the 2013 edition. For the PWG2013 Advanced Search to have a similar function, it would need a third search option, say 'Sanskrit Headword', in addition to 'Sanskrit Word' and 'Text Word'. The data structure underlying the PWG advanced search could accommodate this, so adding 'Sanskrit Headword' to PWG2013 Advanced Search would be feasible.
  4. You also suggest ordering the Advanced Search results by 'Word Weight', and you make two suggestions for determining this weight: (a) headword exact match (at least in the case of an 'exact' search), and (b) the count of the number of times the match occurs in (the text of) a given headword. Let me show a bit about the data structure underlying the advanced search in PWG. An initialization program reads the full XML for PWG (pwg.xml), analyzes each record, and constructs a 'summary' file which is used by the Advanced Search. Here is the summary file line for the record with headword aMzahara (HK):
aMSahara :: aMSa   hara <tab> adj einen erbschaftsantheil empfangend erbend p    sch jagn  

Compare this to a display of the underlying data:

(1. aṁśa hara)  adj.  einen Erbschaftsantheil empfangend, erbend,  P. 3, 2, 9, Sch. JĀǴŃ. 2, 132. 133. 

You see that the summary has two parts (separated by the tab): a Sanskrit part and a German (non-Sanskrit) part. The Sanskrit part itself has two parts, separated by the double colon: the headword as the first part and the Sanskrit words in the text as the second part (the Sanskrit parts are in SLP1 transliteration). You'll also see some 'simplifications' of the non-Sanskrit part. When the Advanced Search program looks for matching records, it goes through this summary file, line by line.

Now you can see that implementing enhancements such as the Word Weight would require a quite different data structure. What real search engines do in part is to create what I think are called 'inverted indexes' of the records. An inverted index would have something like a list of all Sanskrit words, one per line, and for each an associated list of records where this Sanskrit word appears; ditto for German (or non-Sanskrit). The counts are accomplished in part, I think, by what is called 'faceting'. For example, on the bestbuy.com web site, if you search for Samsung Laptops, you see on the left 'Computers and tablets (131)', etc.; knowing that there are 131 entries for this category is an example of faceting.

My main point is that such issues are conceptually non-trivial; I currently do not know how to implement a more robust search engine for the Cologne sanskrit-lexicon dictionaries. The only part of your suggestions I can see as a currently feasible enhancement, for PWG, is to add 'Sanskrit Headword' as a third search category. Do you think that would be useful to you?
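To make the summary format and the two ranking suggestions concrete, here is a minimal Python sketch. It is not the actual Cologne code: the function names, the use of a literal tab character where the sample shows <tab>, and the two toy lines for agraRI and nI are all assumptions made for illustration; only the aMSahara line is taken from the summary file shown above. It does a line-by-line scan with three search categories (including the proposed 'Sanskrit Headword'), orders hits with the exact headword first and then by occurrence count, and builds a small inverted index for the Sanskrit words.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Summary(NamedTuple):
    headword: str        # SLP1 headword (before the '::')
    sanskrit: list[str]  # Sanskrit words in the entry text (SLP1)
    text: list[str]      # simplified non-Sanskrit words (after the tab)


def parse_summary(line: str) -> Summary:
    """Split one line of the form 'headword :: sanskrit words<TAB>text words'."""
    sanskrit_part, _, text_part = line.partition("\t")
    headword, _, body = sanskrit_part.partition("::")
    return Summary(headword.strip(), body.split(), text_part.split())


def search(records, word, field):
    """Exact-match linear scan. field: 'headword', 'sanskrit' or 'text'."""
    hits = []
    for rec in records:
        if field == "headword":      # hypothetical 'Sanskrit Headword' option
            count = int(rec.headword == word)
        elif field == "sanskrit":    # headword plus Sanskrit words in the text
            count = (rec.headword == word) + rec.sanskrit.count(word)
        else:                        # 'Text Word': non-Sanskrit text only
            count = rec.text.count(word)
        if count:
            hits.append((rec, count))
    return hits


def rank(hits, word):
    """Exact headword match first, then descending occurrence count."""
    return sorted(hits, key=lambda hit: (hit[0].headword != word, -hit[1]))


def build_inverted_index(records):
    """Map each Sanskrit word to {record position: occurrence count}."""
    index = defaultdict(Counter)
    for pos, rec in enumerate(records):
        for w in [rec.headword, *rec.sanskrit]:
            index[w][pos] += 1
    return index


# The first two lines are invented toy data; the third is the sample above.
lines = [
    "agraRI :: agra nI\tanfuhrend",
    "nI :: nI nI\tfuhren",
    "aMSahara :: aMSa   hara\tadj einen erbschaftsantheil empfangend erbend p    sch jagn",
]
records = [parse_summary(line) for line in lines]

unranked = [rec.headword for rec, _ in search(records, "nI", "sanskrit")]
ranked = [rec.headword for rec, _ in rank(search(records, "nI", "sanskrit"), "nI")]
index = build_inverted_index(records)
print(unranked, ranked, dict(index["nI"]))
```

With the toy data, the plain scan returns agraRI before nI (file order), while the ranked list puts nI first. The inverted index answers the same query with a single dictionary lookup instead of a scan of every line, and the stored counts are what a 'Word Weight' ordering would need.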

drdhaval2785 commented 9 years ago

@funderburkjim Points 2 and 3 are worth doing.

drdhaval2785 commented 3 years ago

@Shalu411 Do you still have any issue in this regard? Or can we close this issue?

gasyoun commented 3 years ago

@drdhaval2785 let us start using Projects. See below:

projects