sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Feature request: A way to link directly to the dictionary entries #249

Open ghost opened 5 years ago

ghost commented 5 years ago

It would be nice if there was some way to directly link to an entry. Ideally, something of the form, https://www.sanskrit-lexicon.uni-koeln.de/MW/144239 (the last one being the entry's ID) or https://www.sanskrit-lexicon.uni-koeln.de/MW?q=bAQa (the last term being the search term in SLP1) would lead to a page which would list the entries 144239 and 144239.1, i.e., the whole of the following

(H1)  [p= 727,2]  बाढ   mfn.  (or बाळ्ह) (√ बंह्; cf.  Pān v, 3, 63 ) strong, mighty (only ibc. and in बाळ्हे ind.), loudly, strongly, mightily, RV.   [ID=144239]
--
(H1C)   बाढम्   ind. (or वाढम्)  assuredly, certainly, indeed, really, by all means, so be it, yes  (generally used as a particle of consent, affirmation or confirmation), MBh. ; Kāv. &c.   [ID=144239.1]

One would be able to print this too! (That's another feature request, namely, a way to print the entries!)

Thanks.

I had opened this issue at CORRECTIONS by mistake, earlier.

ghost commented 5 years ago

I found this:

https://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2013/web/webtc/indexcaller.php?key=bAQa
https://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2013/web/webtc/indexcaller.php?key=bAQa

It lists only ID=144239, and not ID=144239.1.

(Ideally, the url would be simple. My favourite German dictionary uses simple urls like https://dwds.de/wb/Wortschatz . When you see the url, you know what exactly it is expected to do.)

gasyoun commented 5 years ago

Ideally, the url would be simple.

Yeah, Jim played with some rewrite rules, as I remember it.

funderburkjim commented 5 years ago

The closest thing currently at Cologne sanskrit-lexicon is shown by this example:

https://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/sample/list-0.2.php?dict=mw&key=hari&input=slp1&output=deva

Maybe this is a starting point. What do you think?

ghost commented 5 years ago

Jim, your link is fine. If you would just shorten the part before the question-mark, it would look fine too. (As the parameters supplied after the question-mark are obvious and bring something of value.)

But to reveal my wish entirely, I'd ideally like to see something like http://woerterbuchnetz.de/cgi-bin/WBNetz/wbgui_py?sigle=DWB Check the first two icons in every entry! The first one lists both the forms that I listed in my top post. And the second allows you to print the article to a pdf file.

This may require a lot of work, though. I just wanted to express exactly what I wanted to see!


When referring others to the Cologne dictionaries' entries, I have to do things like: "(i) Go to the given link, (ii) Paste the term into the box, (iii) Click Search."

Edit: Actually, there is another step: "(ii)(b) Choose MW as the dictionary"

(Jim: I'd reply to the b-v post later.)

drdhaval2785 commented 5 years ago

There were some discussions earlier regarding RESTful APIs for cologne. Try to see if you can take it a step further and make some code which is production ready by way of rewrite rules? https://github.com/sanskrit-lexicon/Cologne/issues/117 may help you with resources @vniku .

ghost commented 5 years ago

@drdhaval2785 Thanks. I'll go through the page.

A link from drdhaval's link above: https://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/getword.php?key=citra&filter=deva&noLit=off&transLit=slp1 (This one is specially useful as the page can be printed.)

In the earlier post, Jim mentioned the following to get to an image of the exact page: http://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/servepdf.php?dict=MW&key=Davala

funderburkjim commented 5 years ago

print page idea

That's interesting; I never thought of it.

No idea how to accomplish it. Is there some JavaScript function, something like:

print_html(html) that prints an html string?

funderburkjim commented 5 years ago

Regarding : http://woerterbuchnetz.de/cgi-bin/WBNetz/wbgui_py?sigle=DWB&mode=Vernetzung&hitlist=&patternlist=&lemid=GK00009#XGK00009

This is a pretty display. Doing something like it for Cologne dictionaries is perhaps possible. For example we have lists of all headwords in all dictionaries (e.g., sanhw1 or hwnorm1c).

It looks like the left panel has words beginning with a given letter.
Suppose I click 'A' - Then left panel goes from 'A' to 'ABDRECHSELN'.
How does a user get to words beginning with 'AD' or 'AS', etc. ?

funderburkjim commented 5 years ago

https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&key=hari&input=slp1&output=deva

This would be one shortening. If someone could research how to do a rewrite rule I could try putting the rule in an '.htaccess' file at Cologne. Is there much interest in doing this?
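What such a rule has to do is small: map the short path onto the existing script and carry the query string through unchanged. On Apache this would normally be a mod_rewrite rule in '.htaccess'; the sketch below only models the mapping in Python. The function name `rewrite_listview` is hypothetical, and the target path is the list-0.2.php path quoted earlier in this thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Long path of the existing script, as quoted earlier in this thread.
TARGET_PATH = "/scans/awork/apidev/sample/list-0.2.php"


def rewrite_listview(url: str) -> str:
    """Map the short /listview URL to the existing script, keeping the query string.

    Hypothetical helper: it only illustrates what a rewrite rule would do.
    URLs with any other path are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path != "/listview":
        return url
    return urlunsplit(
        (parts.scheme, parts.netloc, TARGET_PATH, parts.query, parts.fragment)
    )


short = "https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&key=hari&input=slp1&output=deva"
print(rewrite_listview(short))
```

The parameters after the question mark are untouched, so the existing script would not need to change.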

funderburkjim commented 5 years ago

Regarding access by ID number

This sounds possible, and perhaps not too hard. Calling form such as

https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&id=12345&output=deva

But what are the use cases?

ghost commented 5 years ago

But what are the use cases?

In general, we'd like direct links because

(i) Every word would have a www home of its own! Ideally, it would be a permalink. The words can then be linked to, directly quoted in articles, etc.

(You know, like, "According to Monier-Williams, उर्वशी is derived from the word उरु, meaning 'to pervade', and not from ऊरु, thighs, as the traditional Indian derivation goes.")

(ii) As gasyoun points out somewhere, this is good for SEO too. When people search for sanskrit words or phrases, the search engines would direct them to the Cologne dictionary pages too. (As of now, they don't.)

Accessing by ID number is not required per se, but at least to me, such links generate more confidence about their permanence! (Valuing the links containing phrases less than the links containing IDs may be somewhat misplaced in this case. In general, netizens have learnt that links containing phrases change when, e.g., the title of the page changes, while links containing IDs don't change -- or at least, never need to change. But I don't see why a link containing the query term in SLP1 ever needs to change.)


I'd like all entries not just to be reachable (using the above link, e.g.), but also, that the links become the primary way to reach a particular word (as opposed to a link to a dictionary interface). To make myself clear: There are two things we want to do: Provide an interface to browse through/search through the dictionary. And, provide a home to a particular word on the internet. The former links (browsing through the dictionary) can be as complex as you please. The latter links (links to particular words) would ideally be as simple and, as I said, as "confidence generating" as possible.

Moreover, ideally, every output, as with the DWB linked above, would itself provide the latter links.


Please note that the link does not change when you make changes in the form (e.g., search for a new word). (You get the new output, but the url stays the same.) You have to edit it by hand to generate the new link.


Unrelated: In the Urvashi MW entry above, there seems to be an obvious error in "M.M., Chips". It should likely be "M.M. Chips" (or perhaps, "Chips, M.M."), but I wasn't able to find that name online.

ghost commented 5 years ago

Jim's idea of a unified list above is pretty nice. Searching for the same word, one by one, in many dictionaries is a chore! Let the computer do it, and either list them all at once, or provide handy links to access the same words in other dictionaries. (I would prefer the former, woerterbuchnetz.de does the latter.)

ghost commented 5 years ago

A benefit of an ID link:

The IDs are already listed at the end of every entry. All we need to do is to convert them to a hyperlink.

So,

उर्वशी f. (fr. उरु and √ 1. अश्, ‘to pervade’ See M.M., Chips, vol. ii, p.99), ‘widely extending’, N. of the dawn [...] [ID=37424]

to something like,

उर्वशी f. (fr. उरु and √ 1. अश्, ‘to pervade’ See M.M., Chips, vol. ii, p.99), ‘widely extending’, N. of the dawn [...] [permalink=37424]
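A minimal sketch of that conversion, assuming the entry is available as text ending in an `[ID=...]` marker. The URL pattern in `BASE` is hypothetical (the thread has not settled on a scheme), and `linkify_ids` is an illustrative name, not an existing function.

```python
import re

# Hypothetical URL pattern; the thread has not settled on a final scheme.
BASE = "https://www.sanskrit-lexicon.uni-koeln.de/MW/"

# Matches markers such as [ID=37424] or [ID=144239.1].
ID_MARKER = re.compile(r"\[ID=(\d+(?:\.\d+)?)\]")


def linkify_ids(entry_text: str) -> str:
    """Replace each [ID=n] marker with an HTML link labelled 'permalink=n'."""
    return ID_MARKER.sub(
        lambda m: f'[<a href="{BASE}{m.group(1)}">permalink={m.group(1)}</a>]',
        entry_text,
    )


print(linkify_ids("N. of the dawn [...] [ID=37424]"))
```

Only the ID marker is touched; other bracketed text such as `[...]` is left as it is.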

@gasyoun or others who know about it, please correct me if I am wrong: I think for the SEO business, once the above links are functioning, all we need to do is to generate all such links (it would be a gigantic list of urls), and submit that to the search engines as our "sitemap". (The technical work involved in generating this "sitemap" would be trivial, of course.)

Even so, I'd like even simpler urls, as permalinks. (You can't keep changing the urls, even their parameters, or the search engines would get confused.) Also, when others start linking to the permalinks, the search engine rank of the dictionaries would improve.

Edit: For now, I have got what I wanted: A way to directly link to every entry, and a way to print the page.

SergeA commented 5 years ago

Unrelated: In the Urvashi MW entry above, there seems to be an obvious error in "M.M., Chips". It should likely be "M.M. Chips" (or perhaps, "Chips, M.M."), but I wasn't able to find that name online.

The text is correct. This should be added to the source list: M. M., Chips = Chips from a German Workshop, by Max Müller (online here - https://archive.org/details/ChipsFromAGermanWorkShopVol2MaxMullerF./page/n115 )

SergeA commented 5 years ago

To have permalinks for MW is a long wished functionality. Is it possible to have an easy way to generate a link for current article and copy it to buffer by clicking one button?

gasyoun commented 5 years ago

Suppose I click 'A' - Then left panel goes from 'A' to 'ABDRECHSELN'.

Yes, but it's not that it shows only 'A' to 'AB'. There are so many articles starting with 'AB' that that's the only thing we see. You can't have 500 articles per page there, only a limited, set amount.


How does a user get to words beginning with 'AD' or 'AS', etc. ?

There is no easy way, but sure could be. Like the Dhaval code for Reverse dictionary (with 2 letter combinations, but we could have 3 letter combinations as well), an index of word beginnings or endings:

-kā (4349), -khā (208), -gā (241), -ghā (43), -cā (62), -chā (52), -jā (423), -jhā (2), -ñā (79), -ṭā (216), -ṭhā (156), -ḍā (192), -ḻā (2), -ḍhā (43), -ṇā (366), -tā (3420), -thā (376), -dā (387), -dhā (422), -nā (711), -pā (259), -phā (7), -bā (48), -bhā (270), -mā (321), -yā (1890), -rā (1024), -lā (1191), -vā (461), -śā (161), -ṣā (586), -sā (354), -hā (227); -ki (124), -khi (31), -gi (48), -ghi (3), -ci (200), -chi (5), -ji (132), -ñi (7), -ṭi (528), -ṭhi (17), -ḍi (76), -ḻi (2), -ḍhi (25), -ṇi (757), -ti (4836), -thi (227), -di (299), -dhi (1322), -ni (786), -pi (111), -phi (1), -bi (22), -bhi (106), -mi (342), -yi (25), -ri (960), -li (520), -vi (226), -śi (105), -ṣi (164), -si (71), -hi (104); -kī (302), -khī (80), -gī (142), -ghī (7), -cī (107), -chī (8), -jī (74), -jhī (1), -ñī (15), -ṭī (277), -ṭhī (69), -ḍī (170), -ḍhī (11), -ṇī (970), -tī (855), -thī (69), -dī (314), -dhī (139), -nī (1675), -pī (134), -phī (1), -bī (40), -bhī (74), -mī (190), -yī (127), -rī (1558), -lī (957), -vī (372), -śī (195), -ṣī (166), -sī (164), -hī (76); -ku (98), -khu (8), -gu (100), -ghu (10), -ṅu (1), -cu (34), -chu (17), -ju (38), -jhu (1), -ñu (13), -ṭu (111), -ṭhu (12), -ḍu (53), -ḻu (3), -ḍhu (2), -ṇu (198), -tu (534), -thu (46), -du (142), -dhu (187), -nu (381), -pu (54), -phu (3), -bu (79), -bhu (51), -mu (10), -yu (422), -ru (735), -lu (185), -vu (6), -śu (149), -ṣu (337), -su (256), -hu (137); -kū (4), -gū (13), -ghū (1), -cū (3), -chū (4), -jū (23), -ṭū (2), -ḍū (6), -ṇū (4), -tū (11), -thū (3), -dū (10), -dhū (60), -nū (18), -pū (71), -phū (3), -bū (20), -bhū (431), -mū (4), -yū (17), -rū (67), -lū (14), -vū (1), -śū (7), -ṣū (13), -sū (87), -hū (21); -kṛ (564), -gṛ (7), -ghṛ (6), -jṛ (2), -ṭṛ (75), -ṭhṛ (1), -ḍṛ (1), -ḍhṛ (17), -ṇṛ (1), -tṛ (1255), -thṛ (1), -dṛ (8), -dhṛ (68), -nṛ (2), -pṛ (18), -bhṛ (20), -mṛ (20), -yṛ (5), -rṛ (1), -lṛ (1), -vṛ (51), -sṛ (72), -hṛ (76); -kṝ (29), -gṝ (19), -jṝ (6), -jhṝ (1), -tṝ (29), -dṝ (13), -dhṝ (1), -nṝ (1), -pṝ (17), -bṝ 
(2), -bhṝ (1), -mṝ (4), -vṝ (3), -śṝ (14), -sṝ (1), -hṝ (1); -ke (15), -khe (4), -ge (5), -ce (1), -je (3), -ṭe (5), -ṭhe (3), -ḍe (3), -ḍhe (1), -ṇe (14), -te (48), -the (11), -de (10), -dhe (10), -ne (23), -pe (12), -bhe (1), -me (17), -ye (54), -re (69), -le (31), -ve (84), -śe (10), -ṣe (12), -se (14), -he (13); -kai (1), -khai (2), -gai (16), -cai (1), -jai (1), -tai (1), -thai (1), -dai (3), -nai (2), -pai (1), -mai (1), -yai (56), -rai (15), -lai (7), -vai (25), -śai (1), -ṣai (5), -sai (2), -hai (1); -ko (1), -go (12), -co (3), -cho (6), -jo (3), -ṇo (2), -to (15), -tho (3), -do (13), -dho (3), -no (10), -po (3), -bho (3), -mo (3), -yo (15), -ro (7), -lo (1), -vo (6), -śo (6), -ṣo (8), -so (16), -ho (7); -kau (2), -gau (1), -cau (1), -jau (2), -ṇau (3), -tau (4), -thau (1), -dau (2), -dhau (1), -nau (11), -pau (2), -mau (2), -yau (4), -rau (11), -lau (2), -vau (1), -ṣau (2), -sau (3), -hau (3); -k (237), -kh (57), -g (87), -gh (45), -ṅ (27), -c (633), -ch (68), -j (1073), -jh (5), -ñ (1), -ṭ (194), -ṭh (50), -ḍ (176), -ḻ (6), -ḍh (3), -ṇ (112), -t (4453), -th (117), -d (1597), -dh (351), -n (9438), -p (407), -ph (21), -b (68), -bh (202), -m (2904), -y (125), -r (389), -l (224), -v (228), -ś (582), -ṣ (791), -s (3487), -h (595);

https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&key=hari&input=slp1&output=deva

This would be one shortening. If someone could research how to do a rewrite rule I could try putting the rule in an '.htaccess' file at Cologne. Is there much interest in doing this?

But I don't see why a link containing the query term in SLP1 ever needs to change

Exactly

provide handy links to access the same words in other dictionaries

@vniku - it's time for you to show what you can code. Let us see if you really want that feature or just do not mind to have it.


Regarding access by ID number

Better not, only if we can link to a direct meaning, not the whole article in general.

submit that to the search engines as our "sitemap".

Sure, we split the 400k URLs into files of 50k each and get 8 sitemap files, plus 1 sitemap index file that lists those 8.
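A rough sketch of that split, assuming the permalinks already exist as a list of URL strings. The 50,000-URL limit per file and the XML namespace come from the sitemap protocol; the file names `sitemap-1.xml`, `sitemap-2.xml`, ... and the placeholder URLs are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit in the sitemap protocol


def build_sitemaps(urls, base):
    """Return (index_xml, [sitemap_xml, ...]) for the given URLs.

    `base` is the assumed public location of the sitemap files, which are
    named sitemap-1.xml, sitemap-2.xml, ... purely by convention here.
    """
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    files = []
    for chunk in chunks:
        body = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in chunk)
        files.append(
            f'<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="{NS}">{body}</urlset>'
        )
    refs = "".join(
        f"<sitemap><loc>{escape(base)}/sitemap-{n}.xml</loc></sitemap>"
        for n in range(1, len(files) + 1)
    )
    index = f'<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns="{NS}">{refs}</sitemapindex>'
    return index, files


# 400k placeholder URLs: 8 files of 50k each, plus the index.
urls = [f"https://www.sanskrit-lexicon.uni-koeln.de/MW/{i}" for i in range(1, 400_001)]
index, files = build_sitemaps(urls, "https://www.sanskrit-lexicon.uni-koeln.de")
print(len(files))
```

Only the index file would then need to be submitted to the search engines, since it points to the other eight.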

Is it possible to have an easy way to generate a link for current article and copy it to buffer by clicking one button?

Sure, but that's a different task. If Jim can handle '.htaccess' correctly we will not need to click anything, it will be there by default.

ghost commented 5 years ago

I have joined the sanskrit-lexicon group. I suggest that we first collect all the ideas and choose the best.

For permalinks: 1) An idea for the simplification of the url https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&key=hari&input=slp1&output=deva

We don't need to specify the input and output formats -- especially not the output format. Anyone who wants a different input or output format can choose it on the page itself. Not everything needs to be done at once!

Thus, the url can look like https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&key=hari

2) I am repeating myself, but I'd like https://www.sanskrit-lexicon.uni-koeln.de/MW/hari the best. If that does not look like a "logical" structure, then we can have https://www.sanskrit-lexicon.uni-koeln.de/dictionaries/MW/hari

A reason for that apart from the one mentioned above (that simple urls "generate confidence" about permanence): I don't know how common this is, but wherever possible, I edit the url itself to reach the desired word. This saves a lot of time and effort. (No need to go to home page, hit the text input box, choose the correct options, hit Enter.) In https://dwds.de/wb/Wortschatz, I just change the last term to reach the new term. There is nothing to remember, no care to take.

For a unified access to all dictionaries: 1) We already have a unified list for headwords. ( sanhw1 or hwnorm1c, and others?)

One way to do this (there may be better ways): Generate a list of dictionaries corresponding to every headword in the unified list. (I'd have to see how the list should be saved for optimal access.)

Headword Dictionaries
...
hari: MW MW72 PWG AP
harika: MW PWG
hariman: MW MW72 PWG
...

This would be our master list. In the unified interface, whenever a headword is queried, by either clicking it in woerterbuchnetz.de fashion, or using the search box, we consult our list to find (i) if the word is available, and (ii) if available, which dictionaries to query. Then we query the dictionaries, and present our result.

Edit: Regarding access by ID: I hadn't noticed that in MW, not just every headword, but every meaning has its own ID. Direct access to every meaning doesn't look useful.

(I love data, so it looks nice to me. If no use for it has yet turned up, let's just console ourselves that at least it is a nice thing to do, and no doubt some use will turn up someday!)

gasyoun commented 5 years ago

https://www.sanskrit-lexicon.uni-koeln.de/MW/hari

I agree it's the maximum length we can afford. Should hari be in HK? Not all servers have the difference in hari and harI.

One way to do this

Done long ago:

ratimukula:ACC
ratiraRaDIra:MW
ratiratnapradIpikA:ACC,MW
ratiramaRa:CAE,MW,PW,PWG,SCH,SHS,VCP,WIL,YAT
ratiramaRaH:SKD
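Reading this format is straightforward. A minimal sketch, assuming one `headword:DICT,DICT,...` record per line as in the sample above; the function names are illustrative and not taken from the repository.

```python
def parse_headword_line(line: str):
    """Split 'headword:DICT1,DICT2' into (headword, [DICT1, DICT2])."""
    headword, _, dicts = line.strip().partition(":")
    return headword, dicts.split(",") if dicts else []


def load_master_list(lines):
    """Build a dict mapping each headword to its list of dictionary codes."""
    return dict(parse_headword_line(line) for line in lines if line.strip())


sample = [
    "ratimukula:ACC",
    "ratiraRaDIra:MW",
    "ratiratnapradIpikA:ACC,MW",
    "ratiramaRa:CAE,MW,PW,PWG,SCH,SHS,VCP,WIL,YAT",
    "ratiramaRaH:SKD",
]
master = load_master_list(sample)
print(master.get("ratiratnapradIpikA"))
print(master.get("notAWord"))
```

A lookup answers both questions at once: whether the word is available, and which dictionaries to query for it.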

This would be our master list.

Make a github local copy and explore the files mentioned.

ghost commented 5 years ago

Not all servers have the difference in hari and harI

That is not a problem here. All that we want from the server is to pass the GET string unmangled to the program handling the queries. This all servers will do.
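As an illustration of the point (not a test of any particular web server), standard URL parsing keeps the case of a query value exactly as written. The sketch uses Python's standard library as a stand-in for whatever program ends up handling the query; `query_key` is a hypothetical helper.

```python
from urllib.parse import urlsplit, parse_qs


def query_key(url: str) -> str:
    """Return the 'key' parameter exactly as it appears in the URL."""
    return parse_qs(urlsplit(url).query)["key"][0]


base = "https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=mw&key="
print(query_key(base + "hari"), query_key(base + "harI"))
```

The two keys stay distinct as long as nothing downstream lowercases them or turns them into filenames on a case-insensitive filesystem.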

It becomes a problem when you are doing system calls. I don't see why hari and harI ever need to become filenames. Links 1 and 2.

(Another factor to consider: SLP1 stands by itself -- no justification or even mention of it is needed. If some other transcription scheme is used, that would have to be prominently mentioned.)

(Edit: By the way, dwds.de/wb/Wortschatz works, but dwds.de/wb/wortschatz does not. In the beginning I thought it was crazy, but I then realized that no native German speaker will ever write a noun without the first letter in uppercase, and so why should the dwds.de people add lines of code to make the access term case-insensitive and thereby slow down the implementation?)

Done long ago:

Nice! (I wasn't able to find sanhw1, but did find sanhw2 in alternateheadwords, which looks good enough. I haven't checked the code yet.)

funderburkjim commented 5 years ago

M.M., Chips

Problem corrected in two steps:

Add <ls> markup in digitization :

<s>urvaSI</s> ¦ <lex>f.</lex> (<ab>fr.</ab> <s>uru</s> and √ <hom>1.</hom> <s>aS</s>, 
‘to pervade’, see <ls>M.M., Chips, vol. ii, p.99</ls>), ‘widely extending’, 

Install this correction (via instructions of pywork/readme_update.txt for mw)

Add 'M.M.' entry to mwauth.txt

(this is file pywork/mwauth/mwauth.txt)

12:14   M.M.    MMChips ti  <expandNorm><ti>Max Muller, Chips from a German Work Shop</ti> [Cologne Addition]</expandNorm>

install this change via script mwauth/redo.sh

verify


Conclude this problem solved.

ghost commented 5 years ago

12:14 M.M. MMChips ti Max Muller, Chips from a German Work Shop [Cologne Addition]

There is a minor error in it. The book's title is, Chips from a German Workshop

(The Internet Archive books' metadata not rarely contains errors. Last I checked, they officially prioritized scanning books over getting the metadata right. Last I checked, they had scanned and put online 6 million books!)

ghost commented 5 years ago

So, found sanhw1 (and a lot more!) in CORRECTIONS. (Wow, that's a lot of work. Users like Sonnetag may never learn it, but the work apart from "faithful digitization", would be very helpful to them too! For one, a unified interface to all the dictionaries would never have been useful without the normalization of headwords.)

For the backend work, so far as I see it, sanhw1.txt contains all we need. We already query the database for a term in a given database (dictionary). This is the garden variety query. Depending on what sanhw1.txt says, all we want is to repeat the same query, changing just the database.

The real work would be the interface for presentation. We need more ideas for this!


Update 1: I noticed sanhw1.txt and sanhw2.txt do not have normalized words. E.g. in the latter

hvAla:CAE;40066,CCS;29705,MW;264862,PW;135787,PWG;117928
hvAlaH:AP;36696

If hvAla is the normalized form, we can merge the above two lines to generate hvAla:CAE;40066,CCS;29705,MW;264862,PW;135787,PWG;117928,AP;36696

Then, when the user enters a query term, we (i) normalize the term, and then (ii) search for it in the above list, (iii) access the entry in the various databases using the id numbers from the above list.

Update 2: Duplicating the lines in the above file would be better than normalizing the query term (as normalizing the query term "live" would take up resources and time). So, the new, unified, file would be like,

hvAla:CAE;40066,CCS;29705,MW;264862,PW;135787,PWG;117928,AP;36696
hvAlaH:CAE;40066,CCS;29705,MW;264862,PW;135787,PWG;117928,AP;36696

sanhw1.txt and sanhw2.txt currently contain 431512 lines. The above unified file would contain the same number of lines.
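The two updates above can be sketched together: group the lines by normalized headword, then write every original headword back out with the combined list. The `normalize` function here is a toy stand-in (it only drops a final visarga `H`) and is not the project's actual normalization; `unify` is likewise an illustrative name.

```python
from collections import defaultdict


def normalize(headword: str) -> str:
    """Toy normalization: drop a final visarga 'H' (SLP1).

    Placeholder only; it stands in for whatever real rules are used.
    """
    return headword[:-1] if headword.endswith("H") else headword


def unify(lines):
    """Give every headword the combined DICT;id list of all its siblings."""
    groups = defaultdict(list)  # normalized form -> [(headword, values), ...]
    for line in lines:
        headword, _, values = line.strip().partition(":")
        groups[normalize(headword)].append((headword, values))
    out = []
    for members in groups.values():
        combined = ",".join(values for _, values in members)
        out.extend(f"{headword}:{combined}" for headword, _ in members)
    return out


lines = [
    "hvAla:CAE;40066,CCS;29705,MW;264862,PW;135787,PWG;117928",
    "hvAlaH:AP;36696",
]
for row in unify(lines):
    print(row)
```

The output has one line per input line, so the line count stays the same, and no normalization is needed at query time.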

funderburkjim commented 5 years ago

@vniku For personal reasons, I'm not going to be able to contribute much on this or other projects for a while.

But here's a brief comment regarding normalized headwords: they appear in the hwnorm1c data. hwnorm1c is built from sanhw1; the basic idea is that certain simple algorithms are used to 'normalize' the spelling of headwords from various dictionaries, and then entries corresponding to identically normalized spellings are gathered.

You can see the normalization rules in hwnorm1c.py in function normalize_key.

In my opinion more work needs to be done in this normalization. What I would like is to have a 'standard' spelling for any Sanskrit headword (in any Sanskrit-X dictionary at Cologne). Probably this spelling would often be that of the MW spelling. More than the current normalization of hwnorm1c is required. For instance, the Wilson dictionary often lists verbs with some form of anubandha; so 'viSa' (slp1 spelling) in WIL corresponds to 'viS' in MW. And I think there may be differences in spelling of nominal headwords, e.g. pitA (nom. sing.) in SKD ~ pitf in MW.

Although the 'Perma Link' notion mentioned above is still somewhat vague to me, I suspect that a permalink should somehow be tied to normalized spellings, rather than dictionary-specific spellings.

gasyoun commented 5 years ago

Let us leave permalinks tied to normalised spellings as step two, because it might take a few more years (like the anubandhed dhatu entries of Wilson). @artforlife is ready to make the implementation of New URLs, but is waiting for access, @funderburkjim. Let @vniku explore the code on his own now as it will take time. I would rather want to see a sample of code than another request of it, is it possible @vniku?

funderburkjim commented 5 years ago

Chips from a German Workshop

Spelling changed from 'Work Shop' to 'WorkShop'. See urvaSI to confirm. @vniku Thanks for close reading!

ghost commented 5 years ago

@gasyoun: If my plan above looks reasonable, I'll do it. (Give me some time!)

gasyoun commented 5 years ago

If my plan above looks reasonable, I'll do it.

It makes sense to me, if you do it. Found a way to download the local version of dictionary?

ghost commented 5 years ago

Found a way to download the local version of dictionary?

Hmm. Why do I need a dictionary? My first step would be to generate two files using the normalization rules: a normalized sanhw2.txt (with reduced number of lines), and another file derived from sanhw2 with duplicated dictionary data (with the same number of lines).

drdhaval2785 commented 5 years ago

Jim may shed more light, but it seems that you need sqlite3 for local version to work I guess.

gasyoun commented 5 years ago

a normalized sanhw2.txt sounds like a sanhw3.txt

udvega:AP90;8306,AP;9232,BEN;2178,BOP;1367,BUR;3033,CAE;5629,CCS;3504,MD;4612,MW72;11080,MW;33647,PW;19239,PWG;11543,SCH;8443,SHS;7204,STC;6523,VCP;9345,WIL;7207,YAT;6612
udvegaM:SKD;4806
udvegaH:SKD;4807

If you search for udvega, udvegaM or udvegaH - you get all of them, right?

drdhaval2785 commented 5 years ago

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/ejf/hwnorm1c/hwnorm1c.txt not sufficient @vniku?

artforlife commented 5 years ago

@drdhaval2785 Do we have any docs on setting up the database? After downloading the MW, I see a bunch of .TXT files which seem to contain the dictionary entries. However, I have yet to find a database. Is there a script that pulls the data from TXT format to a database?

drdhaval2785 commented 5 years ago

redo.sh or something like that. Look for it.

ghost commented 5 years ago

hwnorm1c: Right. Thanks. Using it (with minor edits), I got a file with 384887 lines like this:

udvega:AP90;8306,AP;9232,BEN;2178,BOP;1367,BUR;3033,CAE;5626,CCS;3504,MD;4612,MW72;11082,MW;33647,PW;19239,PWG;11543,SCH;8443,SHS;7204,STC;6523,VCP;9345,WIL;7207,YAT;6612/udvegaM:SKD;4806/udvegaH:SKD;4807

(40335 lines in it contain two or more records)

This has lines of the form

key1a:value1a/key1b:value1b/key1c:value1c
key2a:value2a
key3a:value3a/key3b:value3b

(where value1a is of the form DICT1;id,DICT2;id,etc)

All I want is to "expand" them back, now, to

key1a:value1a,value1b,value1c
key1b:value1a,value1b,value1c
key1c:value1a,value1b,value1c
key2a:value2a
key3a:value3a,value3b
key3b:value3a,value3b

(Starting from sanhw2.txt, we get a list of DICT;id form of values for every query, which has all the information we need. If we were to work on sanhw1.txt, we'd need to build a list of DICT;queryterm values. Accessing the dictionaries using the ids must be faster.)
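The expansion itself is a few lines. A sketch, assuming that `/` separates sibling records and that neither `/` nor `:` occurs inside a key or a value; `expand_record` is an illustrative name, and the sample line is an abridged version of the udvega line above.

```python
def expand_record(line: str):
    """Expand 'k1:v1/k2:v2' into ['k1:v1,v2', 'k2:v1,v2']."""
    pairs = [part.partition(":") for part in line.strip().split("/")]
    combined = ",".join(value for _, _, value in pairs)
    return [f"{key}:{combined}" for key, _, _ in pairs]


# Abridged sample, for readability.
line = "udvega:MW;33647,PW;19239/udvegaM:SKD;4806/udvegaH:SKD;4807"
for row in expand_record(line):
    print(row)
```

A line with a single record comes back unchanged, so the same function can be run over the whole file.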

gasyoun commented 5 years ago

Using it (with minor edits)

@drdhaval2785 did you understand?

drdhaval2785 commented 5 years ago

I understand what he says. I am yet to understand what he wants to achieve by this.

ghost commented 5 years ago

Please check this repository.

My plan is this:

(a) We have a file (the siblingsduplicated.txt) which has all the available headwords (non-normalized, 431512 entries), and the corresponding list of dictionary;id entries.

(b) For every query term, we read just one line from this file. From it, we get not just the entries for that word, but also, the entries for all its "siblings". (By "siblings", I mean all the words which will reduce to the same normalized form.)

(The above mentioned file would probably have to be sorted, and converted to an sqlite database for faster access.)

(c) Using the list of dictionary;id data we get in the above step, we construct all the sql queries we need to perform.
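Steps (a) to (c) might look roughly like this with SQLite, assuming the `headword:DICT;id,DICT;id` layout shown earlier in the thread. The table and column names are made up for the sketch, and an in-memory database stands in for a real file.

```python
import sqlite3


def build_index(lines):
    """Load 'headword:DICT;id,DICT;id' lines into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE siblings (headword TEXT PRIMARY KEY, entries TEXT)")
    rows = (line.strip().split(":", 1) for line in lines if line.strip())
    conn.executemany("INSERT INTO siblings VALUES (?, ?)", rows)
    return conn


def lookup(conn, headword):
    """Return [(dict_code, entry_id), ...] for a headword, or [] if absent."""
    row = conn.execute(
        "SELECT entries FROM siblings WHERE headword = ?", (headword,)
    ).fetchone()
    if row is None:
        return []
    return [tuple(item.split(";", 1)) for item in row[0].split(",")]


conn = build_index(
    ["hvAlaH:CAE;40066,CCS;29705,MW;264862,PW;135787,PWG;117928,AP;36696"]
)
print(lookup(conn, "hvAlaH"))
```

The primary key gives an indexed lookup, and SQLite's default text comparison is case-sensitive, which matters for SLP1 keys.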

Update: Please check the repository again. I've made the filenames saner, and added a new file, siblings.txt. (This file has information in the format, NormalizedHeadword:AllSiblings:AllDictionaries. If we are to first (i) reduce the search term to its normalized form, and then, (ii) search for the normalized form, we'd use the first and third columns of this file.)

(I'll make more edits in the files and the code for a few days.)

ghost commented 5 years ago

Next steps:

Let's wait for a while for Jim to be back. Then, I'd like Jim to understand my "vision", and then for Jim and others to provide help.

drdhaval2785 commented 5 years ago

Regarding interface and presentation of multiple dictionaries, I like tab like presentation of EBDic most.

drdhaval2785 commented 5 years ago

screenshot_20190212-073756_ebdic

drdhaval2785 commented 5 years ago

This way the user does not need to scroll down a lot. He can pick his required dictionary. He can click directly on the tab of a dictionary to land there. The order of the tabs can also be specified by cookies.

gasyoun commented 5 years ago

This way the user does not need to scroll down a lot.

On your screenshot 11 tabs are there, but we have 3 times more dictionaries.

Scrolling is not an issue on desktop. Not sure about mobile. Anyway, @vniku, I would want to see some attempts to code, not just asking questions. As you can see @artforlife has made a local version of the site and is testing it - wish you do the same. That's the blessing - see how it works first.

ghost commented 5 years ago

@gasyoun If dict;id access works, getting a unified output is basically as simple as constructing the list of appropriate queries. A lot of work would be required in constructing a good interface, but that comes later.

E.g., for the query term aMSaBU the line in siblingsduplicated-id.txt says:

aMSaBU:MW;26/PD;159/PW;14/PWG;62408

We just construct the following urls (or some simplified way of making the following queries).

https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=MW&id=26
https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=PD&id=159
https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=PW&id=14
https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=PWG&id=62408

Update: I have added a few more files. Now, we have lines like the following in siblingsduplicated-hw.txt:

aMSaBU:MW;aMSaBU/PD;aMSaBU/PW;aMSaBU/PWG;aMSaBU

So, we can construct urls (or php sqlite queries) with the existing methods:

https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=MW&key=aMSaBU
https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=PD&key=aMSaBU
https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=PW&key=aMSaBU
https://www.sanskrit-lexicon.uni-koeln.de/listview?dict=PWG&key=aMSaBU

But, @funderburkjim knows the code and would know exactly what to do! I would rather wait for him than to start working on the php code myself!

Summary: On getting a query term, we read a line from either siblingsduplicated-id.txt or siblingsduplicated-hw.txt. We get a list of all dictionary;id or dictionary;dictionaryheadword at once, from one line. We then craft all the appropriate url/sql queries.
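The last step of the summary, sketched for the id variant: one line in, one URL per dictionary out. The `listview` endpoint with an `id` parameter is the form proposed earlier in the thread and is not confirmed to be live; `urls_for_line` is an illustrative name.

```python
from urllib.parse import urlencode

# Short endpoint proposed earlier in the thread; not confirmed to be live.
BASE = "https://www.sanskrit-lexicon.uni-koeln.de/listview"


def urls_for_line(line: str):
    """Turn 'headword:DICT;id/DICT;id' into one listview URL per dictionary."""
    _, _, entries = line.strip().partition(":")
    urls = []
    for item in entries.split("/"):
        dict_code, entry_id = item.split(";", 1)
        urls.append(f"{BASE}?{urlencode({'dict': dict_code, 'id': entry_id})}")
    return urls


for url in urls_for_line("aMSaBU:MW;26/PD;159/PW;14/PWG;62408"):
    print(url)
```

For the headword variant, the same loop would emit `key=` in place of `id=`.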

drdhaval2785 commented 5 years ago

https://www.sanskrit-lexicon.uni-koeln.de/dicts/MW/ids/26

https://www.sanskrit-lexicon.uni-koeln.de/dicts/MW/entries/rAma

Seem ok?

gasyoun commented 5 years ago

Seem ok?

As bad as can be. Too long, with no need at all.

drdhaval2785 commented 5 years ago

Then you must write API URL Marcis. https://api.github.com/repos/drdhaval2785/siddhantakaumudi seems to work almost the same way.

Maybe shortened by one entry. https://www.sanskrit-lexicon.uni-koeln.de/ids/MW/26

https://www.sanskrit-lexicon.uni-koeln.de/entries/MW/rAma

drdhaval2785 commented 5 years ago

Unified one - https://www.sanskrit-lexicon.uni-koeln.de/rAma

Dictwise - https://www.sanskrit-lexicon.uni-koeln.de/MW/rAma

Idwise - https://www.sanskrit-lexicon.uni-koeln.de/MW/26

Last two are a bit clumsy APIwise. The code will have to parse whether the term after the dictionary code is a number or alphabetic. But looking at the insistence on shorter URLs, there does not seem to be any other way out than sacrificing some good practice.
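The parsing in question could look roughly like this. It is a sketch under two assumptions: dictionary codes are uppercase letters optionally followed by digits (MW, MW72, PWG), and no headword is purely numeric. The `route` function and its return shape are hypothetical.

```python
import re

DICT_CODE = re.compile(r"[A-Z][A-Z0-9]*")  # e.g. MW, MW72, PWG (assumed shape)
ENTRY_ID = re.compile(r"\d+(?:\.\d+)?")    # e.g. 26 or 144239.1


def route(path: str):
    """Classify a short URL path into (kind, dict_code, term).

    Hypothetical dispatcher covering the three forms proposed above.
    """
    segments = [s for s in path.split("/") if s]
    if len(segments) == 1:
        return ("unified", None, segments[0])
    if len(segments) == 2 and DICT_CODE.fullmatch(segments[0]):
        dict_code, term = segments
        kind = "id" if ENTRY_ID.fullmatch(term) else "headword"
        return (kind, dict_code, term)
    raise ValueError(f"unrecognized path: {path!r}")


for p in ("/rAma", "/MW/rAma", "/MW/26"):
    print(p, route(p))
```

The clumsiness is visible in the `kind` line: the same URL position means two different things depending on its content, which the longer `/ids/` and `/entries/` forms would avoid.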

drdhaval2785 commented 5 years ago

Images entrywise - https://www.sanskrit-lexicon.uni-koeln.de/images/MW/rAma

Images idwise - https://www.sanskrit-lexicon.uni-koeln.de/images/MW/26

gasyoun commented 5 years ago

Maybe shortened by one entry. https://www.sanskrit-lexicon.uni-koeln.de/ids/MW/26

Every letter counts.

Unified one - https://www.sanskrit-lexicon.uni-koeln.de/rAma

Dictwise - https://www.sanskrit-lexicon.uni-koeln.de/MW/rAma

Idwise - https://www.sanskrit-lexicon.uni-koeln.de/MW/26

makes sense, @artforlife - he is waiting for the test server from you, Dhaval and was asking for it again today. He is ready to play.

Images entrywise - https://www.sanskrit-lexicon.uni-koeln.de/images/MW/rAma

Rather https://www.sanskrit-lexicon.uni-koeln.de/img/MW/rAma

drdhaval2785 commented 5 years ago

https://www.sanskrit-lexicon.uni-koeln.de/imgs/MW/rAma

imgs instead of img. It needs to be plural according to RESTful API standards.

I am not able to access test server from credentials provided to me earlier by Jim. Not sure why. Maybe Jim will have to create new credentials for me and Yakov.

But as far as I remember, there was no need of a test server for writing rewrite rules. We can test it directly from our localhost. We can rewrite the links from our localhost to the live Cologne server.