sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

RESTful APIs + Clean URLs for cologne #117

Open drdhaval2785 opened 7 years ago

drdhaval2785 commented 7 years ago

https://api.github.com/repos/drdhaval2785/siddhantakaumudi

Please look at the API structure of various links mentioned on this page. It is extremely intuitive. Can we design similar APIs for cologne where the URL itself sends necessary information and we send back the response JSON?

UPDATE

https://github.com/sanskrit-lexicon/Cologne/blob/master/api/apidoc.md is the place where I will be tracking various existing Cologne APIs and their rewrite rules.

funderburkjim commented 7 years ago

These urls are of a style that I have seen with various Python web frameworks, such as Django and flask.

http://example.com/value1/value2/value3

It may also be a feature of Ruby on Rails web framework, which may be what runs GitHub.

It is probably an art to making such urls 'intuitive' to the user of a website.

Essentially, the servers for these urls send the url to a routing function, which interprets the sequence of values and acts accordingly.
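As a rough illustration of such a routing function, here is a minimal Python sketch. The route names (`entries`, `suggest`) and the two handler functions are hypothetical, chosen only to show how a sequence of path values can be interpreted; it is not how any particular framework does it.

```python
from urllib.parse import urlparse

def handle_entries(dictcode, headword):
    # Hypothetical handler: a real one would look the headword up.
    return {"action": "entries", "dict": dictcode, "headword": headword}

def handle_suggest(dictcode, prefix):
    # Hypothetical handler: a real one would return matching headwords.
    return {"action": "suggest", "dict": dictcode, "prefix": prefix}

# The first path value selects the handler; the rest become its arguments.
ROUTES = {"entries": handle_entries, "suggest": handle_suggest}

def route(url):
    """Interpret http://example.com/value1/value2/value3 style urls."""
    values = [v for v in urlparse(url).path.split("/") if v]
    if not values or values[0] not in ROUTES:
        return {"error": "unknown route"}
    handler, args = ROUTES[values[0]], values[1:]
    try:
        return handler(*args)
    except TypeError:
        # Wrong number of path values for this handler.
        return {"error": "bad arguments"}

print(route("http://example.com/entries/pw/citra"))
```

The 'art' lies mostly in choosing the order and meaning of the path values; the dispatch itself is small.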

Another style of restful api is to use parameters:

http://example.com?parm1=value1&parm2=value2&parm3=value3

This is the style that I know how to implement with php; and all the restful interfaces at Cologne sanskrit-lexicon site are of this form.

It might be that using Apache rewrite rules (e.g. by an '.htaccess file reference) would be a way to turn the first style into the second style.

Or maybe (probably) there is some way that php has to directly handle urls of the first type. Maybe you can do some research and find a reference on this.

Sending back the response as JSON is not particularly hard, whichever style of url is used. The JSON definition includes not only objects {"x":"y", "z":"w"} and arrays [x1, x2, x3], but also constants such as "X", 274, etc. So in a loose sense we can say that the Cologne restful apis already return JSON: namely, for the most part, string constants representing HTML.
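To make this point concrete, here is a small Python sketch using only the standard `json` module. The HTML fragment and the field names `key` and `html` are made up for illustration. Note that strict JSON requires double quotes, and that an HTML string only becomes a JSON value once it is quoted and escaped.

```python
import json

html = '<b>citra</b> <i>Adj.</i>'

# A bare string is a valid JSON value once it is quoted and escaped.
as_json = json.dumps(html)
assert json.loads(as_json) == html

# Numbers, arrays and objects are JSON values as well.
assert json.loads("274") == 274
assert json.loads('["x1", "x2"]') == ["x1", "x2"]

# An object lets the server label the parts of a response,
# which a plain HTML string constant cannot do.
response = json.dumps({"key": "citra", "html": html})
print(response)
```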

Still a third way of sending parameters is to send JSON to the server, e.g.

{"parm1":"value1", "parm2":"value2"...}

This is easy with JQuery .ajax

If I'm not mistaken, this latter JSON way of sending from browser to server is the form @juhnowski favors.

Probably the first thing that needs to be done is to have an inventory of all the restful (in style 2) interfaces that currently exist at Cologne, and a specification of the type of data returned. This would serve as a reference point for designing a better API, for whichever restful style we move towards. This will also give a basis for identifying what kinds of JSON objects (rather than just html string constants) the server should return in response to the restful inputs.

drdhaval2785 commented 7 years ago

Just to get the discussion started on two APIs, I have scribbled two such items

  1. Suggest word based on prefix, suffix, substring


drdhaval2785 commented 7 years ago

  2. Search a particular entry

entries/dictcode/inputtransliteration/headword/outputtransliteration/ignoreaccent


gasyoun commented 7 years ago

This is the style that I know how to implement with php; and all the restful interfaces at Cologne sanskrit-lexicon site are of this form.

From an SEO standpoint, it's the worst possible way. And it's no good for the user either.

Or maybe (probably) there is some way that php has to directly handle urls of the first type. Maybe you can do some research and find a reference on this.

I do not think so. I usually do this with mod_rewrite. But I do not deal with APIs.

drdhaval2785 commented 7 years ago

Regarding implementation of point 1

.htaccess file

RewriteEngine on
RewriteRule ^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc2/query.php?word=&lastLnum=0&max=$5&filter=deva&regexp=exact&scase=true&sword=$4&sregexp=$2&transLit=$3&outopt=outopt4&swordhw=hwonly

http://localhost/apitrial/suggest/PW/prefix/slp1/Dava/100 gave result like this

-1#

1 धवनी
2 धवर
3 धवल
4 धवलगिरि
5 धवलगृह
6 धवलता
7 धवलनिबन्ध
8 धवलपक्ष
9 धवलपुराणसमुच्चय
10 धवलमुख
11 धवलमृत्तिका
12 धवलय्
13 धवलाय्
14 धवलाष्टक
15 धवलिमन्
16 धवली
17 धवलेतरतण्डुल
18 धवलोत्पल

@funderburkjim The only drawback seems to be the capital PW in the url. It is a bit difficult to convert a lowercase code to uppercase in a rewrite rule, but Jim can handle it in the backend: a dictionary code passed in lowercase letters can be converted to uppercase in PHP.
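For readers who want to see what the suggest rule does without setting up Apache, here is a Python sketch of the same mapping. The pattern and the target url are copied from the rewrite rule above; the function name `rewrite_suggest` is made up, and the upper-casing of the dictionary code is the backend-side fix suggested here, not something the rule itself does.

```python
import re

PATTERN = re.compile(r"^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)")
TARGET = (
    "http://www.sanskrit-lexicon.uni-koeln.de/scans/{dict}Scan/2014/web/webtc2/"
    "query.php?word=&lastLnum=0&max={max}&filter=deva&regexp=exact&scase=true"
    "&sword={word}&sregexp={mode}&transLit={translit}&outopt=outopt4&swordhw=hwonly"
)

def rewrite_suggest(path):
    """Return the long query.php url for a clean suggest path, or None."""
    m = PATTERN.match(path)
    if m is None:
        return None
    dictcode, mode, translit, word, maximum = m.groups()
    return TARGET.format(
        dict=dictcode.upper(),  # accept 'pw' as well as 'PW'
        max=maximum, word=word, mode=mode, translit=translit,
    )

print(rewrite_suggest("suggest/pw/prefix/slp1/Dava/100"))
```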

drdhaval2785 commented 7 years ago

So it seems easily doable to use more user-friendly URLs with the existing infrastructure (with very little modification, too). Two lines in an .htaccess file are the only code I wrote to get this result.

drdhaval2785 commented 7 years ago

Regarding implementation of API no. 2 -

.htaccess file

RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$3&filter=$4&noLit=off&transLit=$2

http://localhost/apitrial/entries/PW/slp1/citra/deva/ignoreaccent

gave following result

 चित्र

 चित्र [L=39963] [p= 2228-1]
— 1)  Adj.  ( f.  )
  — a) augenfällig , sichtbar , ausgezeichnet.
  — b) hell , licht , hellfarbig. °म्  Adv.   Ṛv.1,71,1.6,65,2. 
  — c) hell , vernehmlich (von Tränen).  [Page2.228-2] 
  — d) verschiedenfarbig , bunt , scheckig. Das , was die Verschiedenfarbigkeit bildet , im  Instr.  oder im Comp. vorangehend.
  — e) bewegt (vom Meere).
  — f) mannichfaltig , verschieden , allerlei. °म् und चित्र°  Adv. 
  — g) qualificirt , mit verschiedenen Martern verbunden (Strafe , Hinrichtung)  203,26.  °म्  Adv.  unter verschiedenen Martern.
  — h) wunderbar.  Spr.5087. 
  — i) das Wort चित्र enthaltend.
— 2)  m. 
  — a) *Buntheit.
  — b) ®*Plumbago_zeylanica.
  — c) ®*Ricinus_communis.
  — d) ®*Jonesia_Asoka.
  — e) eine Form Jama's.
  — f) N.pr.
    — α) verschiedener Männer (parox.  Ṛv.). 
    — β) *eines Gandharva  Gal. 
— 3)  f. 
  — a)  Sg.  und  Pl.  das 12te (später das 14te) Mondhaus.
  — b) *eine Schlangenart.
  — c) Bez. verschiedener Pflanzen  Ḱaraka.7,12.  ( = द्रवन्ती). Nach den Lexicographen: ®Salvinia_cucullata , ®Cucumis_maderaspatanus , Koloquinthe , ®Ricinus_communis , ®Croton_polyandrum oder Tiglium , Myrobalanenbaum , ®Rubia_Munjista und ein best. Gras ( गण्डदूर्वा).
  — d) Bez. verschiedener Metra.
  — e) ein best. Saiteninstrument  S.s.s.185. 
  — f) ein best. Mûrḱhanâ  S.s.s.30. 
  — g) *Schein , Täuschung.
  — h) N.pr.
    — α) *einer Apsaras.
    — β) verschiedener Frauen.
    — γ) eines Felsens.
    — δ) *eines Flusses.
— 4)  n. 
  — a) eine helle , glänzende oder farbige Erscheinung , ein in die Augen fallender Gegenstand , ein funkelndes Geschmeide , Schmuck.
  — b) *verschiedenfarbiges oder verschiedengestaltetes Gehölz.
  — c) Fleck , macula.
  — d) *Sectenzeichen auf der Stirn.
  — e) *der weisse Aussatz.
  — f) Bild , Gemälde , Malerei. Am Ende eines  adj.  Comp.  f.   Megh.64. 
  — g) *buntheit.
  — h) eine ungewöhnliche Erscheinung , Wunder. Mit folgenden यदि , यद् oder *Fut. चित्रम् als Ausruf so v.a. o. Wunder  123,22.134,29.174,11.  Spr.7811. 
  — i) *der Luftraum , Himmel.
  — k) Bez. verschiedener Arten , künstliche Verse u.s.w. in Form von allerlei Figuren durch Nichtwiederholung wiederkehrender Silben oder Wörter in abgekürzter Weise künstlich für das Auge darzustellen  Kâvjapr.9,8.  Wort- und Lautspiel.
drdhaval2785 commented 7 years ago

@funderburkjim and @gasyoun

After experimenting a bit with Cologne server APIs, I feel that the work towards RESTful APIs + Clean URLs is just a bit of mod rewrite modules + some regex magic. So it is time to define all parameters and output format for proper APIs.

Jim may like to list the current APIs (long URL types). I will convert them to user-friendly APIs via rewrite rules.

drdhaval2785 commented 7 years ago

It might be that using Apache rewrite rules (e.g. by an '.htaccess file reference) would be a way to turn the first style into the second style.

And it turns out to be damn easy.

Server needs these two commands to enable mod rewrite

a2enmod rewrite
service apache2 restart

Then the following .htaccess file needs to be put in api folder in Cologne server.

RewriteEngine on
RewriteRule ^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc2/query.php?word=&lastLnum=0&max=$5&filter=deva&regexp=exact&scase=true&sword=$4&sregexp=$2&transLit=$3&outopt=outopt4&swordhw=hwonly
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$3&filter=$4&noLit=off&transLit=$2
gasyoun commented 7 years ago

http://localhost/apitrial/entries/PW/slp1/citra/deva/ignoreaccent

I would go for http://www.sanskrit-lexicon.uni-koeln.de/MD/citra/SDI/ instead of http://www.sanskrit-lexicon.uni-koeln.de/scans/MDScan/2014/web/webtc/indexcaller.php

/SDI/

We will close these service pages to indexing. S stands for SLP1, D for Devanagari, I for ignore accents.
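A compact code such as /SDI/ would have to be expanded on the server. A minimal Python sketch of that step follows. Only the three letters named above are defined; treating S as the input scheme and D as the output scheme is an assumption, and the setting names are hypothetical.

```python
# Only the three letters named in the comment above are defined here;
# any further letters would be hypothetical extensions.
OPTION_LETTERS = {
    "S": ("input", "slp1"),      # assumed: S selects the input scheme
    "D": ("output", "deva"),     # assumed: D selects the output scheme
    "I": ("accent", "ignore"),
}

def decode_options(code):
    """Expand a compact option code such as 'SDI' into named settings."""
    settings = {}
    for letter in code.strip("/").upper():
        if letter not in OPTION_LETTERS:
            raise ValueError("unknown option letter: " + letter)
        name, value = OPTION_LETTERS[letter]
        settings[name] = value
    return settings

print(decode_options("/SDI/"))
```

The trade-off is visible here: the url is short, but every letter needs a lookup table that the user cannot guess from the url alone.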

drdhaval2785 commented 7 years ago

I would go for http://www.sanskrit-lexicon.uni-koeln.de/MD/citra/SDI/

Doable, but seems non-intuitive.

Currently making a list of existing APIs at Cologne.

funderburkjim commented 7 years ago

@drdhaval2785 Good research on htaccess.

I tried one of your examples at Cologne as follows:

# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$3&filter=$4&noLit=off&transLit=$2

Usage example: http://www.sanskrit-lexicon.uni-koeln.de/apitest/entries/PW/slp1/citra/deva/ignoreaccent

This shows:

drdhaval2785 commented 7 years ago

https://github.com/sanskrit-lexicon/Cologne/tree/master/api

This houses the documentation I have started on Clean URL development. You can copy and paste rewrite rules from there and see whether they work well.

funderburkjim commented 7 years ago

Here is a way to get the whole url to be preprocessed by a php program ---

.htaccess

# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^api/(.*)$ http://www.sanskrit-lexicon.uni-koeln.de/apitest/index.php?parms=$1

php program apitest/index.php

<?php
/* Example from stackoverflow
http://stackoverflow.com/questions/6768793/get-the-full-url-in-php
$url =  "//{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}";
*/
$host = $_SERVER['HTTP_HOST'];
$uri = $_SERVER['REQUEST_URI'];
$parmstring = preg_replace('/.*?parms=/','',$uri);
$parms = explode('/',$parmstring);
//

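// Pad to six elements so that list() does not read undefined offsets
// when trailing url segments are omitted.
list($display,$dict,$input,$key1,$output,$accentcode) = array_pad($parms, 6, '');
$year = '2014';
// Accept lower-case dictionary codes such as 'pw'.
$dictup = strtoupper($dict);

// NOTE: $accent is computed here but not yet passed on in $newurl below.
if ($accentcode == 'ignoreaccent') {
 $accent = 'off';
}else {
 $accent = 'on';
}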
$newurl = sprintf("http://www.sanskrit-lexicon.uni-koeln.de/scans/%sScan/%s/web/webtc/getword.php?key=%s&filter=%s&noLit=off&transLit=%s",$dictup,$year,$key1,$output,$input);
// redirect. THIS MUST BE FIRST OUTPUT
header('Location:'.$newurl);
//displayinfo($host,$uri,$parmstring,$parms,$newurl);
function displayinfo($host,$uri,$parmstring,$parms,$newurl) {
echo  "HTTP_HOST=$host<br/>REQUEST_URI=$uri<br/>";
echo "parmstring=$parmstring<br/>";
for($i=0;$i<count($parms);$i++) {
 $val = $parms[$i];
 echo "parms[$i]=$val<br/>";
}
echo "newurl=$newurl<br/>";
}

?>

calling sequence

http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/PW/slp1/citra/deva/ignoreaccent

or lower case pw
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent

This index.php program could probably be quite elaborate.

funderburkjim commented 7 years ago

Address-bar problem

One thing that is undesirable in these approaches is that the browser address-bar gets changed to the ?x=y&z=w form. In other words, the original desired calling sequence gets clobbered.

e.g., for the first example, the address bar changes to the rewritten form:

http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/getword.php?key=citra&filter=deva&noLit=off&transLit=slp1

Is there some remedy for this?

funderburkjim commented 7 years ago

comment on the input parameters

gasyoun commented 7 years ago

Doable, but seems non-intuitive.

At least it's short and SEO is kept in mind.

Because

http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent

8 levels deep - we'll have indexation issues. I would want people to find not only the main page, but to find us googling for words as well.

Is there some remedy for this?

I did not get the question. Get rid of the ? or what?

drdhaval2785 commented 7 years ago

Address-bar problem

Tried to do some research.

  1. Enable proxy module on apache2

    1. sudo a2enmod proxy_http
    2. sudo a2enmod proxy
    3. sudo service apache2 restart
  2. Add [P] flag at the end

At the end of it all, .htaccess reads like this

Options +FollowSymLinks -MultiViews
RewriteEngine On
RewriteBase /
RewriteRule ^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc2/query.php?word=&lastLnum=0&max=$5&filter=deva&regexp=exact&scase=true&sword=$4&sregexp=$2&transLit=$3&outopt=outopt4&swordhw=hwonly [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=$5&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=$5&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=slp1 [P]
RewriteRule ^entries/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=slp1 [P]
RewriteRule ^pdf/([^/]*)/word/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/servepdf.php?dict=$1&key=$2 [P]
RewriteRule ^pdf/([^/]*)/word/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/servepdf.php?dict=$1&key=$2 [P]
RewriteRule ^pdf/([^/]*)/page/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/servepdf.php?page=$2 [P]
RewriteRule ^pdf/([^/]*)/page/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/servepdf.php?page=$2 [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=$5&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=$5&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=deva&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=deva&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=slp1&serverOptions=deva&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=slp1&serverOptions=deva&accent=no&viewAs=phonetic [P]
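
The cascade of entries rules above amounts to one idea: trailing segments may be omitted, and each omitted segment gets a default (transLit=slp1, filter=deva, accent=no). A Python sketch of that logic, with the segment order and defaults read off the rules above and the function name `parse_entries` made up for illustration:

```python
DEFAULTS = ["slp1", "deva", "no"]  # transLit, filter, accent

def parse_entries(path):
    """Map entries/dict/key[/translit[/filter[/accent]]] to query parameters."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or len(parts) > 6 or parts[0] != "entries":
        return None
    dictcode, key = parts[1], parts[2]
    optional = parts[3:]
    # Fill omitted trailing segments from the defaults, as the rule cascade does.
    translit, output, accent = optional + DEFAULTS[len(optional):]
    return {"dict": dictcode, "key": key, "transLit": translit,
            "filter": output, "accent": accent}

print(parse_entries("entries/PW/citra/"))
print(parse_entries("entries/PW/citra/slp1/deva/ignoreaccent"))
```

Doing this in one backend function, rather than in eight near-identical rewrite rules per endpoint, may be easier to maintain; a single catch-all rule could then hand the whole path to the program.
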
vvasuki commented 7 years ago

This is a splendid project, gentlemen! @drdhaval2785 informed me about it on the https://groups.google.com/forum/#!topic/sanskrit-programmers/wdhMuXGpc1E thread, where I announced a similar API for all the dicts available with the stardict-sanskrit etc. projects.

Some suggestions from my experience (see links in thread above) is that in terms of rapid development and maintainability (without sacrificing any of the frontend-backend separation):

That apart, some other feedback:

PS:

drdhaval2785 commented 7 years ago

couchdb

some starters may help.

drdhaval2785 commented 7 years ago

but you will have to compromise on the nice intuitive api structure

Not possible to have best of both worlds?

@drdhaval2785 , http://www.sanskrit-lexicon.uni-koeln.de/apitest/entries/PW/slp1/citra/deva/ignoreaccent is redirecting to some other page which is not returning a JSON result.

It is not configured to return JSON yet. I have currently piggybacked my rewrites on the present Cologne scripts (which return HTML). So the documentation is ready and the rewrites are ready. What is pending is a minor modification of the backend scripts, or a version of them, that returns JSON instead of HTML. That part is mainly Jim's lookout.

vvasuki commented 7 years ago

some starters may help.

@drdhaval2785 http://docs.couchdb.org/en/2.0.0/contents.html has a good intro. I found it quite simple to setup and use.

(Another thing I like about couchdb is that you can replicate the db in couchbase-lite db files for offline use on mobiles etc.)

Not possible to have best of both worlds?

Certainly yes. Start with couchdb, write a very thin wrapper to translate the nice API you desire to the couchdb API while interacting with it. You might want to do it anyway to provide support for querying over ssl (ie https - couchdb2 is currently buggy there).
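A very thin wrapper of this kind could be little more than a url translation. The Python sketch below assumes, purely for illustration, one CouchDB database per dictionary and the headword as document id (homonyms would need a richer id). CouchDB serves a document at GET /{db}/{docid}, and 5984 is its default port.

```python
from urllib.parse import quote

def couch_url(clean_path, base="http://localhost:5984"):
    """Translate entries/<dict>/<headword> into a CouchDB document url."""
    parts = clean_path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "entries":
        raise ValueError("expected entries/<dict>/<headword>")
    _, dictcode, headword = parts
    # CouchDB database names must be lower case.
    return "{}/{}/{}".format(base, dictcode.lower(), quote(headword, safe=""))

print(couch_url("entries/PW/citra"))
```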

mbykov commented 7 years ago


I have used CouchDB / PouchDB for years, it's splendid.


vvasuki commented 7 years ago

@mbykov , great to know! will follow up on your issues pages with questions.

Others, please look at http://diglossa.org:5984/_utils/index.html (meant for database managers/ developers, not end users) for the database UI from an earlier couchdb version to get a feel for it.

vvasuki commented 7 years ago

Another comment as you proceed: since you're doing a major rewrite, you might consider switching away from php to, say, python or scala if you like. I've written a python web service with flask_restplus in only a few lines of code and found it very useful. You easily get features such as a self-documenting api.

gasyoun commented 7 years ago

consider switching away from php to - say python

Too much attention to UI and backend will kill the dictionary cleanup. If only @mbykov could help us move in the CouchDB direction that @vvasuki proposed.

mbykov commented 7 years ago

consider switching away from php to - say python

Too much attention to UI and backend will kill the dictionary cleanup. If only @mbykov could help us move in the CouchDB direction that @vvasuki proposed.

Yes, these are different tasks; they should be separated.

I can do that a bit later - after a week or two - I'll write here


funderburkjim commented 7 years ago

@vvasuki and @mbykov

As I see it, it's best to do development work with couchdb on a separate server.

We have scripts that allow easy duplication of much of current cologne environment elsewhere.

One or both of you should take the lead in this. I think my time is best spent in the current task of 'normalizing' the Cologne data. Developing JSON forms of the data requires an intimate familiarity with the dictionary data, and Dhaval and I can provide scripts to generate JSON forms that you require, once the details of the requirement are clear.

As to programming language, my vote would be for Python. Maybe Python 3, since Python Foundation has stated Python 2 development will stop in 2020. Note the Cologne environment only has Python 2.6 currently.

I would suggest setting up a server on DigitalOcean, just for the purpose of carrying forth the CouchDB ideas both of you are suggesting. This could be done so that all interested parties have access. I'd be glad to do this in whatever way would be most conducive for the development of this idea.

Incidentally, I've done some investigation of ElasticSearch as a good backend; since the search capabilities of Lucene could be brought to bear. Do you have any thoughts on how the benefits of CouchDB compare to those of ElasticSearch?

Let me know if the DigitalOcean idea seems good to you, and if so, we can start developing the details of what is needed.

vvasuki commented 7 years ago

As to programming language, my vote would be for Python. Maybe Python 3, since Python Foundation has stated Python 2 development will stop in 2020.

Good choice - especially since it's more familiar to more of the folks involved than php etc.

Incidentally, I've done some investigation of ElasticSearch as a good backend; since the search capabilities of Lucene could be brought to bear. Do you have any thoughts on how the benefits of CouchDB compare to those of ElasticSearch?

advantages for couchdb -

advantages of elasticsearch -

Nothing stops us from having both. We could start with couchdb and add elasticsearch when the time comes to implement full-text querying.

I think my time is best spent in the current task of 'normalizing' the Cologne data. Developing JSON forms of the data requires an intimate familiarity with the dictionary data, and Dhaval and I can provide scripts to generate JSON forms that you require, once the details of the requirement are clear.

Good - that's correct. This is in fact almost all of what needs to be done (as far as the backend is concerned). Reading a list of JSON documents corresponding to each entry from files and dumping them in a database is relatively easy - @mbykov or I can help you with that depending on our availability when the json files are ready.
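To give a feel for how small the "dumping" step is, here is a Python sketch that builds the request body for CouchDB's bulk endpoint (POST /{db}/_bulk_docs, which takes a JSON object with a "docs" array). The entry fields and the choice of the record number as document id are assumptions for illustration; in practice the entries would be read with json.load from the generated files.

```python
import json

def bulk_docs_payload(entries):
    """Build the request body for CouchDB's POST /{db}/_bulk_docs."""
    docs = []
    for entry in entries:
        doc = dict(entry)
        # Hypothetical choice: use the record number L as the document id.
        doc["_id"] = str(entry["L"])
        docs.append(doc)
    # ensure_ascii=False keeps Devanagari and IAST readable in the payload.
    return json.dumps({"docs": docs}, ensure_ascii=False)

# Made-up sample entry; real ones would come from the generated json files.
entries = [
    {"L": "39963", "key1": "citra", "pc": "2228-1"},
]
payload = bulk_docs_payload(entries)
print(payload)
```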

As I see it, it's best to do development work with couchdb on a separate server.

Indeed - I always work with a couchdb server installed on my laptop. Once the database is ready in the local server, one can replicate it in a remote server with a few clicks and a long wait. So, you don't really need to set up any server for us except what you'll finally use in production.

vvasuki commented 7 years ago

Regarding the JSON format, I request you to add another field: "correction", where dictionary blunders such as https://groups.google.com/forum/#!topic/bvparishat/ntuaembSOsg can be noted and displayed with citation, so that users don't waste their cycles being misled. It's easy to publish a google form to accept such inputs.

gasyoun commented 7 years ago

add another field: "correction",

Makes sense as an optional field for every meaning, not just whole word.

funderburkjim commented 7 years ago

It's beyond my competence to evaluate the comment on pUjana mentioned in the above link. But let's assume that the comment is correct, and that the MW dictionary entry for pUjana is misleading. How do we bring this information into the mw.xml digitization?

The displays already have a link to a Correction form, so that is one way for a random user to bring such insights to our attention.

We have a classification of errors that we call 'print errors', which documents where we have intentionally changed the digitization to be different than the printed edition. We keep a log of these for each dictionary; for instance, there is mw_printchange. This mw_printchange file might be one candidate for where to include such scholarly comments pertaining to a given word as in the pUjana example.

Currently, we have no linking mechanism between the dictionary database and this ancillary printchange file. A <correction/> flag or <correction href="mw_printchange.txt">FURTHER DISCUSSION</correction> indicating the existence of an entry in mw_printchange could be placed in the mw.xml record for pUjana, with the assumption that there is a discussion available that may be accessed by a link to the printchange file.

This then outlines a partial solution that is relatively 'near' to the current configuration of the Cologne digitizations.
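In a JSON form of the entry, the same idea could be carried by the optional "correction" field requested earlier in the thread, attached to an individual meaning rather than to the whole word. The Python sketch below is only one possible shape: the field names `senses`, `text`, `note` and `source` are hypothetical, and the values are placeholders, not real dictionary content.

```python
import json

entry = {
    "key1": "pUjana",
    "senses": [
        {"text": "(first meaning as printed)"},
        {
            "text": "(second meaning as printed)",
            # Optional, per-sense correction with a citation.
            "correction": {
                "note": "(what is disputed and why)",
                "source": "(link to the discussion)",
            },
        },
    ],
}

def corrected_senses(entry):
    """Return the indexes of senses that carry a correction."""
    return [i for i, s in enumerate(entry["senses"]) if "correction" in s]

print(corrected_senses(entry))
print(json.dumps(entry, indent=1))
```

A display program could use `corrected_senses` to decide where to show a warning or a link.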

But @vvasuki 's comment is provocative -- it suggests many possibilities beyond this partial solution. For instance, wouldn't it be good if there were a Sanskrit language StackExchange? Then when a Cologne (or other) dictionary display of pUjana was generated, the display could poll the Sanskrit language Stackexchange API to see if there was discussion there of pUjana, and generate a link there if such a discussion were found.

funderburkjim commented 7 years ago

structural summary of Cologne digitizations

This comment is directed to @vvasuki and @mbykov . It provides a succinct overview of the data structures involved in the cologne digitizations.

Each of the dictionary digitizations is separate, and identified by a dictionary code (mw, pwg, bur, vcp, etc.). Let 'xxx' denote one of these codes.

xxx.txt

There is a primary form of the dictionary, xxx.txt. This is a hybrid form, based closely on the original digitization from Thomas Malten. It is the form to which 'corrections' are made. We are currently in the process of making this form more regular, and similar among dictionaries.

xxx.xml

From this primary form, we generate an xml form, xxx.xml. From xxx.xml, we generate in a completely regular way a sqlite3 database xxx.sqlite. This is the database on which all the displays depend. (NOTE: the 'Advanced search' display also depends on a separate file generated from xxx.xml, in order to facilitate full text searches. This is a substitute for an inverted index, with one advantage of permitting substring searches.)

xxx.sqlite

The xxx.sqlite structure is quite simple - a table with three columns; a row corresponds to an entry in the dictionary.
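A minimal Python sketch of such a three-column table, using the standard sqlite3 module and an in-memory database. The column names are assumptions: key1 and L follow the field names mentioned in the DTD comments of this summary, while `data` (for the xml record) and the sample row are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names are assumptions; the summary only says there are three columns.
conn.execute("CREATE TABLE xxx (key1 TEXT, L TEXT, data TEXT)")
conn.execute(
    "INSERT INTO xxx VALUES (?, ?, ?)",
    ("citra", "39963", "<H1>...</H1>"),  # placeholder xml record
)
conn.execute("CREATE INDEX xxx_key1 ON xxx (key1)")

def lookup(conn, key1):
    """Return (L, data) for every entry with the given headword."""
    cur = conn.execute(
        "SELECT L, data FROM xxx WHERE key1 = ? ORDER BY L", (key1,)
    )
    return cur.fetchall()

print(lookup(conn, "citra"))
```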

xxx.dtd

The xxx.xml record itself has some quite regular parts and some irregular parts. The regular parts are the same for all dictionaries; the irregular parts are currently less regular. There is a document type definition file xxx.dtd to which the xxx.xml file validates.

The root of the xxx.xml file is <xxx>. It's easiest to describe the general structure in DTD terms:

<!ELEMENT  xxx (H1)*>     <!-- for MW, there are also H2, H3, H4, H1A, etc. -->
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts  ... >   <!-- This is the variable part. It is a markup of the text of the entry -->
<!-- h element : 'h' for 'head' -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->  <!-- same as sqlite key1 field -->
<!ELEMENT key2 (#PCDATA )><!--  often in slp1-->
<!ELEMENT hom (#PCDATA) >  <!-- homonym identifier - optional -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc | ETC?)*>  
<!ELEMENT L (#PCDATA) >  <!-- same as sqlite L field -->
<!ELEMENT pc (#PCDATA) > <!-- page-column information (for links to scanned images -->
<!-- ETC? possibly some other 'meta' elements, e.g. pertaining to alternate headword spellings -->

The difficult body element

One key element within the <body> element is the <s> element, which is used to identify text appearing in Devanagari in the printed text; the textual contents of such Sanskrit text is coded in SLP1 transliteration.

It seems likely that most of the fields of a JSON form of the data would be quite closely derived from the 'H1', 'h', and 'tail' elements and that the 'body' field would retain its xml structure.
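As a sketch of what that derivation might look like, the Python below turns one record into a flat JSON document using the standard xml.etree.ElementTree module. The sample record is made up to follow the DTD outline above, and the JSON field names simply reuse the element names; none of this is an agreed format.

```python
import json
import xml.etree.ElementTree as ET

# Made-up record following the H1 / h / body / tail outline.
SAMPLE = (
    "<H1><h><key1>citra</key1><key2>citra</key2></h>"
    "<body><s>citra</s> Adj. hell</body>"
    "<tail><L>39963</L><pc>2228-1</pc></tail></H1>"
)

def record_to_doc(xml_record):
    """Flatten h and tail into fields; keep body as an xml string."""
    rec = ET.fromstring(xml_record)
    h, body, tail = rec.find("h"), rec.find("body"), rec.find("tail")
    return {
        "key1": h.findtext("key1"),
        "key2": h.findtext("key2"),
        "hom": h.findtext("hom"),  # None when the optional element is absent
        "L": tail.findtext("L"),
        "pc": tail.findtext("pc"),
        "body": ET.tostring(body, encoding="unicode"),
    }

doc = record_to_doc(SAMPLE)
print(json.dumps(doc, ensure_ascii=False, indent=1))
```

Keeping `body` as an xml string defers the hard part (interpreting the markup) to whatever program renders the view.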

Of course, views of the data currently require interpretation of the xml within the <body>.
In the current Cologne configuration, a view is generated as html (by a php program disp.php) and the correct viewing in a browser depends on this html in conjunction with CSS. The generation of this view also depends on a choice of how to represent the Sanskrit text (<s> element) - Devanagari, IAST, HK, SLP1, ITRANS.

This view (disp.php) currently has some details peculiar to the dictionary. We hope to regularize the xml structure so that one DTD governs all the xxx.xml files. Then not only could one php program generate a reasonable html display for any dictionary, but a python or Javascript program could also generate the same html display. And, other views (such as that for stardict forms) or simple text forms, or markdown, or wiki forms could also be generated.

gasyoun commented 7 years ago

mw_printchange

That's more than a single letter print change. That's a semantic shift, more than that - it annihilates one of the meanings as not known outside PWG and MW.

wouldn't it be good if there were a Sanskrit language StackExchange?

Let's forget about it. It was the 5th discussion on BVP related to real word corrections in dictionaries since 2005.

other views (such as that for stardict forms) or simple text forms, or markdown, or wiki forms could also be generated.

But only after the DTD is finalized, right?

vvasuki commented 7 years ago

Thanks for the explanation, @funderburkjim .

Conceptually, the data flow I imagine would be like this:

Now, this final higher level abstraction would be what you'd export downstream in the form of json or anything else. If I understand correctly, your xml is currently somewhere in between the txt and this final abstraction, but you would want to move it as close as possible to the latter. Is that so?

Separately, the mw_printchange ought to be made more easily machine processible, I think..

And, least important notes:

gasyoun commented 7 years ago

semantic elements like footnotes, references, verse boundaries

Structural, not semantic.

Separately, the mw_printchange ought to be made more easily machine processible, I think..

What exactly do you mean?

vvasuki commented 7 years ago

semantic elements like footnotes, references, verse boundaries

Structural, not semantic.

References and verse boundaries are definitely structural as well as semantic elements. Footnotes are just one possible structural expression of a semantic entity (further explanation/ note ancillary to the main point), which is what merits preservation in that file.

Separately, the mw_printchange ought to be made more easily machine processible, I think..

What exactly do you mean?

Make it a JSON or something similar.

artforlife commented 5 years ago

Here is a way to get the whole url to be preprocessed by a php program ---

.htaccess

# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^api/(.*)$ http://www.sanskrit-lexicon.uni-koeln.de/apitest/index.php?parms=$1

php program apitest/index.php

<?php
/* Example from stackoverflow
http://stackoverflow.com/questions/6768793/get-the-full-url-in-php
$url =  "//{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}";
*/
$host = $_SERVER['HTTP_HOST'];
$uri = $_SERVER['REQUEST_URI'];
$parmstring = preg_replace('/.*?parms=/','',$uri);
$parms = explode('/',$parmstring);
//

list($display,$dict,$input,$key1,$output,$accentcode) = $parms;
$year = '2014';
$dictup = strtoupper($dict);

if ($accentcode == 'ignoreaccent') {
 $accent = 'off';
}else {
 $accent = 'on';
}
$newurl = sprintf("http://www.sanskrit-lexicon.uni-koeln.de/scans/%sScan/%s/web/webtc/getword.php?key=%s&filter=%s&noLit=off&transLit=%s",$dictup,$year,$key1,$output,$input);
// redirect. THIS MUST BE FIRST OUTPUT
header('Location: '.$newurl);
exit; // stop here so that nothing else is sent after the redirect header
//displayinfo($host,$uri,$parmstring,$parms,$newurl);
function displayinfo($host,$uri,$parmstring,$parms,$newurl) {
echo  "HTTP_HOST=$host<br/>REQUEST_URI=$uri<br/>";
echo "parmstring=$parmstring<br/>";
for($i=0;$i<count($parms);$i++) {
 $val = $parms[$i];
 echo "parms[$i]=$val<br/>";
}
echo "newurl=$newurl<br/>";
}

?>

calling sequence

http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/PW/slp1/citra/deva/ignoreaccent

or lower case pw
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent

This index.php program could probably be quite elaborate.
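
For comparison, the routing step of index.php can be sketched as a small Python function. This is only an illustration of the idea, under the assumption that the path always has the six segments shown in the calling sequence. The name `route` is hypothetical, and the accent segment is parsed but not used, as in the php above.

```python
from urllib.parse import urlencode

BASE = ("http://www.sanskrit-lexicon.uni-koeln.de/scans/"
        "{dict}Scan/{year}/web/webtc/getword.php")

def route(path, year="2014"):
    """Turn 'entries/pw/slp1/citra/deva/ignoreaccent' into a getword.php url."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) != 6:
        raise ValueError("expected 6 path segments, got %d" % len(parts))
    display, dict_code, input_scheme, key1, output, accentcode = parts
    # urlencode escapes the values, so keys with special characters stay intact.
    query = urlencode({
        "key": key1,
        "filter": output,
        "noLit": "off",
        "transLit": input_scheme,
    })
    return BASE.format(dict=dict_code.upper(), year=year) + "?" + query

if __name__ == "__main__":
    print(route("entries/pw/slp1/citra/deva/ignoreaccent"))
```

Rejecting paths with the wrong number of segments is a design choice of this sketch; the php version above silently leaves the missing variables unset.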

Is it possible to add similar URL parsing to the getword.php file? That is, we would pass a clean URL to that file, parse it there, and use the resulting parameters within the file. I have not seen getword.php, but it seems that it would be a quick change to make.

Also, we could then have an Apache rewrite rule to point from this type of URL

http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent

to this type

https://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/

That is, we may not need that intermediary index.php URL-parser at all if getword.php is doing its own URL parsing.

I may be missing something since I have not seen any code, but it seems like a reasonable way to go about it.

gasyoun commented 5 years ago

@funderburkjim and @drdhaval2785, let me introduce you to @artforlife (Yakov); I hope this time it's for real. He wants to see the frontend files (all of them), and I do not know how to help him with that. First of all getword.php, but after that he will still need access to the test server. I forget whether that was possible.

funderburkjim commented 5 years ago

@gasyoun @artforlife

'frontend files'

See description of dictionary_init.sh.

This indicates how to get a local version, such as one integrated with a local XAMPP server.

Is this enough to get you started?

gasyoun commented 5 years ago

See description of dictionary_init.sh.

Thanks a lot, again.

Is this enough to get you started?

Hmm, no.

1) some general Cologne programs - a perfect entry point. Where can I get all of them?

2) dictionary_init.sh downloads a 'working environment' for a given dictionary - but where could Yakov download it for local testing, or one day update a test server that you, Jim, can access?

3) Where is getword.php ?

funderburkjim commented 5 years ago

getword.php

In the current setup, there is a 'getword.php' in two places:

Probably you want the apidev version.

Here's how to get apidev.

Now you can access getword:

image

Important Note: This display works for 'mw' because the web display for mw has also been installed: \c\xampp\htdocs\cologne\mw\web. If you don't have mw installed locally, you'll get a 'not found' message back from getword.

Is this enough to get you started? I've lost track of what you are trying to accomplish here.

funderburkjim commented 5 years ago

php htaccess

Good impression from a first quick read of @artforlife's notes. Will give it a try at Cologne when time permits.

artforlife commented 5 years ago

@funderburkjim Much appreciated. I shall try out your suggestions shortly and let you know how everything goes.

artforlife commented 5 years ago

@funderburkjim I was able to follow your directions; however, I cannot get the getword.php to look up words. Instead, I am getting the following output:

image

As you suggested, my directory structure looks like this:

- cologne 
  --- apidev
  --- mw

getword.php is not working from either apidev or mw/web/webtc.

If you have an idea of what I am missing, I'll be happy to hear it. Otherwise, I'll finish installing some tools and debug it tomorrow.

gasyoun commented 5 years ago

I'll finish installing some tools and debug it tomorrow

Seems it will go this way.

artforlife commented 5 years ago

I am in business.

image

gasyoun commented 5 years ago

@artforlife ok, so you've got it running. Everything needed for the rewrite rule testing?

artforlife commented 5 years ago

@artforlife ok, so you've got it running. Everything needed for the rewrite rule testing?

We shall see. I'll play with that next. Do we know what the general steps for committing and testing are?

artforlife commented 5 years ago

I have a local version working. I was able to simplify it and perform the entire thing using only the rewrite rules. No additional index.php was needed.

Here is how it works.

Inside the cologne directory, we have the following .htaccess file:

# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^(api)/([^/]*)/([^/]*)/([^/]*)/([^/]*) apidev/getword.php?dict=$2&key=$3&input=$4&output=$5

As you can probably gather from the rewrite rule, our call to the API will need to look something like this: http://sites.dev/sanskrit-dict/cologne/api/mw/hari/slp1/iast

When executed in the browser, we get

screenshot from 2019-02-17 22-35-51

which is the same as the direct call

screenshot from 2019-02-17 22-41-07

There is a minor issue with styles not being applied, which I have not yet investigated. This is a proof-of-concept (POC) example rather than a polished, publishable snippet. Is this what we wanted to achieve?
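
For anyone who wants to check the mapping without an Apache setup, the rewrite rule above can be imitated with a Python regular expression. This is only a sketch: it assumes the rule sees the path relative to the cologne directory, as rules in a per-directory .htaccess file do, and `rewrite` is a hypothetical helper name.

```python
import re

# Same pattern as the RewriteRule; $2..$5 become \2..\5 in Python.
PATTERN = re.compile(r"^(api)/([^/]*)/([^/]*)/([^/]*)/([^/]*)")
TEMPLATE = r"apidev/getword.php?dict=\2&key=\3&input=\4&output=\5"

def rewrite(path):
    """Return the rewritten path, or the path unchanged if the rule does not match."""
    match = PATTERN.match(path)
    if match is None:
        return path
    # As with RewriteRule, the whole path is replaced by the substitution.
    return match.expand(TEMPLATE)

if __name__ == "__main__":
    print(rewrite("api/mw/hari/slp1/iast"))
```

Because the pattern is not anchored at the end, any extra trailing segments are accepted and dropped.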