Open drdhaval2785 opened 7 years ago
These urls are of a style that I have seen with various Python web frameworks, such as Django and flask.
http://example.com/value1/value2/value3
It may also be a feature of Ruby on Rails web framework, which may be what runs GitHub.
There is probably an art to making such urls 'intuitive' to the user of a website.
Essentially, the servers for these urls send the url to a routing function, which interprets the sequence of values and acts accordingly.
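As a rough illustration of what such a routing function does, here is a minimal Python sketch. The names `route`, `dispatch` and `HANDLERS` are made up for this example and do not come from any particular framework; real frameworks such as Django or flask do considerably more.

```python
from urllib.parse import urlsplit


def route(url):
    """Split a clean URL into an action name and its positional values."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return ("index", [])
    return (segments[0], segments[1:])


# Hypothetical handlers, keyed by the first path segment.
HANDLERS = {
    "value1": lambda args: "handled value1 with " + ",".join(args),
}


def dispatch(url):
    """Look up the handler for the first segment and pass it the remaining values."""
    action, args = route(url)
    handler = HANDLERS.get(action)
    if handler is None:
        return "404: no route for " + action
    return handler(args)


print(dispatch("http://example.com/value1/value2/value3"))
```

The point is only that the position of each value in the path, rather than a parameter name, tells the server what the value means.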
Another style of restful api is to use parameters:
http://example.com?parm1=value1&parm2=value2&parm3=value3
This is the style that I know how to implement with php; and all the restful interfaces at Cologne sanskrit-lexicon site are of this form.
It might be that using Apache rewrite rules (e.g. via an '.htaccess' file) would be a way to turn the first style into the second style.
Or maybe (probably) there is some way that php has to directly handle urls of the first type. Maybe you can do some research and find a reference on this.
Sending the response back as JSON is not particularly hard, whichever style of url is used. The JSON definition includes not only objects {"x":"y", "z":"w"} and arrays [x1, x2, x3], but also constants such as "X", 274, etc. So we can say that the Cologne restful apis already return JSON: namely, for the most part, in the form of string constants that (usually) represent HTML.
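To make the point about constants concrete, here is a small Python sketch using only the standard `json` module. The HTML snippet and the field names `key` and `html` are placeholders chosen for illustration, not anything the Cologne server returns today.

```python
import json

# Placeholder HTML standing in for what a Cologne display script returns.
html = "<h1>citra</h1>"

# A bare string constant is itself a JSON value.
as_constant = json.dumps(html)

# The same HTML wrapped in an object, so that other fields can travel with it.
as_object = json.dumps({"key": "citra", "html": html})

# Both forms decode back to the original HTML.
assert json.loads(as_constant) == html
assert json.loads(as_object)["html"] == html

print(as_constant)
print(as_object)
```

Moving from the first form to the second is what would let the server return structured objects rather than just html string constants.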
Still a third way of sending parameters is to send JSON to the server, e.g.
{"parm1":"value1", "parm2":"value2"...}
This is easy with JQuery .ajax
If I'm not mistaken, this latter JSON way of sending from browser to server is the form @juhnowski favors.
Probably the first thing that needs to be done is to have an inventory of all the restful (in style 2) interfaces that currently exist at Cologne; and a specification of the type of data returned. This would serve as a reference point for designing a better API, for whichever restful style we move towards. This will also give a basis for identifying what kinds of JSON objects (rather than just html string constants) the server should return in response to the restful inputs.
Just to get discussion started on two APIs, I have scribbled two such items
entries/dictcode/inputtransliteration/headword/outputtransliteration/ignoreaccent
This is the style that I know how to implement with php; and all the restful interfaces at Cologne sanskrit-lexicon site are of this form.
From an SEO point of view it's the worst possible way. And it's no good for the user either, right?
Or maybe (probably) there is some way that php has to directly handle urls of the first type. Maybe you can do some research and find a reference on this.
I do not think so. I usually do it with mod rewrite. But I do not deal with APIs.
Regarding implementation of point 1
RewriteEngine on
RewriteRule ^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc2/query.php?word=&lastLnum=0&max=$5&filter=deva&regexp=exact&scase=true&sword=$4&sregexp=$2&transLit=$3&outopt=outopt4&swordhw=hwonly
http://localhost/apitrial/suggest/PW/prefix/slp1/Dava/100 gave result like this
-1#
1 धवनी
2 धवर
3 धवल
4 धवलगिरि
5 धवलगृह
6 धवलता
7 धवलनिबन्ध
8 धवलपक्ष
9 धवलपुराणसमुच्चय
10 धवलमुख
11 धवलमृत्तिका
12 धवलय्
13 धवलाय्
14 धवलाष्टक
15 धवलिमन्
16 धवली
17 धवलेतरतण्डुल
18 धवलोत्पल
@funderburkjim The only drawback seems to be the capital PW in the whole url. It is a bit difficult to modify it to lowercase in rewrite rule. Jim can handle it in backend. The dictionary code passed in small letters can be converted to capital on his part in PHP.
So it seems easily doable to use more user friendly URLs with the existing infrastructure (with very little modification too). Two lines in an .htaccess file are the only code I wrote to get this result.
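For comparison, the same translation (including the lowercase dictionary code mentioned above) can be sketched in a few lines of Python. This is only an illustration of the mapping, not how the Cologne server does it: `suggest_url` is a made-up name, and the fixed parameters are copied from the suggest rewrite rule.

```python
from urllib.parse import urlencode

BASE = "http://www.sanskrit-lexicon.uni-koeln.de/scans/{dict}Scan/2014/web/webtc2/query.php"


def suggest_url(path):
    """Translate 'suggest/<dict>/<match>/<translit>/<word>/<max>' into the query.php URL."""
    # Unpacking raises ValueError unless there are exactly six segments.
    action, dict_code, match, translit, word, max_hits = path.strip("/").split("/")
    if action != "suggest":
        raise ValueError("not a suggest path: " + path)
    params = [
        ("word", ""), ("lastLnum", "0"), ("max", max_hits), ("filter", "deva"),
        ("regexp", "exact"), ("scase", "true"), ("sword", word),
        ("sregexp", match), ("transLit", translit),
        ("outopt", "outopt4"), ("swordhw", "hwonly"),
    ]
    # Accept 'pw' as well as 'PW': the directory name needs the uppercase code.
    return BASE.format(dict=dict_code.upper()) + "?" + urlencode(params)


print(suggest_url("suggest/pw/prefix/slp1/Dava/100"))
```

Doing the uppercase conversion in code like this avoids the awkward case handling in the rewrite rule itself.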
Regarding implementation of API no. 2 -
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$3&filter=$4&noLit=off&transLit=$2
http://localhost/apitrial/entries/PW/slp1/citra/deva/ignoreaccent
gave following result
चित्र [L=39963] [p= 2228-1] | — 1) Adj. ( f. आ) — a) augenfällig , sichtbar , ausgezeichnet. — b) hell , licht , hellfarbig. °म् Adv. Ṛv.1,71,1.6,65,2. — c) hell , vernehmlich (von Tränen). [Page2.228-2] — d) verschiedenfarbig , bunt , scheckig. Das , was die Verschiedenfarbigkeit bildet , im Instr. oder im Comp. vorangehend. — e) bewegt (vom Meere). — f) mannichfaltig , verschieden , allerlei. °म् und चित्र° Adv. — g) qualificirt , mit verschiedenen Martern verbunden (Strafe , Hinrichtung) 203,26. °म् Adv. unter verschiedenen Martern. — h) wunderbar. Spr.5087. — i) das Wort चित्र enthaltend. — 2) m. — a) *Buntheit. — b) ®*Plumbago_zeylanica. — c) ®*Ricinus_communis. — d) ®*Jonesia_Asoka. — e) eine Form Jama's. — f) N.pr. — α) verschiedener Männer (parox. Ṛv.). — β) *eines Gandharva Gal. — 3) f. आ — a) Sg. und Pl. das 12te (später das 14te) Mondhaus. — b) *eine Schlangenart. — c) Bez. verschiedener Pflanzen Ḱaraka.7,12. ( = द्रवन्ती). Nach den Lexicographen: ®Salvinia_cucullata , ®Cucumis_maderaspatanus , Koloquinthe , ®Ricinus_communis , ®Croton_polyandrum oder Tiglium , Myrobalanenbaum , ®Rubia_Munjista und ein best. Gras ( गण्डदूर्वा). — d) Bez. verschiedener Metra. — e) ein best. Saiteninstrument S.s.s.185. — f) ein best. Mûrḱhanâ S.s.s.30. — g) *Schein , Täuschung. — h) N.pr. — α) *einer Apsaras. — β) verschiedener Frauen. — γ) eines Felsens. — δ) *eines Flusses. — 4) n. — a) eine helle , glänzende oder farbige Erscheinung , ein in die Augen fallender Gegenstand , ein funkelndes Geschmeide , Schmuck. — b) *verschiedenfarbiges oder verschiedengestaltetes Gehölz. — c) Fleck , macula. — d) *Sectenzeichen auf der Stirn. — e) *der weisse Aussatz. — f) Bild , Gemälde , Malerei. Am Ende eines adj. Comp. f. आ Megh.64. — g) *buntheit. — h) eine ungewöhnliche Erscheinung , Wunder. Mit folgenden यदि , यद् oder *Fut. चित्रम् als Ausruf so v.a. o. Wunder 123,22.134,29.174,11. Spr.7811. — i) *der Luftraum , Himmel. — k) Bez. verschiedener Arten , künstliche Verse u.s.w. 
in Form von allerlei Figuren durch Nichtwiederholung wiederkehrender Silben oder Wörter in abgekürzter Weise künstlich für das Auge darzustellen Kâvjapr.9,8. Wort- und Lautspiel. |
@funderburkjim and @gasyoun
After experimenting a bit with Cologne server APIs, I feel that the work towards RESTful APIs + Clean URLs is just a bit of mod rewrite modules + some regex magic. So it is time to define all parameters and output format for proper APIs.
Jim may like to list the current APIs (Long URL types). I will convert it to some user friendly APIs via rewrite.
It might be that using Apache rewrite rules (e.g. via an '.htaccess' file) would be a way to turn the first style into the second style.
And it turns out to be damn easy.
Server needs these two commands to enable mod rewrite
a2enmod rewrite
service apache2 restart
Then the following .htaccess file needs to be put in api folder in Cologne server.
RewriteEngine on
RewriteRule ^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc2/query.php?word=&lastLnum=0&max=$5&filter=deva&regexp=exact&scase=true&sword=$4&sregexp=$2&transLit=$3&outopt=outopt4&swordhw=hwonly
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$3&filter=$4&noLit=off&transLit=$2
http://localhost/apitrial/entries/PW/slp1/citra/deva/ignoreaccent
I would go for http://www.sanskrit-lexicon.uni-koeln.de/MD/citra/SDI/ instead of http://www.sanskrit-lexicon.uni-koeln.de/scans/MDScan/2014/web/webtc/indexcaller.php
/SDI/
These service pages we will close for indexation. S for SLP1, D for devanagari, I for ignore accents.
I would go for http://www.sanskrit-lexicon.uni-koeln.de/MD/citra/SDI/
Doable, but seems non-intuitive.
Currently making a list of existing APIs at Cologne.
@drdhaval2785 Good research on htaccess.
I tried one of your examples at Cologne as follows:
# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$3&filter=$4&noLit=off&transLit=$2
Usage example: http://www.sanskrit-lexicon.uni-koeln.de/apitest/entries/PW/slp1/citra/deva/ignoreaccent
This shows:
https://github.com/sanskrit-lexicon/Cologne/tree/master/api
This houses various documentations I have started regarding Clean URL development. You can copy paste rewrite rules from there and see whether it rolls out well.
Here is a way to get the whole url to be preprocessed by a php program ---
# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^api/(.*)$ http://www.sanskrit-lexicon.uni-koeln.de/apitest/index.php?parms=$1
<?php
/* Example from stackoverflow
http://stackoverflow.com/questions/6768793/get-the-full-url-in-php
$url = "//{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}";
*/
$host = $_SERVER['HTTP_HOST'];
$uri = $_SERVER['REQUEST_URI'];
// Keep only what follows 'parms=' and split it into path segments.
$parmstring = preg_replace('/.*?parms=/','',$uri);
$parms = explode('/',$parmstring);
// Expected order: display/dict/input/key1/output/accentcode
list($display,$dict,$input,$key1,$output,$accentcode) = $parms;
$year = '2014';
// Accept lower case dictionary codes (pw) as well as upper case (PW).
$dictup = strtoupper($dict);
// Note: $accent is computed here but is not yet passed on in $newurl.
if ($accentcode == 'ignoreaccent') {
    $accent = 'off';
} else {
    $accent = 'on';
}
$newurl = sprintf("http://www.sanskrit-lexicon.uni-koeln.de/scans/%sScan/%s/web/webtc/getword.php?key=%s&filter=%s&noLit=off&transLit=%s",$dictup,$year,$key1,$output,$input);
// redirect. THIS MUST BE FIRST OUTPUT
header('Location: '.$newurl);
exit; // stop here so that nothing else is sent after the redirect header
//displayinfo($host,$uri,$parmstring,$parms,$newurl);
// Debugging helper: show how the incoming url was interpreted.
function displayinfo($host,$uri,$parmstring,$parms,$newurl) {
    echo "HTTP_HOST=$host<br/>REQUEST_URI=$uri<br/>";
    echo "parmstring=$parmstring<br/>";
    for($i=0;$i<count($parms);$i++) {
        $val = $parms[$i];
        echo "parms[$i]=$val<br/>";
    }
    echo "newurl=$newurl<br/>";
}
?>
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/PW/slp1/citra/deva/ignoreaccent
or lower case pw
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent
This index.php program could probably be quite elaborate.
One thing that is undesirable in these approaches is that the browser address-bar gets changed to the ?x=y&z=w form. In other words, the original desired calling sequence gets clobbered.
e.g., for the first example, the address bar changes to the rewritten form:
http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/getword.php?key=citra&filter=deva&noLit=off&transLit=slp1
Is there some remedy for this?
http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/getword.php?key=aGga&output=roman&input=hk
Doable, but seems non-intuitive.
At least it's short and SEO is kept in mind.
Because
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent
8 levels deep - we'll have indexation issues. I would want people to find not only the main page, but to find us googling for words as well.
Is there some remedy for this?
I did not get the question. Get rid of the ? or what?
Address-bar problem
Tried to do some research.
Enable proxy module on apache2
Add [P] flag at the end
At the end of it all, .htaccess reads like this
Options +FollowSymLinks -MultiViews
RewriteEngine On
RewriteBase /
RewriteRule ^suggest/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*) http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc2/query.php?word=&lastLnum=0&max=$5&filter=deva&regexp=exact&scase=true&sword=$4&sregexp=$2&transLit=$3&outopt=outopt4&swordhw=hwonly [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=$5&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=$5&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=$4&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=$3 [P]
RewriteRule ^entries/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=slp1 [P]
RewriteRule ^entries/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/getword.php?key=$2&filter=deva&noLit=off&accent=no&transLit=slp1 [P]
RewriteRule ^pdf/([^/]*)/word/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/servepdf.php?dict=$1&key=$2 [P]
RewriteRule ^pdf/([^/]*)/word/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/awork/apidev/servepdf.php?dict=$1&key=$2 [P]
RewriteRule ^pdf/([^/]*)/page/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/servepdf.php?page=$2 [P]
RewriteRule ^pdf/([^/]*)/page/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc/servepdf.php?page=$2 [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=$5&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]+)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=$5&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=$4&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=deva&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=$3&serverOptions=deva&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)/$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=slp1&serverOptions=deva&accent=no&viewAs=phonetic [P]
RewriteRule ^list/([^/]*)/([^/]*)$ http://www.sanskrit-lexicon.uni-koeln.de/scans/$1Scan/2014/web/webtc1/listhier.php?key=$2&keyboard=yes&inputType=phonetic&unicodeInput=devInscript&phoneticInput=slp1&serverOptions=deva&accent=no&viewAs=phonetic [P]
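The entries rules above repeat one pattern eight times just to handle optional trailing values and an optional final slash. As a rough sketch of the same idea in Python, one regular expression with optional groups plus a table of defaults covers all eight cases. The function name `entries_params` is made up, the defaults (slp1, deva, accent=no) are the ones used in the rules, and unlike the rules this sketch rejects empty segments.

```python
import re

# entries/<dict>/<key>[/<translit>[/<filter>[/<accent>]]][/]
ENTRIES = re.compile(
    r"^entries/([^/]+)/([^/]+)(?:/([^/]+))?(?:/([^/]+))?(?:/([^/]+))?/?$"
)


def entries_params(path):
    """Return the getword.php parameters for an 'entries' path, or None if it does not match."""
    m = ENTRIES.match(path)
    if m is None:
        return None
    dict_code, key, translit, out, accent = m.groups()
    return {
        "dict": dict_code,
        "key": key,
        "transLit": translit or "slp1",  # default input transliteration
        "filter": out or "deva",         # default output transliteration
        "noLit": "off",
        "accent": accent or "no",
    }


print(entries_params("entries/PW/citra"))
print(entries_params("entries/PW/citra/slp1/roman/yes/"))
```

The same approach would apply to the list rules, with a different set of fixed parameters.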
This is a splendid project, gentlemen! @drdhaval2785 informed me about it on the https://groups.google.com/forum/#!topic/sanskrit-programmers/wdhMuXGpc1E thread, where I announced a similar API for all the dicts available with the stardict-sanskrit etc. projects.
Some suggestions from my experience (see links in thread above) is that in terms of rapid development and maintainability (without sacrificing any of the frontend-backend separation):
That apart, some other feedback:
headwords
field).PS:
couchdb
some starters may help.
but you will have to compromise on the nice intuitive api structure
Not possible to have best of both worlds?
@drdhaval2785 , http://www.sanskrit-lexicon.uni-koeln.de/apitest/entries/PW/slp1/citra/deva/ignoreaccent is redirecting to some other page which is not returning a JSON result.
It is not yet configured to return JSON yet. I have currently piggybacked by rewrites on present Cologne scripts (which are made to return HTML). SO the documentation is ready. Rewrites are ready. Minor modifications on backend script or a version thereof is pending which returns JSON instead of HTML. That part is majorly Jim's lookout.
some starters may help.
@drdhaval2785 http://docs.couchdb.org/en/2.0.0/contents.html has a good intro. I found it quite simple to setup and use.
(Another thing I like about couchdb is that you can replicate the db in couchbase-lite db files for offline use on mobiles etc.)
Not possible to have best of both worlds?
Certainly yes. Start with couchdb, write a very thin wrapper to translate the nice API you desire to the couchdb API while interacting with it. You might want to do it anyway to provide support for querying over ssl (ie https - couchdb2 is currently buggy there).
I have used CouchDB / PouchDB for years, it's splendid
-- М.
http://diglossa.ru xmpp://m.bykov@jabber.ru
@mbykov , great to know! will follow up on your issues pages with questions.
Others, please look at http://diglossa.org:5984/_utils/index.html (meant for database managers/ developers, not end users) for the database UI from an earlier couchdb version to get a feel for it.
Another comment as you proceed: since you're doing a major rewrite, you might consider switching away from php to - say python or scala if you like. I've written a python web service with flask_restplus with only a few lines of code and found it very useful. You easily get stuff like self-documenting api:
consider switching away from php to - say python
Too much attention to UI and backend will kill the dictionary cleanup. If only @mbykov could help us move in the CouchDB direction that @vvasuki proposed.
yes, these are different tasks, they should be separated.
I can do that a bit later - after a week or two - I'll write here
@vvasuki and @mbykov
As I see it, it's best to do development work with couchdb on a separate server.
We have scripts that allow easy duplication of much of current cologne environment elsewhere.
One or both of you should take the lead in this. I think my time is best spent in the current task of 'normalizing' the Cologne data. Developing JSON forms of the data requires an intimate familiarity with the dictionary data, and Dhaval and I can provide scripts to generate JSON forms that you require, once the details of the requirement are clear.
As to programming language, my vote would be for Python. Maybe Python 3, since Python Foundation has stated Python 2 development will stop in 2020. Note the Cologne environment only has Python 2.6 currently.
I would suggest setting up a server on DigitalOcean, just for the purpose of carrying forth the CouchDB ideas both of you are suggesting. This could be done so that all interested parties have access. I'd be glad to do this in whatever way would be most conducive for the development of this idea.
Incidentally, I've done some investigation of ElasticSearch as a good backend; since the search capabilities of Lucene could be brought to bear. Do you have any thoughts on how the benefits of CouchDB compare to those of ElasticSearch?
Let me know if the DigitalOcean idea seems good to you, and if so, we can start developing the details of what is needed.
As to programming language, my vote would be for Python. Maybe Python 3, since Python Foundation has stated Python 2 development will stop in 2020.
Good choice - especially since it's more familiar to more of the folks involved compared to php etc.
Incidentally, I've done some investigation of ElasticSearch as a good backend; since the search capabilities of Lucene could be brought to bear. Do you have any thoughts on how the benefits of CouchDB compare to those of ElasticSearch?
advantages for couchdb -
advantages of elasticsearch -
Nothing stops us from having both. We could start with couchdb and add elasticsearch when the time comes to implement full-text querying.
I think my time is best spent in the current task of 'normalizing' the Cologne data. Developing JSON forms of the data requires an intimate familiarity with the dictionary data, and Dhaval and I can provide scripts to generate JSON forms that you require, once the details of the requirement are clear.
Good - that's correct. This is in fact almost all of what needs to be done (as far as the backend is concerned). Reading a list of JSON documents corresponding to each entry from files and dumping them in a database is relatively easy - @mbykov or I can help you with that depending on our availability when the json files are ready.
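To give a feel for how small the "dumping" step is, here is a hedged Python sketch that builds, but does not send, a request to CouchDB's `_bulk_docs` endpoint. The server address, the database name `cologne`, the `_id` scheme and the document fields are all assumptions made for illustration; the real fields depend on the JSON form that gets agreed on.

```python
import json
from urllib.request import Request


def bulk_docs_request(server, db, entries):
    """Build (but do not send) a CouchDB _bulk_docs request for a list of entry dicts."""
    body = json.dumps({"docs": entries}).encode("utf-8")
    return Request(
        url="{}/{}/_bulk_docs".format(server.rstrip("/"), db),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; one per dictionary entry.
entries = [{"_id": "pw:citra", "key1": "citra", "dict": "pw"}]
req = bulk_docs_request("http://localhost:5984", "cologne", entries)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the upload against a running server, which will normally also need credentials.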
As I see it, it's best to do development work with couchdb on a separate server.
Indeed - I always work with a couchdb server installed on my laptop. Once the database is ready in the local server, one can replicate it in a remote server with a few clicks and a long wait. So, you don't really need to set up any server for us except what you'll finally use in production.
Regarding the JSON format, I request you to add another field: "correction", where dictionary blunders such as https://groups.google.com/forum/#!topic/bvparishat/ntuaembSOsg can be noted and displayed with citation, so that users don't waste their cycles being misled. It's easy to publish a google form to accept such inputs.
add another field: "correction",
Makes sense as an optional field for every meaning, not just whole word.
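One possible shape for this, sketched in Python: each meaning is an object, and "correction" is simply present or absent on it. The field names `meanings`, `text`, `note` and `source` are assumptions for the sake of the example, and the texts are placeholders rather than real dictionary content.

```python
import json

# Placeholder entry: only the second meaning carries a correction.
entry = {
    "key1": "pUjana",
    "dict": "mw",
    "meanings": [
        {"text": "first sense (placeholder)"},
        {
            "text": "second sense (placeholder)",
            "correction": {"note": "placeholder note", "source": "placeholder citation"},
        },
    ],
}


def corrections(entry):
    """Return (index, correction) pairs for the meanings that carry a correction."""
    return [
        (i, m["correction"])
        for i, m in enumerate(entry["meanings"])
        if "correction" in m
    ]


print(json.dumps(corrections(entry)))
```

A display could then show the note and its citation next to the affected meaning only.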
It's beyond my competence to evaluate the comment on pUjana mentioned in the above link. But let's assume that the comment is correct, and that the MW dictionary entry for pUjana is misleading. How do we bring this information into the mw.xml digitization?
The displays already have a link to a Correction form, so that is one way for a random user to bring such insights to our attention.
We have a classification of errors that we call 'print errors', which documents where we have intentionally changed the digitization to be different than the printed edition. We keep a log of these for each dictionary; for instance, there is mw_printchange. This mw_printchange file might be one candidate for where to include such scholarly comments pertaining to a given word as in the pUjana example.
Currently, we have no linking mechanism between the dictionary database and this ancillary printchange file. A <correction/> flag or <correction href="mw_printchange.txt">FURTHER DISCUSSION</correction> indicating the existence of an entry in mw_printchange could be placed in the mw.xml record for pUjana, with the assumption that there is a discussion available that may be accessed by a link to the printchange file.
This then outlines a partial solution that is relatively 'near' to the current configuration of the Cologne digitizations.
But @vvasuki 's comment is provocative -- it suggests many possibilities beyond this partial solution. For instance, wouldn't it be good if there were a Sanskrit language StackExchange? Then when a Cologne (or other) dictionary display of pUjana was generated, the display could poll the Sanskrit language Stackexchange API to see if there was discussion there of pUjana, and generate a link there if such a discussion were found.
This comment is directed to @vvasuki and @mbykov . It provides a succinct overview of the data structures involved in the cologne digitizations.
Each of the dictionary digitizations is separate, and identified by a dictionary code (mw, pwg, bur, vcp, etc.) Let 'xxx' denote one of these codes.
There is a primary form of the dictionary , xxx.txt. This is a hybrid form, based closely on the original digitization from Thomas Malten. It is the form to which 'corrections' are made. We are currently in the process of making this form more regular, and similar among dictionaries.
From this primary form, we generate an xml form, xxx.xml. From xxx.xml, we generate in a completely regular way a sqlite3 database xxx.sqlite; This is the database on which all the displays depend. (NOTE: the 'Advanced search' display also depends on a separate file generated from xxx.xml, in order to facilitate full text searches. This is a substitute for an inverted index, with one advantage of permitting substring searches.)
The xxx.sqlite structure is quite simple - a table with three columns; a row corresponds to an entry in the dictionary.
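As a sketch of how little is involved, here is the three-column idea in Python with an in-memory database. The table name `entries` and the column names `key1`, `lnum` and `data` are assumptions for illustration (the real names may differ), and the stored record is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: headword key, record number, and the xml record as text.
conn.execute("CREATE TABLE entries (key1 TEXT, lnum REAL, data TEXT)")
conn.execute("CREATE INDEX entries_key1 ON entries (key1)")
conn.execute(
    "INSERT INTO entries VALUES (?, ?, ?)",
    ("citra", 39963, "<H1>placeholder record</H1>"),
)


def lookup(conn, key1):
    """Return (lnum, data) rows for a headword, in dictionary order."""
    cur = conn.execute(
        "SELECT lnum, data FROM entries WHERE key1 = ? ORDER BY lnum", (key1,)
    )
    return cur.fetchall()


print(lookup(conn, "citra"))
```

Every display then reduces to one lookup by headword followed by rendering of the stored record.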
The xxx.xml record itself has some quite regular parts and some irregular parts. The regular parts are the same for all dictionaries; the irregular parts are currently less regular. There is a document type definition file xxx.dtd to which the xxx.xml file validates.
The root of the xxx.xml file is <xxx>. It's easiest to describe the general structure in DTD terms:
<!ELEMENT xxx (H1)*> <!-- for MW, there are also H2, H3, H4, H1A, etc. -->
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts ... > <!-- This is the variable part. It is a markup of the text of the entry -->
<!-- h element : 'h' for 'head' -->
<!ELEMENT h (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 --> <!-- same as sqlite key1 field -->
<!ELEMENT key2 (#PCDATA )><!-- often in slp1-->
<!ELEMENT hom (#PCDATA) > <!-- homonym identifier - optional -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc | ETC?)*>
<!ELEMENT L (#PCDATA) > <!-- same as sqlite L field -->
<!ELEMENT pc (#PCDATA) > <!-- page-column information (for links to scanned images) -->
<!-- ETC? possibly some other 'meta' elements, e.g. pertaining to alternate headword spellings -->
One key element within the <body> element is the <s> element, which is used to identify text appearing in Devanagari in the printed text; the textual contents of such Sanskrit text is coded in SLP1 transliteration.
It seems likely that most of the fields of a JSON form of the data would be quite closely derived from the 'H1', 'h', and 'tail' elements and that the 'body' field would retain its xml structure.
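A minimal Python sketch of that derivation, assuming a record shaped like the DTD outline above: the regular parts become JSON fields and the body is kept as an xml string. The sample record and the function name `record_to_json` are made up for illustration, and the JSON field names simply reuse the element names.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative record only; the body text is a placeholder.
SAMPLE = (
    "<H1><h><key1>citra</key1><key2>citra</key2></h>"
    "<body><s>citra</s> placeholder body text</body>"
    "<tail><L>39963</L><pc>2228-1</pc></tail></H1>"
)


def record_to_json(xml_record):
    """Turn one H1 record into a dict: regular parts as fields, body kept as xml."""
    rec = ET.fromstring(xml_record)
    head, body, tail = rec.find("h"), rec.find("body"), rec.find("tail")
    return {
        "key1": head.findtext("key1"),
        "key2": head.findtext("key2"),
        "hom": head.findtext("hom"),  # None when the optional element is absent
        "L": tail.findtext("L"),
        "pc": tail.findtext("pc"),
        "body": ET.tostring(body, encoding="unicode"),
    }


print(json.dumps(record_to_json(SAMPLE), indent=2))
```

Anything that wants to display the entry still has to interpret the xml inside the `body` field.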
Of course, views of the data currently require interpretation of the xml within the <body>.
In the current Cologne configuration, a view is generated as html (by a php program disp.php) and the correct viewing in a browser depends on this html in conjunction with CSS.
The generation of this view also depends on a choice of how to represent the Sanskrit text (<s> element) - Devanagari, IAST, HK, SLP1, ITRANS.
This view (disp.php) currently has some details peculiar to the dictionary. We hope to regularize the xml structure so that one DTD governs all the xxx.xml files. Then not only could one php program generate a reasonable html display for any dictionary, but a python or Javascript program could generate the same html display as well. And other views (such as stardict forms, simple text forms, markdown, or wiki forms) could also be generated.
mw_printchange
That's more than a single letter print change. That's a semantic shift, more than that - it annihilates one of the meanings as not known outside PWG and MW.
wouldn't it be good if there were a Sanskrit language StackExchange?
Let's forget about it. It was the 5th discussion on BVP related to real word corrections in dictionaries since 2005.
other views (such as that for stardict forms) or simple text forms, or markdown, or wiki forms could also be generated.
But only after the DTD is finalized, right.
Thanks for the explanation, @funderburkjim .
Conceptually, the data flow I imagine would be like this:
Now, this final higher level abstraction would be what you'd export downstream in the form of json or anything else. If I understand correctly, your xml is currently somewhere in between the txt and this final abstraction, but you would want to move it as close as possible to the latter. Is that so?
Separately, the mw_printchange ought to be made more easily machine processible, I think..
And, least important notes:
semantic elements like footnotes, references, verse boundaries
Structural, not semantic.
Separately, the mw_printchange ought to be made more easily machine processible, I think..
What exactly do you mean?
semantic elements like footnotes, references, verse boundaries
Structural, not semantic.
References and verse boundaries are definitely structural as well as semantic elements. Footnotes are just one possible structural expression of a semantic entity (further explanation/ note ancillary to the main point), which is what merits preservation in that file.
Separately, the mw_printchange ought to be made more easily machine processible, I think..
What exactly do you mean?
Make it a JSON or something similar.
Here is a way to get the whole url to be preprocessed by a php program ---
.htaccess
# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^api/(.*)$ http://www.sanskrit-lexicon.uni-koeln.de/apitest/index.php?parms=$1
php program apitest/index.php
<?php
/* Example from stackoverflow
http://stackoverflow.com/questions/6768793/get-the-full-url-in-php
$url = "//{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}";
*/
$host = $_SERVER['HTTP_HOST'];
$uri = $_SERVER['REQUEST_URI'];
$parmstring = preg_replace('/.*?parms=/','',$uri);
$parms = explode('/',$parmstring);
//
list($display,$dict,$input,$key1,$output,$accentcode) = $parms;
$year = '2014';
$dictup = strtoupper($dict);
if ($accentcode == 'ignoreaccent') {
    $accent = 'off';
} else {
    $accent = 'on';
}
$newurl = sprintf("http://www.sanskrit-lexicon.uni-koeln.de/scans/%sScan/%s/web/webtc/getword.php?key=%s&filter=%s&noLit=off&transLit=%s",$dictup,$year,$key1,$output,$input);
// redirect. THIS MUST BE FIRST OUTPUT
header('Location:'.$newurl);
//displayinfo($host,$uri,$parmstring,$parms,$newurl);
function displayinfo($host,$uri,$parmstring,$parms,$newurl) {
    echo "HTTP_HOST=$host<br/>REQUEST_URI=$uri<br/>";
    echo "parmstring=$parmstring<br/>";
    for($i=0;$i<count($parms);$i++) {
        $val = $parms[$i];
        echo "parms[$i]=$val<br/>";
    }
    echo "newurl=$newurl<br/>";
}
?>
calling sequence
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/PW/slp1/citra/deva/ignoreaccent
or, with lower-case pw:
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent
This index.php program could probably be quite elaborate.
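As a rough idea of what a slightly more elaborate index.php could do, here is a minimal sketch that names the path segments, rejects malformed paths, and URL-encodes the values before building the getword.php URL. The function name build_getword_url and the rule that dictionary codes are plain alphanumerics are assumptions made for illustration; the URL template and the hard-coded year come from the program above.

```php
<?php
// Sketch only: maps path segments to named parameters and builds the
// getword.php URL used by the index.php above.

/**
 * @param string $parmstring e.g. "entries/pw/slp1/citra/deva/ignoreaccent"
 * @return string|null target URL, or null when the path is malformed
 */
function build_getword_url(string $parmstring): ?string
{
    $names = ['display', 'dict', 'input', 'key1', 'output', 'accentcode'];
    $parms = explode('/', trim($parmstring, '/'));
    if (count($parms) !== count($names)) {
        return null; // wrong number of segments
    }
    $p = array_combine($names, $parms);

    // Assumption: dictionary codes are short alphanumeric strings.
    if (!preg_match('/^[A-Za-z0-9]+$/', $p['dict'])) {
        return null;
    }
    $year = '2014'; // hard-coded, as in the original program
    return sprintf(
        'http://www.sanskrit-lexicon.uni-koeln.de/scans/%sScan/%s/web/webtc/getword.php?%s',
        strtoupper($p['dict']),
        $year,
        // http_build_query URL-encodes each value
        http_build_query([
            'key' => $p['key1'],
            'filter' => $p['output'],
            'noLit' => 'off',
            'transLit' => $p['input'],
        ])
    );
}

echo build_getword_url('entries/pw/slp1/citra/deva/ignoreaccent'), "\n";
```

Returning null for a bad path leaves it to the caller to decide between a 404 and an error message, rather than redirecting to a broken URL.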
Is it possible to add similar URL parsing to the getword.php file? That is, we would pass a clean URL to that file, do the parsing there, and use the needed parameters within the file. I have not seen getword.php, but it seems that it would be a quick change to make.
Also, we could then have an Apache rewrite rule to point from this type of URL
http://www.sanskrit-lexicon.uni-koeln.de/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent
to this type
https://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/
That is, we may not need the intermediary index.php URL parser at all if getword.php is doing its own URL parsing.
I may be missing something since I have not seen any code, but it seems to be a way to go about doing it.
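To make the suggestion concrete, here is a hedged sketch of how a script could accept either style of URL itself. It is not taken from getword.php, which has not been shown here; the function name params_from_request and the segment names are hypothetical, with the segment order borrowed from the example URL above.

```php
<?php
// Sketch only: lets a script accept either a clean URL or classic
// query parameters. All names here are illustrative.

function params_from_request(string $requestUri): array
{
    // Classic style: ?key=...&filter=... wins if a query string is present.
    $query = parse_url($requestUri, PHP_URL_QUERY);
    if (is_string($query) && $query !== '') {
        parse_str($query, $fromQuery);
        return $fromQuery;
    }

    // Clean style: take the segments that follow ".../api/".
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    $segments = explode('/', trim($path, '/'));
    $start = array_search('api', $segments, true);
    if ($start === false) {
        return []; // neither style matched
    }
    $values = array_slice($segments, $start + 1);
    $names = ['display', 'dict', 'input', 'key', 'output', 'accentcode'];
    $result = [];
    foreach ($names as $i => $name) {
        if (isset($values[$i])) {
            $result[$name] = rawurldecode($values[$i]);
        }
    }
    return $result;
}

print_r(params_from_request('/apitest/api/entries/pw/slp1/citra/deva/ignoreaccent'));
```

In a live script the argument would be $_SERVER['REQUEST_URI']. Existing query-style callers keep working, since the query string is checked first.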
@funderburkjim and @drdhaval2785, let me introduce you to @artforlife (Yakov); I hope this time it's for real. He wants to see the frontend files (all of them), and I do not know how to help him with that. First of all getword.php, but after that he will still need access to the test server. I forget whether that was possible.
@gasyoun @artforlife
'frontend files'
See description of dictionary_init.sh.
This describes how to get a local version, for example one integrated with a local XAMPP server.
Is this enough to get you started?
See description of dictionary_init.sh.
Thanks a lot, again.
Is this enough to get you started?
Hmm, no.
1) Some general Cologne programs - a perfect entry point. Where can I get them all?
2) dictionary_init.sh downloads a 'working environment' for a given dictionary - but where could Yakov download it for local testing, or, one day, update a test server that you, Jim, can access?
3) Where is getword.php?
In the current setup, there is a 'getword.php' in two places:
Probably you want the apidev version.
Here's how to get apidev.
Now you can access getword:
Important Note: This display works for 'mw' because the web display for mw has also been installed: \c\xampp\htdocs\cologne\mw\web. If you don't have mw installed locally, you'll get a 'not found' message back from getword.
Is this enough to get you started? I've lost track of what you are trying to accomplish here.
php htaccess
Good impression from a first quick read of @artforlife's notes. Will give it a try at Cologne when time permits.
@funderburkjim Much appreciated. I shall try out your suggestions shortly and let you know how everything goes.
@funderburkjim I was able to follow your directions; however, I cannot get getword.php to look up words. Instead, I am getting the following output:
As you suggested, my directory structure looks like this:
- cologne
--- apidev
--- mw
getword.php is not working from either apidev or mw/web/webtc.
If you have an idea of what I am missing, I'll be happy to hear it. Otherwise, I'll finish installing some tools and debug it tomorrow.
I'll finish installing some tools and debug it tomorrow
Seems it will go this way.
I am in business.
@artforlife ok, so you've got it running. Everything needed for the rewrite rule testing?
@artforlife ok, so you've got it running. Everything needed for the rewrite rule testing?
We shall see. I'll play with that next. Do we know what the general steps for committing and testing are?
I have a local version working. I was able to simplify it and perform the entire thing using only the rewrite rules. No additional index.php was needed.
Here is how it works.
Inside the cologne directory, we have the following .htaccess file:
# Turn on the rewriting engine
RewriteEngine On
RewriteRule ^(api)/([^/]*)/([^/]*)/([^/]*)/([^/]*) apidev/getword.php?dict=$2&key=$3&input=$4&output=$5
As you can probably gather from the rewrite rule, our call to the API will need to look something like this: http://sites.dev/sanskrit-dict/cologne/api/mw/hari/slp1/iast
When executed in the browser, we get
which is the same as the direct call
There is a minor issue with styles not being applied, but I have not even cared to investigate. This is a POC (proof-of-concept) example rather than some polished, publishable snippet. Is this what we wanted to achieve?
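For anyone who wants to check the mapping without an Apache setup, the rule's pattern can be mimicked in PHP. This is only a sketch: emulate_rewrite is a hypothetical helper, and it assumes the pattern is matched against the path relative to the cologne directory, as is usual for rules in an .htaccess file.

```php
<?php
// Sketch only: mimics the RewriteRule above with preg_match, to show
// which query string getword.php ends up receiving.

function emulate_rewrite(string $relativePath): ?string
{
    // Same pattern as the RewriteRule; '#' is just the PHP regex delimiter.
    $pattern = '#^(api)/([^/]*)/([^/]*)/([^/]*)/([^/]*)#';
    if (!preg_match($pattern, $relativePath, $m)) {
        return null; // the rule would not apply
    }
    return sprintf(
        'apidev/getword.php?dict=%s&key=%s&input=%s&output=%s',
        $m[2], $m[3], $m[4], $m[5]
    );
}

echo emulate_rewrite('api/mw/hari/slp1/iast'), "\n";
// prints: apidev/getword.php?dict=mw&key=hari&input=slp1&output=iast
```

Two things show up when playing with this: the pattern has no trailing `$`, so extra segments after the fourth value are silently ignored, and `[^/]*` also accepts empty segments. Both could be tightened once the URL scheme is settled.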
https://api.github.com/repos/drdhaval2785/siddhantakaumudi
Please look at the API structure of various links mentioned on this page. It is extremely intuitive. Can we design similar APIs for cologne where the URL itself sends necessary information and we send back the response JSON?
UPDATE
https://github.com/sanskrit-lexicon/Cologne/blob/master/api/apidoc.md is the place where I will be tracking various existing Cologne APIs and their rewrite rules.
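As a rough illustration of the "URL in, JSON out" idea raised in this post, a server-side script could wrap its result in a JSON object rather than returning a bare HTML string. The field names (dict, key, found, html) and the function name json_lookup_response are hypothetical, not an agreed Cologne format.

```php
<?php
// Sketch only: wraps a lookup result in a JSON object instead of
// returning a bare HTML string. Field names are hypothetical.

function json_lookup_response(string $dict, string $key, ?string $html): string
{
    $body = [
        'dict'  => $dict,
        'key'   => $key,
        'found' => $html !== null, // false when the word was not found
        'html'  => $html,
    ];
    // JSON_UNESCAPED_UNICODE keeps Devanagari text readable in the output.
    return json_encode($body, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

// In a real endpoint this would be preceded by:
// header('Content-Type: application/json; charset=utf-8');
echo json_lookup_response('mw', 'hari', '<div>...</div>'), "\n";
```

Keeping the HTML as one field among several lets existing displays keep using it, while other clients can rely on the structured fields.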