Display enhancement suggestions from user

funderburkjim commented 7 years ago

A user recently made several suggestions regarding the MW displays.

These seem like interesting ideas, so that's why I'm mentioning them here.

The first one is easy to implement, so I did it.

However, the second and third would likely be quite tricky to implement, and cannot be addressed now.

1. suffix search in the (very old) display.

Can you please enlarge the capacities of the search tool of the Sanskrit dictionary http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html in order to have the possibility of looking for words ending in X, eg. words ending in -van (jítvan, sútvan, jájvan), etc.

As mentioned, this is now implemented. Also, it is part of the MW advanced search.

2. Case insensitive search in MW advanced search

I am always very enthusiastic with your brilliant Sanskrit-lexicon tool, which is a powerful means to look for words in this beautiful language.

http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc2/index.php

I regret however to say that I am not happy with the fact that it is capital-letter sensitive: dhisnya does not appear in the search but dhiSNya with the two retroflexes does. The same applies to long vowels. I feel very uncomfortable about this.

I looked into this; it is not clear how to implement it. See below for further comments.

3. Search for grammatical categories

(The user was probably thinking of adding this to the /tamil/recherche search mentioned in item 1 above.)

Perhaps it is also possible to separate the search between verbs, nouns, adjectives and particles.

Please note that this possibility exists in the Perseus Greek dictionary: http://www.perseus.tufts.edu/hopper/resolveform?redirect=true&display=Greek

4. Other comment

By the way, have you ever thought of including also the etymologies from the Mayrhofer dictionary?
- This book is still in copyright. However, it appears to be out of print, as it is available only through an antiquarian bookseller in Germany.

funderburkjim commented 7 years ago

comment on case insensitive search in new versions of MW (or other displays)

The user is thinking of the /tamil/recherche display. That display is based solely on the HK transliteration.

If we contemplate implementing it in the newer displays (for MW or other dictionaries), even if we restrict to the case where the user has chosen HK for the input method, then the problem is harder. The reason is that the underlying spelling of sanskrit words in the newer dictionaries is in the SLP1 transliteration.

Consider the example of 'dhiSNya' (in HK). When this is lower-cased in HK, we get 'dhisnya'.

Now the user hopes to retrieve 'dhiSNya' when he enters 'dhisnya'.

But we are searching SLP1 spellings, so we want to find the SLP1 spelling 'DizRya'.

How do we know that from 'dhisnya'? Maybe we say that the 's', when converted to SLP1, can be either an 's' (dental sibilant) or a 'z' (cerebral sibilant), and that the 'n' can be either a normal dental 'n' or the cerebral nasal 'R' (in SLP1).

So we would be looking for 4 SLP1 possibilities:

Disnya
Diznya
DisRya
DizRya

So, I guess if we searched for all four of these SLP1 spellings in our SLP1-based dictionary, that would be the same as doing a case-insensitive search in an HK-based dictionary for 'dhisnya'.

Of course, we'd have to do similar kinds of conversions for all the other upper/lower case HK spellings.

This looks like it is theoretically possible, but would be quite awkward.
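To make the four-way expansion above concrete, here is a minimal Python sketch. It is not code from the Cologne displays: the function name `slp1_candidates` and both tables are hypothetical, and the ambiguity table covers only the two cases discussed above (HK s/S and n/N). A full version would need every upper/lower HK pair, including the long vowels.

```python
from itertools import product

# Hypothetical, deliberately partial table: a lower-cased HK letter mapped to
# the SLP1 letters it might stand for. Only the two ambiguities from the
# 'dhisnya' example are listed.
AMBIGUOUS = {
    "s": ["s", "z"],  # HK s (dental) or S (cerebral, SLP1 z)
    "n": ["n", "R"],  # HK n (dental) or N (cerebral, SLP1 R)
}

# HK writes aspirates as digraphs; SLP1 uses one capital letter (subset).
DIGRAPHS = {"kh": "K", "gh": "G", "ch": "C", "jh": "J",
            "th": "T", "dh": "D", "ph": "P", "bh": "B"}


def slp1_candidates(simple_hk):
    """Return every SLP1 spelling that a lower-cased HK string may represent."""
    slots = []
    i = 0
    while i < len(simple_hk):
        pair = simple_hk[i:i + 2]
        if pair in DIGRAPHS:
            slots.append([DIGRAPHS[pair]])
            i += 2
        else:
            ch = simple_hk[i]
            slots.append(AMBIGUOUS.get(ch, [ch]))
            i += 1
    # One candidate per combination of choices across all slots.
    return ["".join(choice) for choice in product(*slots)]


print(slp1_candidates("dhisnya"))
# ['Disnya', 'DisRya', 'Diznya', 'DizRya']
```

The number of candidates doubles with each ambiguous letter, which is one way to see why this approach gets awkward once long vowels and the other cerebrals are added.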

Here are two other comments related to this suggestion:

- Huet's dictionary now has a 'Sanskrit made easy' option. When I tried 'dhisnya' in it, I got the desired word! So, this maybe should be thought of as an impetus to our Cologne site to somehow add such a feature.
  - However, searching for 'dhiSnya', 'dhisNya', and 'dhiSNya' comes up empty.
- Implementing this in the context of a more robust search engine would be conceptually easier. For, a Lucene-based search engine could have multiple access fields for the same document; i.e., in addition to our current access fields of 'key1' (and 'L'), there could be another access field (maybe named 'simpleHK') where we could precompute the lower-case HK access spellings.
  - However, it might be harder to do suggestions, substring searches and suffix searches in such a system.
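As a rough illustration of the precomputed 'simpleHK' idea, here is a Python sketch in which a plain dictionary stands in for the extra access field. This is not Lucene and not existing Cologne code; the function names and the three sample headwords are assumptions made for the example, and accent marks or other special characters in key1 are ignored.

```python
from collections import defaultdict

# SLP1 letters whose HK spelling differs; all other letters are the same.
SLP1_TO_HK = {
    "f": "R", "F": "RR", "x": "lR", "X": "lRR", "E": "ai", "O": "au",
    "K": "kh", "G": "gh", "N": "G", "C": "ch", "J": "jh", "Y": "J",
    "w": "T", "W": "Th", "q": "D", "Q": "Dh", "R": "N",
    "T": "th", "D": "dh", "P": "ph", "B": "bh", "S": "z", "z": "S",
}


def slp1_to_simple_hk(key1):
    """Convert an SLP1 headword to HK, then lower-case it."""
    hk = "".join(SLP1_TO_HK.get(ch, ch) for ch in key1)
    return hk.lower()


def build_simple_hk_index(headwords):
    """Map each lower-cased HK spelling to the SLP1 headwords that share it."""
    index = defaultdict(list)
    for key1 in headwords:
        index[slp1_to_simple_hk(key1)].append(key1)
    return index


# Sample headwords chosen only for illustration.
index = build_simple_hk_index(["DizRya", "rAma", "rama"])
print(index["dhisnya"])  # ['DizRya']
print(index["rama"])     # ['rAma', 'rama']
```

With the spellings precomputed, the lookup is a single exact match on the user's lower-cased input, so the combinatorial expansion at query time is avoided.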

funderburkjim commented 7 years ago

Comment on 3 (grammatical categories)

If the dictionary displays were based on a search engine, this would be conceptually simpler. It would be a matter of adding a field (say, 'gram') where we could store the grammatical categories of the entries. Note that for MW, these are already available (at least the noun, verb, indeclinable categories are known).

Another interesting category would be 'ls'; so we could search for records with a reference, say, to the Hitopadesha; and this could be done in conjunction with conditions on the spelling of the word. Again, this 'ls' information is identifiable for MW (and at least almost-identifiable for PW, PWG).
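A small Python sketch of how such fields could combine with a spelling condition follows. The record layout, the `search` function, and every field value are invented for the example (they are not taken from MW); a real search engine would do the same filtering through its own query syntax.

```python
# Invented sample records: 'gram' and 'ls' values are placeholders only.
RECORDS = [
    {"key1": "jitvan", "gram": "noun", "ls": ["Hit."]},
    {"key1": "sutvan", "gram": "noun", "ls": []},
    {"key1": "BU", "gram": "verb", "ls": ["Hit."]},
]


def search(records, suffix="", gram=None, ls=None):
    """Return headwords matching all of the conditions that were given."""
    hits = []
    for rec in records:
        if not rec["key1"].endswith(suffix):
            continue  # spelling condition (suffix search, as in item 1)
        if gram is not None and rec["gram"] != gram:
            continue  # grammatical category condition
        if ls is not None and ls not in rec["ls"]:
            continue  # literary source condition
        hits.append(rec["key1"])
    return hits


print(search(RECORDS, suffix="van"))             # ['jitvan', 'sutvan']
print(search(RECORDS, suffix="van", ls="Hit."))  # ['jitvan']
print(search(RECORDS, gram="verb"))              # ['BU']
```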

gasyoun commented 7 years ago

I regret however to say that I am not happy with the fact that it is capital-letter sensitive: dhisnya does not appear in the search but dhiSNya with the two retroflexes does. The same applies to long vowels. I feel very uncomfortable about this.

Yeah, a pseudo-HK would make sense, the 'simpleHK', as indeed sometimes you do not know exactly how the word is spelled. None of the sites has it at the level actually needed.

http://www.perseus.tufts.edu/hopper/search

This book is still in copyright. However, it appears to be out of print, as only available through an antiquarian book seller in Germany.

I have received written approval from Mayrhofer that I can use KEWA and EWA online. So that should not be an issue.

Another interesting category would be 'ls'; so we could search for records with a reference , say, to the Hitopadesha; and this could be done in conjunction with conditions on the spelling of the word.

Just like http://kjc-sv013.kjc.uni-heidelberg.de/dcs/index.php?contents=texte has.

funderburkjim commented 7 years ago

Mayrhofer, that I can use KEWA and EWA online

Could you send me a link so I can see what these are?

funderburkjim commented 7 years ago

@gasyoun Thanks for the dcs link. Looks like there is a lot of good work there.

In the sentence analysis, I noticed that there is no 'analyzed sandhi' section. Do you know if it is available but just not printed, or is it currently unavailable? Do you know how the analysis was done?

funderburkjim commented 7 years ago

Here are some further comments by this user (in response to email correspondence). Incidentally, I've asked him if he wants to join this Github project:

In fact, I think the criterion for case-insensitive search which already exists in the old version is the convenient one: http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html

I feel that as far as the search is concerned, Sanskrit ṣ can be regarded as the same category as s, i.e. s=ṣ (despite clearly having a different pronunciation), since they very frequently have the same etymology. As you know, the ṣ present in Ai. dhiṣṇya- is the s we find in Lat. fēriae (older fēsiae), festa and has been lost in fānum with compensatory vowel lengthening. The case is similar as the one we find (with a different sound allomorph) in English sun /s/, German Sonne /z/ and Dutch zon /z/. All three words have the same etymology. Regarding zon, Dutch speakers decided to use a different character to reflect the same sound as the one which is written in German the same way as the /s/ of Maus. Following this reasoning I see convenient to consider Sanskrit s and ṣ as the same category in the search machine. However, aṣṭā́(u) - ‘8’ is a case where ṣ does not come from s, but from ḱ.

Sanskrit ṭ ḍ ṇ ṃ sometimes come from pure t d n m sounds (e.g. pṛṣṭhá-m - ‘back’), sometimes come from consonant clusters *r̥C, *l̥C > C (e.g. paṭa < *pl̥ta-). In any case, treating cerebrals with non cerebrals together makes the search easier.

I would say that Sanskrit z/ś is a sound completely different from s, since the former reproduces proto-indoeuropean *ḱ, as is the case in śatám. I think it is better to treat z and s separately at the search machine, but anyway this is a matter of personal choices.

Regarding the distinction between adjectives, nouns and verbs, the issue is by no means easy. Some adjectives can also become nouns and vice-versa. In some cases we have a contradiction, as in hvAnIya, where we find an infinitive (and thus a noun) classified as an adjective. Perhaps it is a problem of the original text rather than from the computer edition.

With a view to a middle-term expansion of the database with etymological entries from Mayrhofer, I just know a bit of Java. Perhaps this can help but I would need to have an electronic version of the Mayrhofer dictionary, which for the time being does not exist.

gasyoun commented 7 years ago

Following this reasoning I see convenient to consider Sanskrit s and ṣ as the same category in the search machine.

Sure.

In any case, treating cerebrals with non cerebrals together makes the search easier.

Yes!

I would say that Sanskrit z/ś is a sound completely different from s, since the former reproduces proto-indoeuropean *ḱ, as is the case in śatám. I think it is better to treat z and s separately at the search machine, but anyway this is a matter of personal choices.

For search engine I would have ś ṣ s all equal (as option). It's not about etymology, what you try to do is smarter than needed.
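The two positions can be sketched as one folding function with an option. This is only an illustration in Python over SLP1 keys; `fold_slp1` and its parameter name are hypothetical, and the table is limited to the letters mentioned in this thread.

```python
# Cerebrals (and anusvara) folded onto their plain counterparts, in SLP1.
BASE_FOLD = {
    "z": "s",            # cerebral sibilant -> dental sibilant
    "w": "t", "W": "T",  # cerebral t, th -> dental t, th
    "q": "d", "Q": "D",  # cerebral d, dh -> dental d, dh
    "R": "n",            # cerebral nasal -> dental nasal
    "M": "m",            # anusvara -> m
}


def fold_slp1(key, merge_palatal_sibilant=False):
    """Reduce an SLP1 spelling to a search key.

    With merge_palatal_sibilant=True the palatal sibilant (SLP1 'S') is also
    folded onto 's', so that all three sibilants compare equal.
    """
    table = dict(BASE_FOLD)
    if merge_palatal_sibilant:
        table["S"] = "s"
    return "".join(table.get(ch, ch) for ch in key)


# Both the stored headword and the user's input are folded before comparing.
print(fold_slp1("DizRya") == fold_slp1("Disnya"))  # True
print(fold_slp1("Satam") == fold_slp1("satam"))    # False
print(fold_slp1("Satam", True) == fold_slp1("satam", True))  # True
```

Keeping the third sibilant behind an option leaves the etymological objection and the "all equal" preference as a per-search choice rather than a fixed rule.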

Some adjectives can also become nouns and vice-versa

So I would have 1 category for all adjectives and nouns in 1 bucket. The question is about verbs and non-verbs.

Perhaps this can help but I would need to have an electronic version of the Mayrhofer dictionary, which for the time being does not exist.

It exists, as part of https://www.universiteitleiden.nl/en/research/research-projects/humanities/indo-european-etymological-dictionary - Lubotsky told me in 2006.

gasyoun commented 7 years ago

Let's move the Greek letters a bit, so we can easily see who is above whom.

graha

http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/indexcaller.php graha

funderburkjim commented 7 years ago

Good suggestion.

I also think we should change the pwg.txt digitization section markup:

²b) {%das Ergriffene%} u. s. w.: ¹a) {%Beute%} ¯{¤MBH. 3, 11461.¤} {#Syeno grahAluYcane#} ¯{¤MR2K4K4H. 50, 15.¤} -- ¹b) {%haustus, das was mit dem 

¹a)  -> ¹α),  ¹b) -> ¹β)   etc.

I'm not sure whether to leave the superscripts 1 and 2, and there's also a superscript 3 for number subheads: ³1) ³2) etc. An alternative might be to introduce some xml-type tags; this might be less obscure than the superscripts.

I would also like to change the digitization so that lines are not so long. The pwg digitization does not represent the printed text line breaks, as has been mentioned.
But some more rational system of lines within the pwg.txt file would make it easier to work with (some so-called lines in pwg.txt may be many thousands of characters in length).

gasyoun commented 7 years ago

The pwg digitization does not represent the printed text line breaks, as has been mentioned.

Oh, missed that.

But Some more rational system of lines within the pwg.txt file would make it easier to work with

But there is no easy way to deal with it. Or just add mechanical breaks?

funderburkjim commented 7 years ago

The best way would be to manually insert a marker where the text line breaks occur; but this is too time-consuming a task to undertake.

Thus, some programmatically feasible approach would have to be taken. Some desiderata of the end result might be:

- have each subsection begin on a new line
- have markup (such as Sanskrit text - words and quotes) be contained within a line
- have literary source markup be contained within a line.
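A minimal Python sketch of the first desideratum, respecting the other two, might look like this. It assumes that subsection markers have the shape seen in the pwg.txt sample quoted earlier (a superscript 1, 2 or 3, then letters or digits, then a closing parenthesis) and that all markup is enclosed in braces; the function name `break_at_subsections` is hypothetical.

```python
import re

# Assumed shape of a subsection marker, e.g. ¹a) ²b) ³1) ¹α)
MARKER = re.compile(r"[¹²³][0-9a-zα-ω]+\)")


def break_at_subsections(text):
    """Start a new line before each subsection marker found outside braces."""
    out = []
    depth = 0  # > 0 while inside {...} markup, where no break is allowed
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        m = MARKER.match(text, i) if depth == 0 else None
        if m:
            while out and out[-1] == " ":
                out.pop()  # drop spaces that would end the previous line
            if out:
                out.append("\n")
            out.append(m.group())
            i = m.end()
            continue
        out.append(ch)
        i += 1
    return "".join(out)


sample = "²b) {%das Ergriffene%} u. s. w.: ¹a) {%Beute%} -- ¹b) {%haustus%}"
print(break_at_subsections(sample))
```

Because the brace depth is tracked, a marker-like sequence inside Sanskrit text or a literary source reference never triggers a break, so each markup span stays within one line.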

This file is a display of the distribution of line lengths in the current pwg.txt.

gasyoun commented 7 years ago

27.9% of non-empty lines have length < 100 characters. 8.24% have lengths in range 90-99 characters

Is quite actionable.

have each subsection begin on a new line

Right and if we count how many characters per line, then we can add pseudo-line breaks.
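Counting characters to place pseudo-line breaks could be sketched as below, in Python. The function name `soft_wrap` and the default width of 100 are assumptions; the sketch breaks only at spaces outside brace-delimited markup, so a line can still exceed the width when a single markup span is longer than the limit.

```python
def soft_wrap(text, width=100):
    """Split text into lines of at most `width` characters where possible,
    breaking only at spaces that lie outside {...} markup."""
    lines = []
    start = 0        # index where the current line begins
    last_space = -1  # last breakable space seen on the current line
    depth = 0        # > 0 while inside {...} markup
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        elif ch == " " and depth == 0:
            last_space = i
        # Line too long and a safe break point exists: cut there.
        if i - start + 1 > width and last_space > start:
            lines.append(text[start:last_space])
            start = last_space + 1
            last_space = -1
    lines.append(text[start:])
    return lines


print(soft_wrap("aa {#bb cc#} dd ee", width=6))
# ['aa', '{#bb cc#}', 'dd ee']
```

Joining the returned lines with single spaces restores the input, so the pseudo-breaks can be removed again without loss.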

gasyoun commented 7 years ago

Missing in http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2013/web/webtc/indexcaller.php as well.

funderburkjim commented 7 years ago

The display for PW has now been adjusted:

@gasyoun Is this what you had in mind?

funderburkjim commented 7 years ago

PWG is similarly altered.

The only thing I see as suboptimal in the above is, as with '--Beta) Planet ...', that the long text doesn't also indent. I've only indented the start of the subsection (using &nbsp;). Probably there is some way to indent the whole subsection with css.
I'm not sure it's worthwhile to take the time to understand how to make this further adjustment.

gasyoun commented 7 years ago

Is this what you had in mind?

Thanks, yes.

Probably there is some way to indent the whole subsection with css.

Sure, let me check if I do not forget.

graha

vgl. (Greek) β. Die Planeten

Why is (Greek) still there, if nothing is missing? Remove it? Hope we can retain the markup, but have it shown in a different way; otherwise I start looking for what to correct and there is nothing to be corrected.

funderburkjim commented 7 years ago

The (Greek) is removed.

but have it shown in a different way,

The xml markup remains (in pwg.xml, as <lang n="Greek">β</lang>).

After the change to the display, there is no visual distinction in the html for Greek.

Do you need the Greek (and Arabic and Russian and OldHebrew) to be visually distinctive in the html displays?

gasyoun commented 7 years ago

Do you need the Greek (and Arabic and Russian and OldHebrew) to be visually distinctive in the html displays?

No. Enough that they are marked in the code. Too bad that only they are.

gasyoun commented 7 years ago

@funderburkjim , let's remove the (OldHebrew) בני אלים, (OldHebrew) בני אלהים and similar.

gasyoun commented 7 years ago

@funderburkjim issue still there, language names need to be removed in display.

greek

funderburkjim commented 7 years ago

the 'who is above who' suggestion.

To accomplish this, we'll need to enhance the PWG markup, similarly to the way the <div> markup was added to ap.xml, as discussed in #113.

funderburkjim commented 7 years ago

(Greek) ... (OldHebrew)

Modified Basic and other displays to avoid showing the language name.

This is part of disp.php that was modified:

  } else if ($el == "lang"){
   $n = $attribs['n'];
   if ($n == 'Russian') {
    // nothing to do
   }else if ($n == 'Arabic'){ 
    //$row .= "<span style='background-color:yellow'>"; // Temporary April 8, 2015
    $row .= "<span>";
   }else {
    //$row .= "<span class='lang'>($n) ";
    // 04-19-2017. Removed showing language name   <<<<<<<<<<<<<<<<
    $row .= "<span class='lang'>";  
   }

gasyoun commented 7 years ago

To accomplish this, we'll need to enhance the PWG markup, similarly to the way the <div> markup was added to ap.xml

Oh, that's not a small task, got it.

Modified Basic and other displays to avoid showing the language name.

Hurray!
