sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

ap.xml issues #113

Closed drdhaval2785 closed 7 years ago

drdhaval2785 commented 7 years ago
  1. The daRqa (danda) has been placed outside the <s> tag. See </s><lb/>.<s>

<s>divasasyAzwame BAge SAkaM pacati yo naraH </s><lb/>.<s> afRI cApravAsI ca sa vAricara modate ..</s>

As a result, the period also shows up in the web display and in the stardict conversion.

Verified: the web display shows the period.
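A minimal sketch of how such cases might be located in ap.xml, assuming the problem always takes the exact form `</s><lb/>.<s>`. The function name `find_stray_dandas` is hypothetical and not part of the repository's tooling.

```python
import re

# Assumed pattern: a lone period between a closing </s>, a line break, and the next <s>.
STRAY_DANDA = re.compile(r"</s><lb/>\.<s>")

def find_stray_dandas(lines):
    """Return (line_number, match_count) pairs for lines containing the pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        count = len(STRAY_DANDA.findall(line))
        if count:
            hits.append((lineno, count))
    return hits

sample = [
    "<s>divasasyAzwame BAge SAkaM pacati yo naraH </s><lb/>.<s> afRI cApravAsI ca sa vAricara modate ..</s>",
    "<s>a</s> The first letter of the alphabet;",
]
print(find_stray_dandas(sample))
```

This only reports candidates; deciding whether each period belongs inside the Sanskrit text still needs a human look.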

drdhaval2785 commented 7 years ago

This seems to be related to a recent discussion (which I can't locate) in which text crossing a line break was given separate <s> tags.

funderburkjim commented 7 years ago

What is the AP headword where this occurs?

drdhaval2785 commented 7 years ago

afRin

drdhaval2785 commented 7 years ago

One potential enhancement which is quite useful.

AP has markings for layers of information. They can be indented to give a proper display.

Look at superscript 2 and superscript 3. [screenshot]

funderburkjim commented 7 years ago

afRin problem

Problem solved as follows.

funderburkjim commented 7 years ago

Change display of subsection.

It will be easier to change ap.xml, since this is done by a Python program which can handle the Unicode characters (superscripts 1,2,3) more reliably.

Probably will add markup so .²1 -> <div n="2">1</div>, and similarly for others (Something like this is done with PW, as I recall).

The question then will be how to render the new markup in the displays.

Suggestions?

gasyoun commented 7 years ago

Probably will add markup so .²1 -> <div n="2">1</div>, and similarly for others (Something like this is done with PW, as I recall). The question then will be how to render the new markup in the displays.

Make it close to book visually? Bolded numbers.

drdhaval2785 commented 7 years ago

Make it close to book visually? Bolded numbers.

Those were the days when everything was printed. Cost saving by reducing spaces was a concern. Now times have changed. We can properly indent them for easier readability now. No need to retrogress.

gasyoun commented 7 years ago

No need to retrogress.

Bolded numbers are not retro. They catch the eye.

drdhaval2785 commented 7 years ago

For bolded numbers, I am with you. But the overall display should have appropriate indentations.

funderburkjim commented 7 years ago

improved version ready for review

A version of ap.xml, with related changes to the display, is now ready for your viewing enjoyment.

The changes are evident in the basic, list, etc. displays [Not giving links because of the restricted status of the dictionary -- I'm assuming the interested parties know how to navigate to these displays.]

However, for the sake of allowing comparisons to the previous version, I haven't yet installed the changes in list-02.html display.

Take a look, and give me feedback.

When there is general agreement on the changes, I'll finish installation steps, and describe some details of the process.

gasyoun commented 7 years ago

Jim, have these "glued" words always existed, or is this something new? In dA:

toexchange
one'slife
sometimes

Otherwise it's much better.

Still I have a long pending question.

[image: lp]

I do not like what I see on the left, the way the numbers are presented. Nobody I asked understood what L or p means. I would suggest marking them in different colours and removing the L and p tags. And make p a single line with no break, as in http://stackoverflow.com/questions/7219007/html-no-line-break-at-hyphens

drdhaval2785 commented 7 years ago

When an entry spans more than one line, the next line starts at the same indent as the numbers. See [screenshot].

drdhaval2785 commented 7 years ago

I would like something like this. See how the 1, 2, etc. stand out from the crowd.

[screenshot]

gasyoun commented 7 years ago

I agree, Google's spacing is well thought out.

funderburkjim commented 7 years ago

The 'indent' question is one I struggled with.

I first tried the css text-indent property. But ended up using a 'position:relative; left:2em;' style.

I think that the hanging part of text-indent might give the feature you suggest, but this is not implemented in browsers currently, acc. to MDN, and according to my experiments trying it.

If you can show me how to implement the indentation style your image shows, I'll be glad to use it.

funderburkjim commented 7 years ago

Regarding the 'L=', etc. comments.

I also think the current format is awkward.

Currently, the whole part of the basic display is a table with 2 columns; the 'key1, L=,p=' part is in the first column, and the main entry is in second column.

What about making it just one column, and changing the labeling? [Idea implemented experimentally -- take a look.]

funderburkjim commented 7 years ago

toexchange

This is a bug in the revised make_xml.py program. Bug now corrected. Good catch! 👍

drdhaval2785 commented 7 years ago

@funderburkjim

First few lines in ap.txt

.{#a#}¦ The first letter of the alphabet; {#akzarARAmakAro'smi#}  Bg. 10. 33.
.{#{@-aH@}#} [{#avati, atati sAtatyena tizWatIti vA; av--at vA, qa#} Tv.]

Whereas it is rendered in ap.xml as

<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s>   The first letter of the alphabet; <s>akzarARAmakAro'smi</s>  Bg. 10. 33.<lb/>.<b><s>-aH</s></b> [<s>avati, atati sAtatyena tizWatIti vA; av--at vA, qa</s> Tv.]<lb/>

Have a look at <lb/>.<b>. There is a superfluous period here. .{# signifies the start of a chunk in AP, so there is no need to keep the period there. @funderburkjim what is your take?

drdhaval2785 commented 7 years ago

The euro character (€) has not been removed. It identifies verbs, but we should identify the verb numbers and tag them in the XML rather than keep the euro character.

<H1><h><key1>aMh</key1><key2>aMh</key2></h><body><s>aMh</s>   €1A <s>aMhate, aMhituM</s> To go; approach; set out; Bk. 3. 25,<lb/>46, <s>AnaMhe cAntikaM pituH</s> 14. 51, 4. 4. &amp;c. <i>-Caus.</i><lb/>.²1 To send; <s>tamAYjihanmETilayajYaBUmiM</s> Bk. 2. 40, 15. 75.<lb/>.²2 To shine.<lb/>.²3 To speak.</body><tail><L>20</L><pc>0002-2</pc></tail></H1>

See €1A

funderburkjim commented 7 years ago

.{# and new xml structure

Here is the first part of headword 'a' in the revised xml:

<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s>   The first letter of the alphabet; 
<s>akzarARAmakAro'smi</s>  Bg. 10. 33. 

<div n="?"><b><s>-aH</s></b> [<s>avati, atati sAtatyena tizWatIti vA; av--at vA, qa</s> Tv.]</div> 

<div n="2" name="1">1 N. of Viṣṇu, the first of the three sounds constituting the sacred syllable <s>om; akAro vizRuruddizwa ukArastu maheSvaraH . makArastu smfto</s> <s>brahmA praRavastu trayAtmakaH ..</s> For more explanations of the three syllables <s>a, u, m</s> see <s>om</s>.</div> 

I think this 'div' structure takes care of your concerns there. @drdhaval2785 Agree?
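A minimal sketch of the superscript-to-div step, for illustration only. It is not the actual make_xml.py code: the function name `add_divs`, the assumption that every section starts with `<lb/>.` followed by ² or ³ and a space-delimited label, and the choice to copy the label into a `name` attribute (as in the sample above) are all assumptions. The `n="?"` case is not handled.

```python
import re

SUPERSCRIPT_LEVEL = {"²": "2", "³": "3"}
# Assumed section start: "<lb/>." + superscript + label (label ends at whitespace).
SECTION_START = re.compile(r"<lb/>\.([²³])(\S+)")

def add_divs(body):
    """Wrap each superscript section of an entry body in <div n=... name=...>."""
    parts = []
    pos = 0
    open_div = False
    for m in SECTION_START.finditer(body):
        parts.append(body[pos:m.start()])   # text before this section marker
        if open_div:
            parts.append("</div>")          # close the previous section
        level = SUPERSCRIPT_LEVEL[m.group(1)]
        label = m.group(2)
        parts.append('<div n="%s" name="%s">%s' % (level, label, label))
        open_div = True
        pos = m.end()
    parts.append(body[pos:])                # remainder after the last marker
    if open_div:
        parts.append("</div>")
    return "".join(parts)

print(add_divs("Caus.<lb/>.²1 To send.<lb/>.²2 To shine.<lb/>.²3 To speak."))
```

Because each div is closed just before the next one opens, the result is a flat sequence of sibling divs with no nesting.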

funderburkjim commented 7 years ago

€ and roots

There are 3068 lines matching , and these do appear to be roots.

As usual, there are multiple forms that need to be identified, and some likely errors also.

We could employ an xml markup similar to that of MW; here's a sample of MW under hw aMS

<vlex type="root"></vlex> <vlex>cl.10 P.</vlex> 

And here is the full record for the root aMh in MW:
<H1><h><hc3>503</hc3><key1>aMh</key1><hc1>1</hc1><key2>aMh</key2><hom>1</hom></h>
<body> <vlex type="root"></vlex> <p><cf/>~<root/>~<s>aNG</s></p> <vlex>cl.1 A1.</vlex> 
<s>aMhate</s> , <c><to/>to_go_,_set_out_,_commence</c> <ls>L.</ls> <msc/> <c>
<to/>to_approach</c> <ls>L.</ls> <msc/> <vlex>cl.10 P.</vlex> <s>aMhayati</s> , <c>
<to/>to_send</c> <ls>Bhat2t2.</ls> <msc/> <c><to/>to_speak</c> <ls>Bhat2t2.</ls> <msc/> <c>
<to/>to_shine</c> <ls>L.</ls> </body><tail><mul/> <MW>000068</MW> <pc>1,2</pc> <L>107</L><mscverb/></tail></H1>

In AP case, we could render €1A as <vlex>1A</vlex> in ap.xml.

While most cases are simple like this, it will take some work to completely handle all relevant cases. Here are some other forms that catch my eye:

.{#Acakz#}¦ €2Ā.   Notice the Macron on the A
.{#AGf#}¦ €10P. or {%Caus.%} To pour down upon, sprinkle.  should the 'Caus' be in scope of <vlex>?
.{#ArAD#}¦ €5, 10 P.
.{#akz#}¦ €15P.      Probably should be at least 1,5P  (with added comma)
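A minimal sketch of how the simple cases might be tagged while the irregular ones are set aside for manual review. The function name `tag_simple_roots` and the definition of "simple" (digits followed directly by A or P) are assumptions for illustration; nothing like this is confirmed to exist in the repository.

```python
import re

# Assumed "simple" form: euro sign, class number, then A or P (e.g. "€1A", "€10P").
SIMPLE_ROOT = re.compile(r"€(\d+[AP])\b")

def tag_simple_roots(line):
    """Return (new_line, needs_review).

    Simple forms become <vlex>...</vlex>; any euro sign left over
    (macron Ā, comma-separated classes, etc.) flags the line for review.
    """
    new_line = SIMPLE_ROOT.sub(r"<vlex>\1</vlex>", line)
    needs_review = "€" in new_line
    return new_line, needs_review

for sample in [
    "<s>aMh</s>   €1A <s>aMhate, aMhituM</s> To go;",
    ".{#Acakz#}¦ €2Ā.",
    ".{#ArAD#}¦ €5, 10 P.",
]:
    print(tag_simple_roots(sample))
```

Note that a form like €15P passes as "simple" here, so suspected errors of that kind (1,5P written without the comma) would still have to be found by other means.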

I have so many things on my todo list that I am reluctant to volunteer to do the needed work to add this markup to the xml form of ap.

funderburkjim commented 7 years ago

Since there have been no additional comments regarding the revised ap displays (in particular regarding the adjustment to the handling of [p=123][L=345] ), I'll assume that it is safe to go ahead and complete the full installation of the current revised xml, and the corresponding revisions to the displays.

funderburkjim commented 7 years ago

documentation of adding <div> markup to ap.xml

The ap.txt digitization has a form of markup for sections. This ap.txt markup is identified by lines that start with a period. Before markup can be added to ap.xml, it is necessary to classify the various types of lines that start with '.'; and correct mistakes that impede the classification.

four categories of lines in ap.txt beginning with periods:

To decide the superscript cases, the ap.txt was filtered using re.search(u'(.)?([²³][^ ]*)',line), and the categories printed out. See filter_test1_cases_orig.txt here.

As you see, there are some garbagey looking cases. The most labor-intensive part of the exercise is to identify and correct these. The corrections are in the 'superscript_changes.txt' file of the gist just mentioned.

After correction, the set of superscript cases is quite regular, as described above; this list is in the 'filter_test1_cases.txt' file of the gist. We now are ready to change these ap.txt markups to ap.xml markups.
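The filtering step described above can be sketched roughly as follows. The regex is the one quoted in this comment; the function name `tally_superscript_cases` and the normalization of digits to "N" (so that irregular forms stand out in a short list) are assumptions, not the method actually used to produce the gist files.

```python
import re
from collections import Counter

# Regex as quoted in the comment above.
SUPERSCRIPT = re.compile(r"(.)?([²³][^ ]*)")

def tally_superscript_cases(lines):
    """Count the distinct superscript tokens on lines that start with a period."""
    counts = Counter()
    for line in lines:
        if not line.startswith("."):
            continue
        m = SUPERSCRIPT.search(line)
        if m:
            # Assumption: collapse digit runs so ".²1" and ".²2" fall in one category.
            category = re.sub(r"[0-9]+", "N", m.group(2))
            counts[category] += 1
    return counts

sample = [
    ".²1 To send;",
    ".²2 To shine.",
    ".²3 To speak.",
    ".³({%a%}) a sub-sense",
    ".{#a#}¦ The first letter of the alphabet;",
    "plain text ²9",
]
for category, n in sorted(tally_superscript_cases(sample).items()):
    print(category, n)
```

With the regular cases collapsed into a handful of categories, anything else in the tally is a candidate for correction.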

Converting to markup in ap.xml

This is done by the 'make_xml.py' program, and the process involved two phases:

drdhaval2785 commented 7 years ago

I think this 'div' structure takes care of your concerns there. @drdhaval2785 Agree?

Yes

drdhaval2785 commented 7 years ago

There probably is more hierarchical structure than this markup choice identifies. We just have implemented a simple sequence of div elements, and no div element is a 'child' of another div element.

There may be some hierarchy uncaptured. But at least we made a start. Slowly we will inch there too.

drdhaval2785 commented 7 years ago

simple sequence of div elements

Is sufficient to properly indent the display. Good leap in readability and user friendliness.

funderburkjim commented 7 years ago

documentation of the html rendering of the divs

The html rendering of the <div> elements is done in the disp.php program of web/webtc .

This program is used directly by the 'basic' display. Since the other displays (list display, advanced search, mobile-friendly, apidev/sample/list-02.html, etc.) piggy-back on disp.php, the change is reflected in all the displays.

step 1. make the first 'word' of the div bold

This is done by rewriting the xml at the start of each div; here the $line variable contains the entire xml line: <H1>...<body>...<div..>W ...</div>...</body>...</H1> and we replace W by <b>W</b>.

 $line = preg_replace('|(<div[^>]*>)(\(<i>.</i>\))|','\\1<b>\\2</b>',$line);
 $line = preg_replace('|(<div[^>]*>)([0-9]+)|','\\1<b>\\2</b>',$line);

Note that this applies to the two superscript cases of ap.txt. For the other div case, we assume the element is already bold.
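For anyone who wants to try the two substitutions outside of PHP, here is a Python rendering of the same patterns. It is only a sketch for experimentation (the function name `bold_div_labels` is made up); disp.php remains the place where the real work is done.

```python
import re

def bold_div_labels(line):
    """Wrap the leading label of each div in <b>...</b>."""
    # Labels of the form (<i>x</i>) right after the opening div tag
    line = re.sub(r"(<div[^>]*>)(\(<i>.</i>\))", r"\1<b>\2</b>", line)
    # Numeric labels right after the opening div tag
    line = re.sub(r"(<div[^>]*>)([0-9]+)", r"\1<b>\2</b>", line)
    return line

print(bold_div_labels('<div n="2" name="1">1 N. of Viṣṇu</div>'))
print(bold_div_labels('<div n="3">(<i>a</i>) text</div>'))
```

A div that already starts with <b>, like the n="?" kind, matches neither pattern and passes through unchanged.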

step 2. indentation

For the 'n=2' superscript, we indent by 1em and for the 'n=3' superscript, by 2em. The other kind of div is not indented.

  } else if ($el == "div") {
   $n=$attribs['n'];
   if ($n == '3') {
    $style="position:relative; left:2em;";
    $row .= "<br/><span style='$style'>";
   }else if ($n == '2') {
    $style="position:relative; left:1em;";
    $row .= "<br/><span style='$style'>";
   }else {
    $style="";
    //$row .= "<p style='$style'>";
    $row .= "<br/><span style='$style'>";
   }

This occurs in the context of a SAX xml parser, using the xml_parser_create and related functions of PHP; this is a php version of the expat parser for xml. It is likely that there is an expat parser that could be used in a browser's Javascript code to do all this rendering in the browser.

Anyway, the php parser essentially does a tree-walk of the xml structure, and when it encounters a <div> element, it examines the n attribute value, and based on this value (2, 3 or ?) constructs a style attribute that adds an extra 'left' offset to the span element that contains the subsequent text of the <div>.

Then when the end of the <div> is encountered later in the tree walk, the closing </span> element is inserted into the html stream under construction.

In an earlier comment above, I suggested using an empty div element for the xml markup. However, I changed this to enclose the entire division <div...>text of the div</div>, and a major reason was so that the rendering method above would know where to put that closing </span> element.
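The same open-span / close-span idea can be sketched with Python's expat bindings. This is an illustration under assumptions, not a port of disp.php: the function name `render_divs` is hypothetical, and every element other than div is simply dropped (only its text is kept), which the real display does not do.

```python
import xml.parsers.expat
from html import escape

# Indentation per 'n' value, as described in step 2 above.
INDENT = {"2": "1em", "3": "2em"}

def render_divs(xml_text):
    """Turn <div n=...> elements into <br/><span style=...>...</span>."""
    out = []

    def start(name, attrs):
        if name == "div":
            left = INDENT.get(attrs.get("n"))
            style = "position:relative; left:%s;" % left if left else ""
            out.append("<br/><span style='%s'>" % style)

    def end(name):
        if name == "div":
            out.append("</span>")   # possible only because the div encloses its text

    def chars(data):
        out.append(escape(data, quote=False))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return "".join(out)

print(render_divs('<body>x<div n="2">1 To send</div><div n="3">(a) sub</div></body>'))
```

The end handler is where the enclosing-div decision pays off: with an empty marker element there would be no event at which to emit the closing </span>.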

That's it. Not too hard, once all the context is understood.

gasyoun commented 7 years ago

In AP case, we could render €1A as <vlex>1A</vlex> in ap.xml.

Agree.

.{#AGf#}¦ €10P. or {%Caus.%} To pour down upon, sprinkle. should the 'Caus' be in scope of <vlex>?

It should not. For causative forms, wouldn't even some general verbal tag do? Sure, it's better than nothing, but since the abbreviation Caus. occurs in many dictionaries, it could be used for regexing them out and giving them what they deserve.

.{#akz#}¦ €15P. Probably should be at least 1,5P (with added comma)

Yeah, indeed, plenty of issues.

funderburkjim commented 7 years ago

adjustment to css of list-0.2.html display

There was an annoying side-effect of the indentation, when viewed in the list-0.2.html display. Namely, part of the lines were hidden under the scrollbar.

This is no doubt due to the relative positioning technique used for indentation.

To improve this, a 'padding-right:15px' css rule was added to the list-0.2 display

tech note: in file apidev/css/basic.css, at #CologneBasic table.display rule.

This improves the situation for ap, and simply adds a little space at the right for other displays.

funderburkjim commented 7 years ago

Everything now installed.

gasyoun commented 7 years ago

(in particular regarding the adjustment to the handling of [p=123][L=345] )

p and L mean nothing to anybody. I know them, but none of my students could grasp them without me explaining what they are.

There may be some hierarchy uncaptured. But at least we made a start. Slowly we will inch there too.

Yeah.

For the 'n=2' superscript, we indent by 1em and for the 'n=3' superscript, by 2em. The other kind of div is not indented.

I would add a that is not n=2. I would add a CSS class. Then I could play around in the CSS file and see whether 2 is an appropriate choice.

To improve this, a 'padding-right:15px' css rule was added to the list-0.2 display

Makes sense, indeed. Let me write my proposals in a new thread.

funderburkjim commented 7 years ago

p and L ...

Did you notice the change in AP display? Isn't this better? image

gasyoun commented 7 years ago

Isn't this better?

Do not think so.

[record id=394] [scan-page 0030-2]

1) These are not obvious either. Is the record ID in the book, and where does one find it? What is the -2 in the number? 2) I would make it like 0030-2: move the mouse over it and see the title attribute I've used. We can have a longer explanation there. The only thing is that the ID should be made a link as well. At least a link to # can be made, so a fake one is no big issue. And I would add a CSS class with a brighter colour, not black, for these numbers.

In my view, if explained in a FAQ and in the tooltip for each link, it's OK to have: [ID 394] [P 0030-2]

funderburkjim commented 7 years ago

Do not think so.

Sorry to hear that. Will add your suggestion to todo list (currently 7 deep).

gasyoun commented 7 years ago

Sorry to hear that.

It's not about me. I understand, sure. But it comes from what I've seen of how @Shalu411 used it initially, and of people for whom English is not their native language. The abbreviations are not obvious and need explanation. Even if they are made longer, a commentary and an intro are needed.