Closed drdhaval2785 closed 7 years ago
Seems to be related to the recent discussion (can't locate) in which text crossing line breaks was given separate <s> tags.
What is the AP headword where this occurs?
afRin
One potential enhancement which is quite useful.
AP has markings for layer of information. They can be indented to give proper display.
Look at superscript 2 and superscript 3.
afRin problem
Problem solved as follows.
.{#afRin#}¦ {%a.%} (epic) ({#f#} being here regarded as a consonant) Not
a debtor, free from debt; {#divasasyAzwame BAge SAkaM pacati yo naraH #}
.{# afRI cApravAsI ca sa vAricara modate ..#} Mb. The normal form {#anfRin#}
also occurs in this sense.
.{#afRin#}¦ {%a.%} (epic) ({#f#} being here regarded as a consonant) Not
a debtor, free from debt; {#divasasyAzwame BAge SAkaM pacati yo naraH .#}
{# afRI cApravAsI ca sa vAricara modate ..#} Mb. The normal form {#anfRin#}
also occurs in this sense.
Change display of subsection.
It will be easier to change ap.xml, since this is done by a Python program which can handle the Unicode characters (superscripts 1,2,3) more reliably.
Probably will add markup so .²1 -> <div n="2">1</div>, and similarly for others (Something like this is done with PW, as I recall).
The question then will be how to render the new markup in the displays.
Suggestions?
Probably will add markup so .²1 -> <div n="2">1</div>, and similarly for others (Something like this is done with PW, as I recall). The question then will be how to render the new markup in the displays.
Make it close to book visually? Bolded numbers.
Make it close to book visually? Bolded numbers.
Those were the days when everything was printed. Cost saving by reducing spaces was a concern. Now times have changed. We can properly indent them for easier readability now. No need to retrogress.
No need to retrogress.
Bolded numbers are not retro. They catch the eye.
For bolded numbers, I am with you. But the overall display should have appropriate indentations.
A version of ap.xml, with related changes to the display, is now ready for your viewing enjoyment.
The changes are evident in the basic, list, etc. displays [Not giving links because of the restricted status of the dictionary -- I'm assuming the interested parties know how to navigate to these displays.]
However, for the sake of allowing comparisons to the previous version, I haven't yet installed the changes in list-02.html display.
Take a look, and give me feedback.
When there is general agreement on the changes, I'll finish installation steps, and describe some details of the process.
Jim, have these "glued" words always existed, or is it something new? In dA:
toexchange
one'slife
sometimes
Otherwise it's much better.
Still I have a long pending question.
I do not like what I see on the left, the way the numbers are presented. Nobody I asked understood what L or p means. I would suggest marking them in different colours and removing the L and p tags. And make p one line with no break, like http://stackoverflow.com/questions/7219007/html-no-line-break-at-hyphens
When an entry spans more than one line, the next line starts at the same indent as the numbers. See
I would like something like this. See how the 1, 2, etc. stand out from the crowd.
I agree, Google's spacing is well thought.
The 'indent' question is one I struggled with.
I first tried the css text-indent property, but ended up using a 'position:relative; left:2em;' style.
I think that the hanging part of text-indent might give the feature you suggest, but this is not implemented in browsers currently, acc. to MDN, and according to my experiments trying it.
If you can show me how to implement the indentation style your image shows, I'll be glad to use it.
Regarding the 'L=', etc. comments.
I also think the current format is awkward.
Currently, the whole part of the basic display is a table with 2 columns; the 'key1, L=,p=' part is in the first column, and the main entry is in second column.
What about making it just one column, and changing the labeling? [Idea implemented experimentally -- take a look.]
toexchange
This is a bug in the revised make_xml.py program. Bug now corrected. Good catch! 👍
@funderburkjim
First few lines in ap.txt
.{#a#}¦ The first letter of the alphabet; {#akzarARAmakAro'smi#} Bg. 10. 33.
.{#{@-aH@}#} [{#avati, atati sAtatyena tizWatIti vA; av--at vA, qa#} Tv.]
Whereas it is rendered in ap.xml as
<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s> The first letter of the alphabet; <s>akzarARAmakAro'smi</s> Bg. 10. 33.<lb/>.<b><s>-aH</s></b> [<s>avati, atati sAtatyena tizWatIti vA; av--at vA, qa</s> Tv.]<lb/>
Have a look at <lb/>.<b> : there is a superfluous period here.
.{# signifies the start of a chunk in AP. No need to keep the period there.
@funderburkjim what is your take?
The euro character is not removed. It identifies verbs, but we should identify the verb numbers and tag them in the XML rather than keep the euro character.
<H1><h><key1>aMh</key1><key2>aMh</key2></h><body><s>aMh</s> €1A <s>aMhate, aMhituM</s> To go; approach; set out; Bk. 3. 25,<lb/>46, <s>AnaMhe cAntikaM pituH</s> 14. 51, 4. 4. &c. <i>-Caus.</i><lb/>.²1 To send; <s>tamAYjihanmETilayajYaBUmiM</s> Bk. 2. 40, 15. 75.<lb/>.²2 To shine.<lb/>.²3 To speak.</body><tail><L>20</L><pc>0002-2</pc></tail></H1>
See €1A
.{# and new xml structure
Here is the first part of headword 'a' in the revised xml:
<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s> The first letter of the alphabet;
<s>akzarARAmakAro'smi</s> Bg. 10. 33.
<div n="?"><b><s>-aH</s></b> [<s>avati, atati sAtatyena tizWatIti vA; av--at vA, qa</s> Tv.]</div>
<div n="2" name="1">1 N. of Viṣṇu, the first of the three sounds constituting the sacred syllable <s>om; akAro vizRuruddizwa ukArastu maheSvaraH . makArastu smfto</s> <s>brahmA praRavastu trayAtmakaH ..</s> For more explanations of the three syllables <s>a, u, m</s> see <s>om</s>.</div>
I think this 'div' structure takes care of your concerns there. @drdhaval2785 Agree?
€ and roots
There are 3068 lines matching €, and these do appear to be roots.
As usual, there are multiple forms that need to be identified, and some likely errors also.
We could employ an xml markup similar to that of MW; here's a sample of MW under hw aMS
<vlex type="root"></vlex> <vlex>cl.10 P.</vlex>
And here is the full record for the root aMh in MW:
<H1><h><hc3>503</hc3><key1>aMh</key1><hc1>1</hc1><key2>aMh</key2><hom>1</hom></h>
<body> <vlex type="root"></vlex> <p><cf/>~<root/>~<s>aNG</s></p> <vlex>cl.1 A1.</vlex>
<s>aMhate</s> , <c><to/>to_go_,_set_out_,_commence</c> <ls>L.</ls> <msc/> <c>
<to/>to_approach</c> <ls>L.</ls> <msc/> <vlex>cl.10 P.</vlex> <s>aMhayati</s> , <c>
<to/>to_send</c> <ls>Bhat2t2.</ls> <msc/> <c><to/>to_speak</c> <ls>Bhat2t2.</ls> <msc/> <c>
<to/>to_shine</c> <ls>L.</ls> </body><tail><mul/> <MW>000068</MW> <pc>1,2</pc> <L>107</L><mscverb/></tail></H1>
In AP case, we could render €1A as <vlex>1A</vlex> in ap.xml.
While most cases are simple like this, it will take some work to completely handle all relevant cases. Here are some other forms that catch my eye:
.{#Acakz#}¦ €2Ā. Notice the Macron on the A
.{#AGf#}¦ €10P. or {%Caus.%} To pour down upon, sprinkle. should the 'Caus' be in scope of <vlex>?
.{#ArAD#}¦ €5, 10 P.
.{#akz#}¦ €15P. Probably should be at least 1,5P (with added comma)
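The simple €1A -> <vlex>1A</vlex> case could be sketched roughly as below. This is only an illustration, not code from make_xml.py: the function name mark_roots and the pattern are assumptions, and the pattern covers only the forms quoted in this comment (one or more class numbers followed by P, A or Ā). Anything it does not recognize keeps its € so it can be reviewed by hand.

```python
import re

# Assumption: a "simple" root marker is the euro sign, one or more class
# numbers (comma separated), an optional space, a voice letter (P, A or
# A-with-macron, U+0100) and an optional period.
SIMPLE_ROOT = re.compile(r'€(\d+(?:,\s*\d+)*\s*[PA\u0100]\.?)')

def mark_roots(line):
    """Wrap simple root markers in <vlex>; return (new_line, unhandled_count).

    unhandled_count is the number of euro signs still present afterwards,
    i.e. markers that did not fit the simple pattern.
    """
    new_line = SIMPLE_ROOT.sub(r'<vlex>\1</vlex>', line)
    return new_line, new_line.count('€')
```

The text after the euro sign is copied verbatim into <vlex>, so a questionable source form such as €15P. is tagged as it stands rather than silently corrected.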
I have so many things on my todo list that I am reluctant to volunteer to do the needed work to add this markup to the xml form of ap.
Since there have been no additional comments regarding the revised ap displays (in particular regarding the adjustment to the handling of [p=123][L=345]), I'll assume that it is safe to go ahead and complete the full installation of the current revised xml, and the corresponding revisions to the displays.
<div> markup to ap.xml
The ap.txt digitization has a form of markup for sections. This ap.txt markup is identified by lines that start with a period. Before markup can be added to ap.xml, it is necessary to classify the various types of lines that start with '.', and correct mistakes that impede the classification.
.{#a#}¦ : the broken bar is part of the identification of this class. These do not require additional markup.
.²1 N. of Viṣṇu ... : these are of form .² + digit-sequence.
.³({%a%}) N. of Viṣṇu ... : these are of form .³ + ({%X%}), i.e., an italicized letter in parentheses.
.{@{#-aH#}@} : not a headword, since no broken bar.
To decide the superscript cases, the ap.txt was filtered using re.search(u'(.)?([²³][^ ]*)',line), and the categories printed out. See filter_test1_cases_orig.txt here.
As you see, there are some garbagey looking cases. The most labor-intensive part of the exercise is to identify and correct these. The corrections are in the 'superscript_changes.txt' file of the gist just mentioned.
After correction, the set of superscript cases is quite regular, as described above; this list is in the 'filter_test1_cases.txt' file of the gist. We now are ready to change these ap.txt markups to ap.xml markups.
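For illustration, the classification of lines starting with '.' could be sketched like this. It is not the filter actually used; the function names and category labels are made up here, and the patterns are assumptions based only on the forms quoted in this thread.

```python
import re
from collections import Counter

def classify(line):
    """Return a rough category label for one ap.txt line (labels are hypothetical)."""
    if not line.startswith('.'):
        return 'text'                      # ordinary continuation line
    if line.startswith('.{#') and '¦' in line:
        return 'headword'                  # broken bar identifies a headword
    if re.match(r'\.²\d+(?:\s|$)', line):
        return 'sup2'                      # .² + digit-sequence
    if re.match(r'\.³\(\{%.%\}\)', line):
        return 'sup3'                      # .³ + italicized letter in parentheses
    if line.startswith(('.{@', '.{#{@')):
        return 'subhead'                   # bold sub-entry, no broken bar
    return 'unclassified'                  # candidates for manual correction

def tally(lines):
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)
```

Printing tally() over the whole file gives a quick view of how many lines still land in 'unclassified'; a stray line such as the '.{# afRI ...' one from the afRin entry would show up there.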
The change from ap.txt markup to ap.xml markup is done by the 'make_xml.py' program, and the process involves two phases:
1) Insert the opening div tag. This is accomplished in the adjust_xml function.
.²5 -> <div n="2" name="5">5
.³({%a%}) -> <div n="3" name="a">({%a%})
.{@{#-aH#}@} -> <div n="?">{@{#-aH#}@}
2) Insert the closing </div> tag. This is done by the close_divs function.
We want this to be at the end of the scope of the opening tag. The easiest way is to say that the scope of an opening <div> goes all the way up to the next opening <div>.
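A rough sketch of the two phases follows. The real adjust_xml and close_divs functions of make_xml.py are not shown in this thread, so open_div and close_open_divs below are hypothetical stand-ins that work on a list of ap.txt lines for one entry; only the three mappings and the "scope runs to the next opening <div>" rule are taken from the description above.

```python
import re

def open_div(line):
    """Phase 1 (sketch): replace the leading ap.txt marker by an opening <div>."""
    m = re.match(r'\.²(\d+)', line)
    if m:
        # .²5 ... -> <div n="2" name="5">5 ...
        return '<div n="2" name="%s">%s' % (m.group(1), line[2:])
    m = re.match(r'\.³\(\{%(.)%\}\)', line)
    if m:
        # .³({%a%}) ... -> <div n="3" name="a">({%a%}) ...
        return '<div n="3" name="%s">%s' % (m.group(1), line[2:])
    if line.startswith('.{@'):
        # .{@{#-aH#}@} ... -> <div n="?">{@{#-aH#}@} ...
        return '<div n="?">' + line[1:]
    return line  # headword lines and plain text are left alone

def close_open_divs(lines):
    """Phase 2 (sketch): close each <div> just before the next one, or at the end."""
    out = []
    div_is_open = False
    for line in lines:
        if line.startswith('<div'):
            if div_is_open:
                out[-1] += '</div>'   # previous div ends where this one starts
            div_is_open = True
        out.append(line)
    if div_is_open:
        out[-1] += '</div>'           # last div ends with the entry
    return out
```

Because every div is closed before the next one opens, the result is a flat sequence of div elements with no nesting.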
I think this 'div' structure takes care of your concerns there. @drdhaval2785 Agree?
Yes
There probably is more hierarchical structure than this markup choice identifies. We just have implemented a simple sequence of div elements, and no div element is a 'child' of another div element.
There may be some hierarchy uncaptured. But at least we made a start. Slowly we will inch there too.
simple sequence of div elements
Is sufficient to properly indent the display. Good leap in readability and user friendliness.
The html rendering of the <div> elements is done in the disp.php program of web/webtc.
This program is used directly by the 'basic' display. Since the other displays (list display, advanced search, mobile-friendly, apidev/sample/list-02.html, etc.) piggy-back on disp.php, the change is reflected in all the displays.
The label at the start of each div is made bold by rewriting the xml; here the $line variable contains the entire xml line: <H1>...<body>...<div..>W ...</div>...</body>...</H1> and we replace W by <b>W</b>.
$line = preg_replace('|(<div[^>]*>)(\(<i>.</i>\))|','\\1<b>\\2</b>',$line);
$line = preg_replace('|(<div[^>]*>)([0-9]+)|','\\1<b>\\2</b>',$line);
Note that this applies to the two superscript cases of ap.txt. For the other div case, we assume the element is already bold.
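For readers more at home in Python than PHP, the same two substitutions can be written with re.sub. This is just a transliteration of the two preg_replace calls above for illustration; the function name is made up and nothing like it is claimed to exist in the repository.

```python
import re

def bold_div_labels(line):
    """Wrap the label right after each opening <div> in <b>...</b>."""
    # n=3 case: a parenthesized italic letter, e.g. (<i>a</i>)
    line = re.sub(r'(<div[^>]*>)(\(<i>.</i>\))', r'\1<b>\2</b>', line)
    # n=2 case: a digit sequence, e.g. 1, 2, 15
    line = re.sub(r'(<div[^>]*>)([0-9]+)', r'\1<b>\2</b>', line)
    return line
```

A div that starts with anything else (such as the already bold <b><s>-aH</s></b> case) is left unchanged.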
For the 'n=2' superscript, we indent by 1em, and for the 'n=3' superscript, by 2em. The other kind of div is not indented.
} else if ($el == "div") {
  $n = $attribs['n'];
  if ($n == '3') {
    $style = "position:relative; left:2em;";
    $row .= "<br/><span style='$style'>";
  } else if ($n == '2') {
    $style = "position:relative; left:1em;";
    $row .= "<br/><span style='$style'>";
  } else {
    $style = "";
    //$row .= "<p style='$style'>";
    $row .= "<br/><span style='$style'>";
  }
This occurs in the context of a SAX xml parser, using the xml_parser_create and related functions of PHP; this is a php version of the expat parser for xml. It is likely that there is an expat parser that could be used in a browser's Javascript code to do all this rendering in the browser.
Anyway, the php parser essentially does a tree-walk of the xml structure, and when it encounters a <div> element, it examines the n attribute value, and based on this value (2, 3 or ?) constructs a style element that introduces an extra 'left' indentation to the span element that contains the subsequent text of the <div>.
Then when the end of the <div> is encountered later in the tree walk, the closing </span> element is inserted into the html stream under construction.
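The same start-tag / end-tag idea can be tried out in Python, whose standard library also wraps expat. The sketch below is only an illustration of the technique, not a port of disp.php: the render function and the INDENT table are hypothetical, and every element other than div is simply dropped while its text is kept.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape

# n attribute -> 'left' offset, following the 1em / 2em choice described above
INDENT = {'2': '1em', '3': '2em'}

def render(xml_line):
    """Turn each <div> into <br/><span style=...> ... </span>; keep text only."""
    out = []

    def start(name, attrs):
        if name == 'div':
            left = INDENT.get(attrs.get('n'))
            style = 'position:relative; left:%s;' % left if left else ''
            out.append("<br/><span style='%s'>" % style)

    def end(name):
        if name == 'div':
            out.append('</span>')   # the closing tag tells us where the span ends

    def chars(data):
        out.append(escape(data))    # re-escape &, <, > for the html output

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_line, True)
    return ''.join(out)
```

The end handler only works because the div encloses its text, which is the same reason given above for not using an empty div element in the xml.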
In an earlier comment above, I suggested using an empty div element for the xml markup. However, I changed this to enclose the entire division, <div...>text of the div</div>, and a major reason was so that the rendering method above would know where to put that closing </span> element.
That's it. Not too hard, once all the context is understood.
In AP case, we could render €1A as <vlex>1A</vlex> in ap.xml.
Agree.
.{#AGf#}¦ €10P. or {%Caus.%} To pour down upon, sprinkle. should the 'Caus' be in scope of <vlex>?
It should not. Wouldn't even some general verbal tag do for Causative forms? Sure, it's better than nothing, but since Caus. occurs as an abbreviation in many dictionaries, it could be used to regex them out and give them what they deserve.
.{#akz#}¦ €15P. Probably should be at least 1,5P (with added comma)
Yeah, indeed, plenty of issues.
There was an annoying side-effect of the indentation, when viewed in the list-0.2.html display. Namely, part of the lines were hidden under the scrollbar.
This is no doubt due to the relative positioning technique used for indentation.
To improve this, a 'padding-right:15px' css rule was added to the list-0.2 display
(tech note: in file apidev/css/basic.css, at the #CologneBasic table.display rule).
This improves the situation for ap, and simply adds a little space at the right for other displays.
Everything now installed.
(in particular regarding the adjustment to the handling of [p=123][L=345] )
p and L tell nobody anything. I know them, but none of my students could grasp them without me explaining what they are.
There may be some hierarchy uncaptured. But at least we made a start. Slowly we will inch there too.
Yeah.
For the 'n=2' superscript, we indent by 1em and for the 'n=3' superscript, by 2em. The other kind of div is not indented.
I would add a that is not n=2. I would add a CSS. And in the CSS file I could change it, play around, and see if 2 is an appropriate choice.
To improve this, a 'padding-right:15px' css rule was added to the list-0.2 display
Makes sense, indeed. Let me write my proposals in a new thread.
p and L ...
Did you notice the change in AP display? Isn't this better?
Isn't this better?
Do not think so.
[record id=394] [scan-page 0030-2]
1) These are not obvious either. Is the record ID in the book, and where does one find it? What is the -2 in the number?
2) I would make it like 0030-2 - move the mouse over and see the title attribute I've used. We can have a longer explanation there. The only thing is that the ID should be made a link as well. At least a link to # can be made, so a fake one - no big issue. And I would add a CSS class with a brighter colour, not black, for these numbers.
As per me, if explained in a FAQ and in the tooltip for each link, it's ok to have:
[ID 394] [P 0030-2]
Do not think so.
Sorry to hear that. Will add your suggestion to todo list (currently 7 deep).
Sorry to hear that.
It's not about me. I understand, sure. But it's from what I've seen how @Shalu411 used it initially and people for whom English is not a mother language. The abbreviations are not obvious and need explanation. Even if they are longer, a commentary and intro is wanted.
<s> tag. See </s><lb/>.<s> in:
<s>divasasyAzwame BAge SAkaM pacati yo naraH </s><lb/>.<s> afRI cApravAsI ca sa vAricara modate ..</s>
This gives rise to the period being seen in the web display / stardict conversion too.
Verified from Web display to have period.