Open drdhaval2785 opened 4 years ago
m. (-मः), f. (-मा), n. (-मं).
You think end of one is beginning of another?
For reference, here are the pieces that would be involved in adding the markup suggested above.
<div n="X"> in xxx.xml is converted to html in the displays. The interpretation depends on the dictionary. Search for wil to see how div markup is interpreted in Wilson dictionary.
<L>32263<pc>704<k1>rAma<k2>rAma
{#rAma#}¦ mfn. ({#-maH-mA-maM#})
.²1 Black.
.²2 White.
.²3 Beautiful, pleasing. m. ({#-maH#})
.²1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the son of
the {%Muni%} JAMADAGNI, born at the commencement of the second or {%Tretā%}
{%Yug,%} for the purpose of punishing the tyrannical kings of the {%Kṣatriya%}
race; RĀMACANDRA, the son of DAŚARATHA, king of {%Oude,%} born at the close
of the second age, to destroy the demons who infested the earth, and especially
RĀVAṆA the {%Daitya%} sovereign of {%Ceylon;%} and BALARĀMA, the elder and
half-brother of KṚṢṆA, the son of ROHIṆĪ, born at the end of the
{%Dvāpara%} or third age.
.²2 A name of VARUṆA, regent of the waters.
.²3 A horse.
.²4 A sort of deer. f. ({#-mA#})
.²1 A woman, a female, a pleasing or beautiful female.
.²2 Asafoetida.
.²3 A river. n. ({#-maM#})
.²1 A potherb, (Chenopodium album.)
.²2 A sort of Costus, (C. speciosus.)
.E. {#rama#} to sport, aff. {#GaY.#}
Note that currently, there are no div in the digitization.
The record for rAma is one long single-line text string in wil.xml. For readability here, I've manually inserted linebreaks.
<H1><h><key1>rAma</key1><key2>rAma</key2></h><body>
<s>rAma</s> mfn. (<s>-maH-mA-maM</s>)
<div n="1">1 Black. </div>
<div n="1">2 White. </div>
<div n="1">3 Beautiful, pleasing. m. (<s>-maH</s>) </div>
<div n="1">1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the son of the
<i>Muni</i> JAMADAGNI, born at the commencement of the second or <i>Tretā</i> <i>Yug,</i> for the purpose of punishing the tyrannical kings of the <i>Kṣatriya</i> race;
RĀMACANDRA, the son of DAŚARATHA, king of <i>Oude,</i> born at the close of the second age, to destroy the demons who infested the earth, and especially
RĀVAṆA the <i>Daitya</i> sovereign of <i>Ceylon;</i> and
BALARĀMA, the elder and half-brother of KṚṢṆA, the son of ROHIṆĪ, born at the end of the
<i>Dvāpara</i> or third age. </div>
<div n="1">2 A name of VARUṆA, regent of the waters. </div>
<div n="1">3 A horse. </div>
<div n="1">4 A sort of deer. f. (<s>-mA</s>) </div>
<div n="1">1 A woman, a female, a pleasing or beautiful female. </div>
<div n="1">2 Asafoetida. </div>
<div n="1">3 A river. n. (<s>-maM</s>) </div>
<div n="1">1 A potherb, (Chenopodium album.) </div>
<div n="1">2 A sort of Costus, (C. speciosus.) </div>
<div n="E">E. <s>rama</s> to sport, aff. <s>GaY.</s> </div>
</body><tail><L>32263</L><pc>704</pc></tail></H1>
make_xml.py converts wil.txt to wil.xml.
Part of this conversion is specific to the way wil.txt is coded and is in the dig_to_xml_specific function.
def dig_to_xml_specific(x):
    """ changes particular to wil digitization
     x is a line of the digitization
    """
    if x.startswith('<H>'):
        # Start of section beginning with a particular letter. Drop this line
        x = ''
    elif re.search(u'^[.]²[0-9]+',x):
        # a division coded by Thomas
        # drop the initial '.²'
        # and start <div n="1">
        x = '<div n="1">' + x[2:]
    elif re.search(r'^[.]E[.]',x):
        # an Etymology division
        # drop the initial '.'
        # and start <div n="E">
        x = '<div n="E">' + x[1:]
    elif re.search(r'^[.]',x):
        # unknown division
        print "UNKNOWN DIVISION: ",x.encode('utf-8')
        x = " " + x
    else:
        # assume a simple continuation line
        x = " " + x
    # In a currently small number of cases (as with root 'RI'), sub-meanings
    # are coded with superscript letters, as '^a'. We'll code these as
    # <div n="2">
    x = re.sub(r'[\^]','<div n="2">',x)
    return x
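To see what this does to the rAma record, here is a small Python 3 restatement that can be run on its own. The name dig_to_xml_specific_py3 is hypothetical (it is not in make_xml.py), the print call replaces the Python 2 print statement, and I am assuming the final re.sub applies to every line rather than only to continuation lines.

```python
import re

def dig_to_xml_specific_py3(x):
    """Python 3 restatement of dig_to_xml_specific (hypothetical name)."""
    if x.startswith('<H>'):
        # section header line: dropped
        x = ''
    elif re.search(r'^[.]²[0-9]+', x):
        # numbered division: drop '.²' and open <div n="1">
        x = '<div n="1">' + x[2:]
    elif re.search(r'^[.]E[.]', x):
        # etymology division: drop '.' and open <div n="E">
        x = '<div n="E">' + x[1:]
    elif re.search(r'^[.]', x):
        print("UNKNOWN DIVISION: ", x)
        x = " " + x
    else:
        # continuation line
        x = " " + x
    # assumption: sub-meaning markers '^' are converted on every line
    x = re.sub(r'[\^]', '<div n="2">', x)
    return x

sample = [
    '.²3 Beautiful, pleasing. m. ({#-maH#})',
    'the {%Muni%} JAMADAGNI, born at the commencement',
    '.E. {#rama#} to sport, aff. {#GaY.#}',
]
converted = [dig_to_xml_specific_py3(line) for line in sample]
for line in converted:
    print(line)
```

The first sample line shows the problem under discussion: the gender marker m. ({#-maH#}) stays inside the numbered div that precedes it.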
Note: The above only inserts the opening div tag, based on a regex. For example, .²3 in dig.txt becomes <div n="1">3.
This div markup is not an empty tag, so it requires a closing </div> tag. This closing tag is inserted at the proper spot by the close_divs function of make_xml.py. This close_divs function is fairly general, not specific to the wil dictionary.
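The actual close_divs code is not shown in this thread. As a rough illustration of the idea only, the sketch below closes an open div just before the next div opens, or at the end of the record. The function name close_divs_sketch is hypothetical, and the sketch ignores nested divs such as <div n="2">, which the real function presumably has to handle.

```python
def close_divs_sketch(lines):
    """Append ' </div>' to the last line of each top-level div.

    lines: one record's lines after opening <div ...> tags were inserted.
    Assumption: a div starts on a line beginning with '<div ' and runs
    until the next such line or the end of the record.
    """
    out = []
    open_div = False
    for line in lines:
        if line.startswith('<div '):
            if open_div:
                # close the previous div on the line that ended it
                out[-1] += ' </div>'
            open_div = True
        out.append(line)
    if open_div and out:
        out[-1] += ' </div>'
    return out

record = [
    '<s>rAma</s> mfn. (<s>-maH-mA-maM</s>)',
    '<div n="1">1 A name common to three incarnations',
    ' the son of the <i>Muni</i> JAMADAGNI',
    '<div n="1">2 A horse.',
]
closed = close_divs_sketch(record)
for line in closed:
    print(line)
```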
A solution would require that a new div type (say, <div n="3">) be inserted at the appropriate spots in wil.xml; the closing div would also need to be inserted at the appropriate spot. It would also require that 3 is recognized as valid for wil.xml.
The markup will be added by some program. The obvious choice of program to add markup, given the above description, would be make_xml.py.
HOWEVER,
I think it actually would be better to add all the div markup to wil.txt.
Reason: Where to put the divs pertaining to gender will be tricky. For the numerical subdivisions, Thomas had already done the markup (for example, .²1 Black.), so all make_xml had to do was convert this to some xml form. Also, the fact that Thomas has already put all the text relating to a given div on one line makes the div-closing problem easy.
But in the case of gender divs, this is not so clear in the existing digitization. Perhaps most cases can be handled by simple regex-governed changes; but there will almost surely be special cases that will need to be handled by 'manual corrections'. The div-closing requirement will probably also need some manual corrections.
Thus, I would vote for doing a special update of wil.txt in order to implement the improvement
suggested above.
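A minimal sketch of what a 'simple regex governed change' for the easy cases might look like, written against wil.txt lines. Everything named here is an assumption for illustration: the names GENDER_TAIL and split_gender, the choice to emit the gender text as <div n="3">...</div>, and the rule that only a gender marker at the very end of a numbered division line is split off. Lines that do not match are returned unchanged and would be candidates for manual correction.

```python
import re

# Trailing gender marker such as: m. ({#-maH#})  or  f. ({#-mA#})
GENDER_TAIL = re.compile(
    r'^(?P<head>.*\S)\s+(?P<gender>(?:mfn|m|f|n)\. \(\{#[^#]*#\}\))\s*$'
)

def split_gender(line):
    """Split a trailing gender marker off a numbered division line.

    Returns a list of one line (no change) or two lines
    (division text, then a hypothetical <div n="3"> gender line).
    """
    if not line.startswith('.²'):
        # headword and continuation lines are left alone in this sketch
        return [line]
    m = GENDER_TAIL.match(line)
    if m is None:
        return [line]
    return [m.group('head'), '<div n="3">' + m.group('gender') + '</div>']

sample = [
    '{#rAma#}¦ mfn. ({#-maH-mA-maM#})',
    '.²2 White.',
    '.²3 Beautiful, pleasing. m. ({#-maH#})',
    '.²4 A sort of deer. f. ({#-mA#})',
]
result = []
for line in sample:
    result.extend(split_gender(line))
for line in result:
    print(line)
```

Splitting at the line level keeps the existing .² markers untouched, so the numbered divisions would still go through make_xml.py as before.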
This special update might be viewed as a 4-step process:
Version 1: just add the divs as in make_xml.py
{#rAma#}¦ mfn. ({#-maH-mA-maM#})
<div n="1">1 Black.</div>
<div n="1">2 White.</div>
<div n="1">3 Beautiful, pleasing. m. ({#-maH#})</div>
<div n="1">1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the ... </div>
etc.
Version 2: also add the markup as in the 'dig_to_xml_general' part of make_xml.py. If that were done, wil1.txt would look like:
<s>rAma</s>¦ mfn. (<s>-maH-mA-maM</s>)
<div n="1">1 Black.</div>
<div n="1">2 White.</div>
<div n="1">3 Beautiful, pleasing. m. (<s>-maH</s>)</div>
<div n="1">1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the ... </div>
etc.
Put final result back into main update regimen
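For the Version 2 step, the two substitutions visible in the rAma example ({#...#} becoming <s>...</s> and {%...%} becoming <i>...</i>) could be sketched as below. This covers only those two patterns as seen in the samples in this thread; the real dig_to_xml_general may do more, and the function name general_markup_sketch is hypothetical.

```python
import re

def general_markup_sketch(x):
    """Convert the two inline codings seen in the rAma record.

    {#...#} (Sanskrit in SLP1)  -> <s>...</s>
    {%...%} (italic text)       -> <i>...</i>
    """
    x = re.sub(r'\{#(.*?)#\}', r'<s>\1</s>', x)
    x = re.sub(r'\{%(.*?)%\}', r'<i>\1</i>', x)
    return x

headword = general_markup_sketch('{#rAma#}¦ mfn. ({#-maH-mA-maM#})')
italic = general_markup_sketch('the {%Muni%} JAMADAGNI')
print(headword)
print(italic)
```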
A lot of things to do, oh boy.
We are still missing. m. (<s>-maH</s>) needs to be a separate div.
We are still missing ... div
My note above shows the approach I think is needed to do this. But, as Marcis noted, it's not a simple task.
I'm not volunteering to do this, although I agree adding the gender markup would be an enhancement to Wilson dictionary, and indeed to many other dictionaries.
I'm not volunteering to do this
Good to know that. Otherwise nothing would be left for the generations to come.
It is a very simple task, and we had the WIL done in that way back from 2016, when we added the Skt. Dictionaries at andhrabharati.com
It did not take even hours, just a couple of minutes of work for us!
Problem
As the masculine, feminine, neuter etc. are not marked with a div marking, they are merged with the previous line in the display. In the following entry see m. (-मः), f. (-मा), n. (-मं). They should ideally be on the next line, ideally with some kind of div marking. As this is a major correction, noted here.
Sample