sanskrit-lexicon / csl-corrections

Replacement for sanskrit-lexicon/CORRECTIONS. User corrections to sanskrit-lexicon/csl-orig
GNU General Public License v3.0

WIL separating genders #10

Open drdhaval2785 opened 4 years ago

drdhaval2785 commented 4 years ago

Problem

As the masculine, feminine, neuter, etc. are not marked with a div, they are merged with the previous line in the display. In the following entry, see m. (-मः), f. (-मा), n. (-मं). Ideally they should start on the next line, with some kind of div marking.

As this is a major correction, noted here.

Sample

राम mfn. (-मः-मा-मं)
1 Black.
2 White.
3 Beautiful, pleasing. m. (-मः)
1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the son of the Muni JAMADAGNI, born at the commencement of the second or Tretā Yug, for the purpose of punishing the tyrannical kings of the Kṣatriya race; RĀMACANDRA, the son of DAŚARATHA, king of Oude, born at the close of the second age, to destroy the demons who infested the earth, and especially RĀVAṆA the Daitya sovereign of Ceylon; and BALARĀMA, the elder and half-brother of KṚṢṆA, the son of ROHIṆĪ, born at the end of the Dvāpara or third age.
2 A name of VARUṆA, regent of the waters.
3 A horse.
4 A sort of deer. f. (-मा)
1 A woman, a female, a pleasing or beautiful female.
2 Asafoetida.
3 A river. n. (-मं)
1 A potherb, (Chenopodium album.)
2 A sort of Costus, (C. speciosus.)
E. रम to sport, aff. घञ्।
gasyoun commented 4 years ago

m. (-मः), f. (-मा), n. (-मं).

You think end of one is beginning of another?

funderburkjim commented 4 years ago

For reference, here are the pieces that would be involved in adding the markup suggested above.

funderburkjim commented 4 years ago

rAma in wil.txt

<L>32263<pc>704<k1>rAma<k2>rAma
{#rAma#}¦ mfn. ({#-maH-mA-maM#})
.²1 Black.
.²2 White.
.²3 Beautiful, pleasing. m. ({#-maH#})
.²1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the son of
the {%Muni%} JAMADAGNI, born at the commencement of the second or {%Tretā%} 
{%Yug,%} for the purpose of punishing the tyrannical kings of the {%Kṣatriya%}
race; RĀMACANDRA, the son of DAŚARATHA, king of {%Oude,%} born at the close
of the second age, to destroy the demons who infested the earth, and especially
RĀVAṆA the {%Daitya%} sovereign of {%Ceylon;%} and BALARĀMA, the elder and
half-brother of KṚṢṆA, the son of ROHIṆĪ, born at the end of the
{%Dvāpara%} or third age.
.²2 A name of VARUṆA, regent of the waters.
.²3 A horse.
.²4 A sort of deer. f. ({#-mA#})
.²1 A woman, a female, a pleasing or beautiful female.
.²2 Asafoetida.
.²3 A river. n. ({#-maM#})
.²1 A potherb, (Chenopodium album.)
.²2 A sort of Costus, (C. speciosus.)
.E. {#rama#} to sport, aff. {#GaY.#}

Note that currently there are no divs in the digitization.

funderburkjim commented 4 years ago

rAma in wil.xml

The record for rAma is one long single-line text string in wil.xml. For readability here, I've manually inserted linebreaks.

<H1><h><key1>rAma</key1><key2>rAma</key2></h><body> 
<s>rAma</s>  mfn. (<s>-maH-mA-maM</s>) 
<div n="1">1 Black. </div>
<div n="1">2 White. </div>
<div n="1">3 Beautiful, pleasing. m. (<s>-maH</s>) </div>
<div n="1">1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the son of  the 
<i>Muni</i> JAMADAGNI, born at the commencement of the second or <i>Tretā</i>   <i>Yug,</i> for the purpose of punishing the tyrannical kings of the <i>Kṣatriya</i>  race; 
RĀMACANDRA, the son of DAŚARATHA, king of <i>Oude,</i> born at the close  of the second age, to destroy the demons who infested the earth, and especially  
RĀVAṆA the <i>Daitya</i> sovereign of <i>Ceylon;</i> and 
BALARĀMA, the elder and  half-brother of KṚṢṆA, the son of ROHIṆĪ, born at the end of the  
<i>Dvāpara</i> or third age. </div>
<div n="1">2 A name of VARUṆA, regent of the waters. </div>
<div n="1">3 A horse. </div>
<div n="1">4 A sort of deer. f. (<s>-mA</s>) </div>
<div n="1">1 A woman, a female, a pleasing or beautiful female. </div>
<div n="1">2 Asafoetida. </div>
<div n="1">3 A river. n. (<s>-maM</s>) </div>
<div n="1">1 A potherb, (Chenopodium album.) </div>
<div n="1">2 A sort of Costus, (C. speciosus.) </div>
<div n="E">E. <s>rama</s> to sport, aff. <s>GaY.</s>  </div>
</body><tail><L>32263</L><pc>704</pc></tail></H1>
funderburkjim commented 4 years ago

make_xml.py

make_xml.py converts wil.txt to wil.xml. Part of this conversion is specific to the way the wil.txt is coded and is in the dig_to_xml_specific function.

import re

def dig_to_xml_specific(x):
 """ changes particular to wil digitization
     x is a line of the digitization
 """
 if x.startswith('<H>'):
  # Start of section beginning with a particular letter. Drop this line
  x = ''
 elif re.search(u'^[.]²[0-9]+',x):
  # a division coded by Thomas
  # drop the initial '.²'
  # and start <div n="1">
  x = '<div n="1">' + x[2:]
 elif re.search(r'^[.]E[.]',x):
  # an Etymology division 
  # drop the initial '.'
  # and start <div n="E">
  x = '<div n="E">' + x[1:]
 elif re.search(r'^[.]',x):
  # unknown division: report it and keep the line as continuation text
  print("UNKNOWN DIVISION: " + x)
  x =  " " + x
 else:
  # assume a simple continuation line
  x = " " + x
 # In a currently small number of cases (as with root 'RI'), sub-meanings
 # are coded with superscript letters, as '^a'. We'll code these as
 # <div n="2">
 x = re.sub(r'[\^]','<div n="2">',x)
 return x

Note: The above only inserts the opening div tag, based on a regex. For example, .²3 in dig.txt becomes <div n="1">3. This div is not an empty tag, so it requires a closing </div> tag, which the close_divs function of make_xml.py inserts at the proper spot. The close_divs function is fairly general, not specific to the wil dictionary.
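To illustrate the div-closing step, here is a minimal sketch. It is not the real close_divs from make_xml.py, whose implementation is not shown in this thread; the name close_divs_sketch is hypothetical, and the sketch assumes every div opens at the start of a line and that divs never nest (so it ignores the <div n="2"> sub-meanings).

```python
def close_divs_sketch(lines):
    """Close each line-initial <div ...> just before the next one opens,
    or at the end of the record.

    Hypothetical helper, NOT the close_divs function of make_xml.py.
    Assumes divs open only at the start of a line and never nest.
    """
    out = []
    open_div = False
    for line in lines:
        if line.startswith('<div'):
            if open_div:
                # the previous div ends where this one begins
                out[-1] += '</div>'
            open_div = True
        out.append(line)
    if open_div:
        # close the last div of the record
        out[-1] += '</div>'
    return out
```

Continuation lines stay inside the div that precedes them, which matches the long first sense of rAma in the wil.xml sample above.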

funderburkjim commented 4 years ago

approaches to a solution.

A solution would require that

Where to add markup?

The markup will be added by some program. Given the above description, the obvious choice of program would be make_xml.py. HOWEVER, I think it would actually be better to add all the div markup to wil.txt. Reason: deciding where to put the divs pertaining to gender will be tricky. Thomas had already done the numerical subdivision markup (.²1 Black.), so all make_xml had to do was convert this to some xml form. Also, the fact that Thomas has already put all the text relating to a given div on one line makes the div-closing problem easy.

But in the case of gender divs, this is not so clear in the existing digitization. Perhaps most cases can be handled by simple regex-governed changes, but there will almost surely be special cases that need to be handled by 'manual corrections'. The div-closing requirement will probably need some manual corrections as well.
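As a rough illustration of the "simple regex" case, the sketch below moves a gender marker that trails a numbered sense, such as m. ({#-maH#}), onto its own line. Everything here is an assumption for illustration: the function name, the list of gender abbreviations, and especially the <div n="g"> marker, which is a hypothetical label and not existing markup. It only looks at lines starting with .², so the headword line is left alone.

```python
import re

# Gender abbreviation plus a parenthesised {#...#} ending, at the very end
# of the line, e.g. "m. ({#-maH#})". The abbreviation list is an assumption.
GENDER_TAIL = re.compile(
    r'^(?P<body>.*\S)\s+'
    r'(?P<gender>(?:mfn|mf|mn|fn|m|f|n)\.\s*\(\{#[^#]*#\}\))\s*$'
)

def split_gender_tail(line):
    """Return [line] unchanged, or [sense, gender_div] when a numbered
    sense line ends with a gender marker.

    The '<div n="g">' markup is hypothetical, chosen only for this sketch.
    """
    if not line.startswith('.²'):
        return [line]
    m = GENDER_TAIL.match(line)
    if m is None:
        return [line]
    return [m.group('body'), '<div n="g">' + m.group('gender') + '</div>']
```

For example, split_gender_tail('.²4 A sort of deer. f. ({#-mA#})') yields the sense line followed by a separate gender line. Markers that sit mid-line or on a continuation line are not handled; those are the cases likely to need manual correction.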

make changes in wil.txt

Thus, I would vote for doing a special update of wil.txt in order to implement the improvement suggested above.
This special update might be viewed as a 4-step process:

  1. wil1.txt: do the div markup consistent with the current version of make_xml.py
    • make modifications to make_xml.py so it can handle wil1.txt and generate a new wil.xml. At this point, the newly generated wil.xml should be exactly identical to the previous wil.xml.
  2. wil2.txt: add as much as possible of the gender-div markup to wil1.txt
    • frequently remake new version of wil.xml, based on wil2.txt, and be sure it is valid xml.
    • also make adjustments to basicdisplay.php to be sure the displays with new divs look as desired.
    • Do all the changes to make_xml and basicdisplay outside the normal wilson update process.
  3. Handle special cases by some 'manualByLine' changes.
    • Again, test test test
  4. Put final result back into main update regimen:
    • revised version of wil.txt in csl-orig
    • revised version of make_xml.py in csl-pywork
    • revised version of basicDisplay.php in csl-websanlexicon.
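
The two checks in steps 1 and 2 (output identical to the previous wil.xml, and output still valid xml) could be sketched roughly as follows. The function names are hypothetical, and the sketch assumes each record is a single line starting with <H1>, as in the rAma sample above. It checks well-formedness only, not validity against a DTD.

```python
import filecmp
import xml.etree.ElementTree as ET

def same_output(old_xml_path, new_xml_path):
    """True if the two files have identical contents (step 1 check)."""
    return filecmp.cmp(old_xml_path, new_xml_path, shallow=False)

def malformed_records(xml_lines):
    """Return 1-based line numbers of <H1> records that do not parse.

    Assumes one record per line, each starting with '<H1>'; other lines
    (xml declaration, root element tags) are skipped.
    """
    bad = []
    for lineno, line in enumerate(xml_lines, start=1):
        if not line.startswith('<H1>'):
            continue
        try:
            ET.fromstring(line)
        except ET.ParseError:
            bad.append(lineno)
    return bad
```

Running malformed_records over the lines of each regenerated wil.xml would flag records where a gender div was opened but never closed.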
funderburkjim commented 4 years ago

What wil1.txt might look like

Version 1: just add the divs as in make_xml.py

    {#rAma#}¦ mfn. ({#-maH-mA-maM#})
   <div n="1">1 Black.</div>
   <div n="1">2 White.</div>
   <div n="1">3 Beautiful, pleasing. m. ({#-maH#})</div>
   <div n="1">1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the ... </div>
   etc.

Version 2: also add the markup as in the 'dig_to_xml_general' part of make_xml.py. If that were done, wil1.txt would look like:

    <s>rAma</s>¦ mfn. (<s>-maH-mA-maM</s>)
   <div n="1">1 Black.</div>
   <div n="1">2 White.</div>
   <div n="1">3 Beautiful, pleasing. m. (<s>-maH</s>)</div>
   <div n="1">1 A name common to three incarnations of VIṢṆU, or PARAŚURĀMA, the ... </div>
   etc.
gasyoun commented 4 years ago

Put final result back into main update regimen

A lot of things to do, oh boy.

drdhaval2785 commented 4 years ago

We are still missing.

m. (<s>-maH</s>) needs to be a separate div.

funderburkjim commented 4 years ago

We are still missing ... div

My note above shows the approach I think is needed to do this. But, as Marcis noted, it's not a simple task.
I'm not volunteering to do this, although I agree adding the gender markup would be an enhancement to Wilson dictionary, and indeed to many other dictionaries.

gasyoun commented 4 years ago

I'm not volunteering to do this

Good to know that. Otherwise nothing would be left for the generations to come.

Andhrabharati commented 1 month ago

Problem

As the masculine, feminine, neuter, etc. are not marked with a div, they are merged with the previous line in the display. In the following entry, see m. (-मः), f. (-मा), n. (-मं). Ideally they should start on the next line, with some kind of div marking.

As this is a major correction, noted here.

We are still missing.

m. (<s>-maH</s>) needs to be a separate div.

We are still missing ... div

My note above shows the approach I think is needed to do this. But, as Marcis noted, it's not a simple task.

I'm not volunteering to do this

Good to know that. Otherwise nothing would be left for the generations to come.

It is a very simple task, and we have had WIL done that way since 2016, when we added the Skt. dictionaries at andhrabharati.com.

image

It did not take even hours, just a couple of minutes of work for us!