sanskrit-lexicon / csl-devanagari

Convert SLP1 data from csl-orig into Devanagari for easy proofreading.

Line markers for "body portion" #26

Closed Andhrabharati closed 2 years ago

Andhrabharati commented 3 years ago

Type-1: <> in 13 works

  • acc : 30653
  • ap90 : 167050
  • md : 61497
  • mw72 : 1187
  • pe : 828
  • pgn : 4292
  • pui : 55
  • shs : 53028
  • skd : 442533
  • snp : 154
  • vcp : 355786
  • vei : 1780
  • yat : 24834

Out of these, ap90 has two lines (58049, 147982) beginning with a space and yat has one line (88) not at the beginning.

Type-2 : <div n="lb"> in 8 works

  • ben : 76326
  • bhs : 69874
  • bop : 18705
  • gst : 27883
  • ieg : 8618
  • inm : 84332
  • krm : 24135
  • mci : 67357

Out of these, mci has one line (12049) beginning with a space.

Type-3 : both <> & <div n="lb"> in 5 works

  • gst : 345
  • ieg : 1135
  • inm : 2193
  • krm : 1540
  • mci : 2522

All these <> appear to be in front matter (preface etc.) and end matter (corrections/annexures etc.)

Type-4 : <div n="lb"/> in 9 works

  • ae : 42839
  • bor : 66228
  • mw72 : 216137
  • mwe : 88481
  • pe : 96209
  • pgn : 5731
  • pui : 23738
  • snp : 2119
  • vei : 35317

Type-5 : Miscellaneous in 11 works

  • armh : this is a spl. category work in Skt. verses, and is split at hemistichs of verses; no addl. marking made or necessary in this case.
  • bur : no relation to print or length; just split randomly after a word ends and without any line markers
  • cae : no relation to print or length; just split randomly after a word ends and without any line markers
  • ccs : appears to be split as per print, but without any line markers
  • gra : all single line entries even if the length is "big"; however, each of the different word endings (some are comp.) is split onto a separate line (even if the length is "big")
  • lan : appears to be split as per print, but without any line markers
  • mw : this is split in a style of its own (mostly at meaning senses and genders), different from all others
  • pw : this has no relation to print style or length; just split at gender and meaning indicators
  • pwg : this has no relation to print style or length; just split at the beginning of assumed <ls>...</ls> (many of which are wrong!)
  • sch : all single line entries even if the length is "big"
  • wil : has special markings at line beginnings

Could a common marking for all works (say <>), or no marking at all, be considered as a theme?
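Counts like those in the type listing can be reproduced with a short script. The following is only a sketch: the category names and the function `count_line_markers` are hypothetical, and it assumes the text of a dictionary file has already been read into a list of lines.

```python
import re
from collections import Counter

# Marker patterns taken from the type listing; the category names are
# hypothetical labels used only in this sketch.
PATTERNS = [
    ("div_lb_selfclosed", re.compile(r'<div n="lb"/>')),
    ("div_lb_open", re.compile(r'<div n="lb">')),
    ("empty_tag", re.compile(r"<>")),
]

def count_line_markers(lines):
    """Count lines containing each marker; report markers not at line start."""
    counts = Counter()
    not_at_start = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[name] += 1
                # A leading space or a mid-line marker both end up here.
                if match.start() != 0:
                    not_at_start.append((lineno, name))
                break  # count each line once
    return counts, not_at_start
```

The second return value lists the irregular lines (for example a marker preceded by a space), which are the cases singled out in the listing.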

Andhrabharati commented 3 years ago

The INM extracted file posted by @funderburkjim, on which I had worked earlier, uses <> throughout.
It can be seen here: https://github.com/sanskrit-lexicon/CORRECTIONS/issues/92#issue-63343712
And my workout is here: https://github.com/sanskrit-lexicon/CORRECTIONS/issues/92#issuecomment-909247765

How come @drdhaval2785 got the <div n="lb"> for me for the same INM?

This strengthens my case for making all works use <> line markers alone.

funderburkjim commented 2 years ago

all works with <> line markers

The INM file referenced by the here link in the first comment of Corrections #92 was generated in 2015.

My guess is that at some time in the interim, the <> markup of 2015 was changed by me to <div n="lb">.

It is a reasonable suggestion that the line-break markup should be consistent across dictionaries. Whether this consistent form should be <> or <div n="lb"> or something else is a separate question.

Currently, the program (make_xml.py in csl-pywork/v02) which makes xxx.xml from xxx.txt changes the line-break markup of xxx.txt to some valid xml form. If we globally change the markup in xxx.txt, then we would need to be sure that make_xml.py (after possible modification) handles the changed markup properly. (Note that type 4 <div n="lb"/> is proper xml, and the make_xml.py code probably changes <> to <div n="lb"/> and <div n="lb"> to <div n="lb"/>.)
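The conversion guessed at in the note above can be illustrated as follows. This is a minimal sketch of the idea, not the actual make_xml.py code; the name `normalize_line_breaks` is hypothetical.

```python
import re

# Matches "<>" or the unclosed '<div n="lb">'. The already self-closed form
# '<div n="lb"/>' does not match, because "/" follows the closing quote.
LINE_BREAK = re.compile(r'<>|<div n="lb">')

def normalize_line_breaks(text):
    """Rewrite both non-xml line-break forms to the self-closed xml form."""
    return LINE_BREAK.sub('<div n="lb"/>', text)
```

With such a step in place, all three markup types would reach the xml file in the same form, whichever one xxx.txt uses.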

One small advantage of <> is that it is shorter, less obtrusive. As long as a given digitization is maintained with line-break consistency with the printed text, then some line-beginning markup should also be maintained.

The type listing in the first comment above is useful. But the listing shows 13 of type 1, not 18. Changing to <> would require changing 17 dictionaries (Type 2 and 4). For each dictionary we would need to:

  1. change xxx.txt in csl-orig
  2. make appropriate change to make_xml.py
  3. regenerate the dictionary (csl-pywork/v02) locally
    • validate xml
    • visually validate correctness of display (comparing Cologne to local installation)
  4. modify xxx-meta2.txt to indicate this facet of markup in xxx.txt.
  5. Then do the needful pushes and pulls to get the change installed at Cologne server.
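For the "validate xml" step in the list above, a first check could be a plain well-formedness test with Python's standard library. This is only a sketch under that assumption: it does not validate against a DTD, and the name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""
```

The check also shows why the raw markers cannot be copied into xxx.xml unchanged: a fragment containing <> or an unclosed <div n="lb"> fails to parse, while <div n="lb"/> passes.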

We would have to be careful to avoid conflict with other changes being made to the dictionaries.

funderburkjim commented 2 years ago

What do @drdhaval2785 and @gasyoun think? Should we make <> the standard?

Andhrabharati commented 2 years ago

But the listing shows 13 of type 1, not 18.

The other 5 are in Type-3.

funderburkjim commented 2 years ago

Upon further reflection, I think we should also consider the option of having NO markup for line breaks in those dictionaries where the digitization was made to observe line breaks (AB also suggests this option to be considered).

Without <> or other line break markup:

With <> or other line break markup:

drdhaval2785 commented 2 years ago

I completely agree that we can do away with <> tags without any information loss.

gasyoun commented 2 years ago

It is a reasonable suggestion that the line-break markup should be consistent across dictionaries.

Totally.

is proper xml

And that is why it makes more sense. But agree with Dhaval it's easier to read the code without anything. So I agree with whatever will be decided.

funderburkjim commented 2 years ago

This comment (from @drdhaval2785 ref) seems relevant to this issue:

I feel that there are some positive sides of keeping line breaks. They are
1. The lines are human readable in majority of text editors and also on command line.
2. Texts which have relatively human readable sized lines are better amenable to git processes. If the line is too long, locating the actual change between two commits is a pain.
3. It makes physical comparison with the printed text easier.

funderburkjim commented 2 years ago

I completely agree that we can do away with <> tags without any information loss.

@drdhaval2785 So, as I understand it, you would

Right?

drdhaval2785 commented 2 years ago

Yes. I agree with your paraphrase.

Andhrabharati commented 2 years ago

Good to see this happening in one work (the INM) now, which incidentally is the one with which I had started this issue.

@funderburkjim Would you pl. do the same in all the works in a batch at once, so that this issue could be closed?

funderburkjim commented 2 years ago

@Andhrabharati Do we already have a list of those works which need to have line break markup removed? If not, would you provide such a list?

Andhrabharati commented 2 years ago

@funderburkjim

Pl. go to the top of this issue (https://github.com/sanskrit-lexicon/csl-devanagari/issues/26#issue-987903322), and you'd get all the info.

funderburkjim commented 2 years ago

Beginning the conversion

Aim to go through the list and remove line-break markers. Also will make corresponding changes to make_xml.py and (if needed) to basicadjust.php or basicdisplay.php -- Goal is to have no change in display details after removal. Will direct commit messages to this issue, by which progress may be followed.
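The removal step itself could look roughly like the sketch below. It is an illustration only, not the code used for the conversion: the name `strip_line_markers` is hypothetical, and the choice to also drop whitespace in front of a leading marker is an assumption.

```python
import re

# A marker at the start of a line, optionally preceded by whitespace.
# Covers <>, <div n="lb"> and <div n="lb"/>.
LEADING_MARKER = re.compile(r'^\s*(?:<>|<div n="lb"/?>)')

def strip_line_markers(lines):
    """Remove a leading line-break marker from each line, if present."""
    return [LEADING_MARKER.sub("", line, count=1) for line in lines]
```

Markers that occur in the middle of a line are deliberately left alone, so that such irregular cases can be reviewed by hand.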

funderburkjim commented 2 years ago

Conversion Progress: line-break markup removed from acc, ap90, md, mw72, pe, pgn, pui and make_xml.py correspondingly revised where necessary.

funderburkjim commented 2 years ago

Conversion Progress: line-break markup removed from shs, skd, snp, vcp, vei, yat and make_xml.py correspondingly revised.

funderburkjim commented 2 years ago

Conversion progress: Through the type-2 dictionaries: ben, bhs, bop, gst, ieg, inm, krm, mci

funderburkjim commented 2 years ago

The type-3 dictionaries have been handled. Some of the type-4 dictionaries remain to be converted.

funderburkjim commented 2 years ago

ae converted. Also corrected markup to expose a few headwords. See commit 4c74ae2.

funderburkjim commented 2 years ago

The other two English-Sanskrit dictionaries (bor, mwe) converted.

The remaining type-4 dictionaries have been previously handled.

I think this completes all the conversions.

funderburkjim commented 2 years ago

The type-5 are mostly the 'early' dictionaries (see https://github.com/sanskrit-lexicon/COLOGNE/issues/385).

Line-marker conversion is not relevant for these.

@Andhrabharati --- I think this issue may be closed.

Andhrabharati commented 2 years ago

Yes, @funderburkjim, checked that they are all done.

But you may consider removing the <> in the gst_advertisement.txt also, even if it is not being used.

And would you close this issue?

Andhrabharati commented 2 years ago

And a BIG THANKS to you, for doing this piece of work; now one can do a "free" (unobtrusive) manual reading of the files, if & when required.

Andhrabharati commented 2 years ago

My next request to you, whenever you feel like taking it up, is to work on hyphen-ending lines in all the texts; you had done this in a few works recently.

[As you had rightly mentioned elsewhere, this would ease the ls and other markings.]
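As a starting point for such work, the hyphen-ending lines can simply be listed for review. This is a hedged sketch: the name `hyphen_ending_lines` is hypothetical, and it does not try to decide whether a hyphen marks a word broken across lines or a genuine compound.

```python
def hyphen_ending_lines(lines):
    """Return (line number, line) pairs for lines that end with a hyphen."""
    return [
        (lineno, line)
        for lineno, line in enumerate(lines, start=1)
        # Trailing whitespace and newlines are ignored for the test.
        if line.rstrip().endswith("-")
    ]
```

Whether each listed line should be joined with the following one remains a manual decision.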

funderburkjim commented 2 years ago

Removed the <> from gst advertisement.

@Andhrabharati thanks for preparing the lists of 'types' above -- a big help in organizing the task solution.

gasyoun commented 2 years ago

A major change.

Andhrabharati commented 2 years ago

Removed the <> from gst advertisement.

@funderburkjim Seen that there are two {??} places in this gst advertisement, which skipped my attention earlier. [All other {??} incidences in the CDSL texts were filled up sometime back.]

This gst advertisement text may be corrected as below-

<HI>BHOTANTA DICTIONARY AND GRAMMAR. Dictionary of the Bhotanta or Boutan Language; printed from a Ms. copy, made by the late R{??} Q. C. G. Schroeter, edit. by J. Marshman. To which is pre- {??} Grammar of the Bhotanta language by Schroeter, edited by W. Carey. 4to. Serampore 1826.

to

<HI>BHOTANTA DICTIONARY AND GRAMMAR. Dictionary of the Bhotanta or Boutan Language; printed from a Ms. copy, made by the late Rev. F. C. G. Schroeter, edit. by J. Marshman. To which is pre- pended a Grammar of the Bhotanta language by Schroeter, edited by W. Carey. 4to. Serampore 1826.