Closed Andhrabharati closed 2 years ago
The INM extracted file posted by @funderburkjim on which I had worked earlier is fully with <>. It can be seen here- https://github.com/sanskrit-lexicon/CORRECTIONS/issues/92#issue-63343712 And my workout here- https://github.com/sanskrit-lexicon/CORRECTIONS/issues/92#issuecomment-909247765
How come @drdhaval2785 got the <div n="lb"> for me for the same INM?
This puts me in a strong position in reiterating to make all works with <> line markers alone.
The INM file referenced by the "here" link in the first comment of Corrections #92 was generated in 2015.
My guess is that at some time in the interim, the <> markup of 2015 was changed by me to <div n="lb">.
It is a reasonable suggestion that the line-break markup should be consistent across dictionaries.
Whether this consistent form should be <> or <div n="lb"> or something else is a separate question.
Currently, the program (make_xml.py in csl-pywork/v02) which makes xxx.xml from xxx.txt changes the line-break markup of xxx.txt to some valid xml form; and if we globally change the markup in xxx.txt, then we would need to be sure make_xml.py (after possible modification) handles the changed markup properly. (Note that type 4 <div n="lb"/> is proper xml, and probably the make_xml.py code changes <> to <div n="lb"/> and <div n="lb"> to <div n="lb"/>.)
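As a rough illustration of the conversion described above, here is a minimal sketch. It is not the actual make_xml.py code; the names `LB_MARKUP`, `XML_LB` and `normalize_line_breaks` are hypothetical, and the only assumption is that the three marker forms mentioned in this thread are the ones to be handled.

```python
import re

# Hypothetical helper, NOT taken from make_xml.py.
# Matches the three line-break marker forms discussed in this issue:
#   <>   <div n="lb">   <div n="lb"/>
LB_MARKUP = re.compile(r'<>|<div n="lb"\s*/?>')

# The self-closing form, which is well-formed XML.
XML_LB = '<div n="lb"/>'


def normalize_line_breaks(line: str) -> str:
    """Rewrite every line-break marker in `line` to the self-closing form."""
    return LB_MARKUP.sub(XML_LB, line)
```

With something like this, a line of xxx.txt gives the same xml regardless of which marker form it uses, which is the property we would need to preserve if the markup in xxx.txt is changed.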
One small advantage of <> is that it is shorter, less obtrusive. As long as a given digitization is maintained with line-break consistency with the printed text, then some line-beginning markup should also be maintained.
The type listing in the first comment above is useful. But the listing shows 13 of type 1, not 18.
Changing to <> would require changing 17 dictionaries (Types 2 and 4).
For each dictionary we would need to
We would have to be careful to avoid conflict with other changes being made to the dictionaries.
What do @drdhaval2785 and @gasyoun think? Should we make <> the standard?
But the listing shows 13 of type 1, not 18.
The other 5 are in Type-3.
Upon further reflection, I think we should also consider the option of having NO markup for line breaks in those dictionaries where the digitization was made to observe line breaks (AB also suggests this option to be considered).
Without <> or other line break markup:
<> or other existing line-break markup is repetitive of the inherent '\n' that separates 'lines' in text files.
With <> or other line break markup:
I completely agree that we can do away with <> tags without any information loss.
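To make the "no information loss" point concrete, here is a small sketch of removing the markers while keeping the text-file line structure. The function name `strip_line_break_markup` is hypothetical, and it assumes the marker sits at the start of a line, optionally after whitespace; a marker in the middle of a line is left untouched and would need separate handling.

```python
import re

# Assumption: a marker sits at the start of a line, possibly after
# leading whitespace. Mid-line markers are deliberately not touched.
LEADING_LB = re.compile(r'^(\s*)(?:<>|<div n="lb"\s*/?>)')


def strip_line_break_markup(text: str) -> str:
    """Remove a leading line-break marker from each line of `text`.

    The '\n' separators are kept as they are, so the number of lines
    (and hence the correspondence with the printed lines) is unchanged.
    """
    lines = text.split('\n')
    return '\n'.join(LEADING_LB.sub(r'\1', ln) for ln in lines)
```

The line count before and after is the same, so the '\n' already in the file carries everything the marker did.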
It is a reasonable suggestion that the line-break markup should be consistent across dictionaries.
Totally.
<div n="lb"/> is proper xml
And that is why it makes more sense. But agree with Dhaval it's easier to read the code without anything. So I agree with whatever will be decided.
This comment (from @drdhaval2785 ref) seems relevant to this issue:
I feel that there are some positive sides of keeping line breaks. They are
1. The lines are human readable in the majority of text editors and also on the command line.
2. Texts which have relatively human-readable-sized lines are more amenable to git processes. If a line is too long, locating the actual change between two commits is a pain.
3. It makes physical comparison with the printed text easier.
I completely agree that we can do away with <> tags without any information loss.
@drdhaval2785 So, as I understand it, you would
<>, <div n="lb"/>, <div n="lb">
Right?
Yes. I agree with your paraphrase.
Good to see this happening in one work (the INM) now, which incidentally is the one with which I had started this issue.
@funderburkjim Would you pl. do the same in all the works in a batch at once, so that this issue could be closed?
@Andhrabharati Do we already have a list of those works which need to have line break markup removed? If not, would you provide such a list?
@funderburkjim
Pl. go to the top of this issue (https://github.com/sanskrit-lexicon/csl-devanagari/issues/26#issue-987903322), and you'd get all the info.
Aim to go through the list and remove line-break markers. Also will make corresponding changes to make_xml.py and (if needed) to basicadjust.php or basicdisplay.php -- Goal is to have no change in display details after removal. Will direct commit messages to this issue, by which progress may be followed.
Conversion Progress: line-break markup removed from acc, ap90, md, mw72, pe, pgn, pui and make_xml.py correspondingly revised where necessary.
Conversion Progress: line-break markup removed from shs, skd, snp, vcp, vei, yat and make_xml.py correspondingly revised.
Conversion progress: Through the type-2 dictionaries: ben, bhs, bop, gst, ieg, inm, krm, mci
The type-3 dictionaries have been handled. Some of the type-4 dictionaries remain to be converted.
ae converted. Also corrected markup to expose a few headwords. See commit 4c74ae2.
The other two english-sanskrit dictionaries (bor, mwe) converted.
The remaining type-4 dictionaries have been previously handled.
I think this completes all the conversions.
The type-5 are mostly the 'early' dictionaries (see https://github.com/sanskrit-lexicon/COLOGNE/issues/385).
line-marker conversion not relevant for these.
@Andhrabharati --- I think this issue may be closed.
Yes, @funderburkjim, checked that they are all done.
But you may consider removing the <> in the gst_advertisement.txt also, even if it is not being used.
And would you close this issue?
And a BIG THANKS to you, for doing this piece of work; now one can do a "free" (unobtrusive) manual reading of the files, if & when required.
My next request proposal to you, whenever you feel like taking it up, is to work on hyphen-ending lines in all the texts; you had done this in few works recently.
[As you had rightly mentioned elsewhere, this would ease the ls and other markings.]
Removed the <> from gst advertisement.
@Andhrabharati thanks for preparing the lists of 'types' above -- a big help in organizing the task solution.
A major change.
Removed the <> from gst advertisement.
@funderburkjim Seen that there are two {??} places in this gst advertisement, which skipped my attention earlier. [All other {??} incidences in the CDSL texts were filled up sometime back.]
This gst advertisement text may be corrected as below-
<HI>
BHOTANTA DICTIONARY AND GRAMMAR. Dictionary of the Bhotanta or Boutan Language; printed from a Ms. copy, made by the late R{??} Q. C. G. Schroeter, edit. by J. Marshman. To which is pre- {??} Grammar of the Bhotanta language by Schroeter, edited by W. Carey. 4to. Serampore 1826.
to
<HI>
BHOTANTA DICTIONARY AND GRAMMAR. Dictionary of the Bhotanta or Boutan Language; printed from a Ms. copy, made by the late Rev. F. C. G. Schroeter, edit. by J. Marshman. To which is pre- pended a Grammar of the Bhotanta language by Schroeter, edited by W. Carey. 4to. Serampore 1826.
Type-1: <> in 13 works
acc : 30653
ap90 : 167050
md : 61497
mw72 : 1187
pe : 828
pgn : 4292
pui : 55
shs : 53028
skd : 442533
snp : 154
vcp : 355786
vei : 1780
yat : 24834
Out of these, ap90 has two lines (58049, 147982) beginning with a space and yat has one line (88) not at the beginning.

Type-2: <div n="lb"> in 8 works
ben : 76326
bhs : 69874
bop : 18705
gst : 27883
ieg : 8618
inm : 84332
krm : 24135
mci : 67357
Out of these, mci has one line (12049) beginning with a space.

Type-3: both <> & <div n="lb"> in 5 works
gst : 345
ieg : 1135
inm : 2193
krm : 1540
mci : 2522
All these <> appear to be in front matter (preface etc.) and end matter (corrections/annexures etc.)

Type-4: <div n="lb"/> in 9 works
ae : 42839
bor : 66228
mw72 : 216137
mwe : 88481
pe : 96209
pgn : 5731
pui : 23738
snp : 2119
vei : 35317

Type-5: Miscellaneous in 11 works
armh : this is a spl. category work in Skt. verses, and is split at hemi-stiches of verses; no addl. marking made or necessary in this case.
bur : no relation to print or length; just split randomly after a word ends and without any line markers
cae : no relation to print or length; just split randomly after a word ends and without any line markers
ccs : appears to be split as per print, but without any line markers
gra : all single line entries even if the length is "big"; however each of different word endings (some are comp.) are split separate lines (even if the length is "big")
lan : appears to be split as per print, but without any line markers
mw : this is split in a style of its own (mostly at meaning senses and genders), different from all others
pw : this has no relation to print style or length; just split at gender and meaning indicators
pwg : this has no relation to print style or length; just split at the beginning of assumed <ls>...</ls> (many of which are wrong!)
sch : all single line entries even if the length is "big"
wil : has special markings at line beginnings

Can a common marking for all works, say <>, or no marking be considered as a theme?
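For anyone wanting to re-check the counts in the type listing, a sketch along these lines could be used. It is only an illustration: the name `count_line_break_markup` is hypothetical, and it counts every occurrence of each marker form, not only those at the start of a line, so the figures it gives may differ slightly from the ones listed.

```python
import re
from collections import Counter

# One pattern per marker form named in the type listing.
MARKERS = {
    '<>': re.compile(r'<>'),
    '<div n="lb">': re.compile(r'<div n="lb">'),
    '<div n="lb"/>': re.compile(r'<div n="lb"/>'),
}


def count_line_break_markup(lines):
    """Count occurrences of each line-break marker form in `lines`."""
    counts = Counter()
    for line in lines:
        for name, pattern in MARKERS.items():
            counts[name] += len(pattern.findall(line))
    return counts
```

It takes any iterable of strings, so the lines of a dictionary's xxx.txt (opened with encoding='utf-8') can be passed directly. A work showing non-zero counts for more than one form would belong to a mixed category such as Type-3.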