Open funderburkjim opened 1 year ago
The red dots indicate where line breaks in the Ananda entry would make the entry more comprehensible.
Yes, I start with such marking, before removing the line breaks altogether in my working. [see my recent GRA & pwk files, as examples.]
If it helps in some manner, I can "retain" this marking. [But I see that quite a few cdsl works have lost the line-breaks as per the print; and some have very haphazard breaks, about which I had already posted earlier.]
Another such marking that I chose to remove is the hyphenation, which also is in varied styles in different cdsl works. [I presumed that once the hyphenation is "resolved" (which I did in my working) at the line-breaks, there is no point retaining the markup.]
Resolution of hyphens at end of line is good.
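As a rough illustration of resolving end-of-line hyphenation, here is a minimal Python sketch. It is not the procedure actually used for GRA or pwk; the function name `resolve_eol_hyphens` is hypothetical. It assumes every hyphen at the end of a line marks a word split by the typesetter, so a genuine compound hyphen that happens to fall at a line end would be wrongly removed and would need separate handling.

```python
import re

# Assumption: a hyphen that follows a word character and ends a line marks a
# word broken by the typesetter. The first token of the next line is pulled
# up to complete the word, and the line break is kept right after that word.
EOL_HYPHEN = re.compile(r"(?<=\w)-\n\s*(\S+)[ \t]*")

def resolve_eol_hyphens(text):
    """Join words split by an end-of-line hyphen, keeping one break per line."""
    return EOL_HYPHEN.sub(r"\1\n", text)

sample = "the Sans-\nkrit word is com-\nmon here"
print(resolve_eol_hyphens(sample))
```

The line count is unchanged; only the position of each break moves to the end of the completed word.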
I hope that in PWK, PWG, PWKVN you are retaining the line breaks of xxx.txt when there is <div> markup.
Indeed most of the 'old' cdsl digitizations lost line breaks.
The 'newer' digitizations (those with funding from DFG-NEH Project 2010-2013) typically retained line breaks.
In a very few dictionaries, e.g. ap90.txt, I have tried to retain line-break information while resolving end-of-line hyphenation. Thomas did something similar with Burnouf. There is also the lbinfo tag in ccs.txt, md.txt and stc.txt.
However, this 'lbinfo' approach is quite awkward, and I think it is now appropriate for cdsl versions to drop this requirement.
Thus, I think that line-break retention should now be dropped in favor of what I am terming 'semantic' markup. The <div> element is the current markup tool that we have for such semantic markup.
The n-attributes of the div element (e.g. <div n=X/>) can be coordinated with the display to generate useful line breaks or indentation in entry displays.
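To make the idea concrete, here is a minimal Python sketch of how a display might turn div markers into line breaks and indentation. It is only an illustration, not the code of the existing cdsl displays. The function name `render_divs` and the mapping from n values to indentation depth are hypothetical; the actual n values and their meanings differ from dictionary to dictionary.

```python
import re

# Matches an empty div element with a quoted or unquoted n attribute,
# e.g. <div n="2"/> or <div n=P/>, plus any whitespace around it.
DIV_RE = re.compile(r'\s*<div n="?([^"/>\s]+)"?\s*/>\s*')

# Hypothetical mapping: n value -> number of spaces to indent the new line.
INDENT = {"1": 0, "2": 2, "P": 4}

def render_divs(entry, indent=INDENT):
    """Replace each div marker with a line break followed by indentation."""
    def repl(match):
        depth = indent.get(match.group(1), 0)  # unknown n values get no indent
        return "\n" + " " * depth
    return DIV_RE.sub(repl, entry)

print(render_divs('first sense <div n="1"/>second sense <div n="2"/>sub-sense'))
```

With a mapping of this kind, the stored text needs no physical line breaks; the display decides how each n value is laid out.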
large text blobs are problematic.
Hope that in PWK, PWG, PWKVN you are retaining the line breaks of xxx.txt when there is div markup.
Yes, I retained them as is for now; but I intend to drop some of these and introduce some new ones (something similar to what was done in GRA recently).
I shall post once I conclude my thinking.
big lines undesirable
Of the 17000+ entries, roughly 5% of the entries have text blobs of size 1000+ characters. The biggest entry (lakzaRa) has 16K characters. A list of the headwords (k1) for this top 5% is in file compare/textsize.txt. The numbers are in hundreds of characters.
These big entries are hard to work with, e.g., to make a spelling correction, or just to read.
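A list like the one described above could be produced along the following lines. This is a minimal sketch, not the script that generated compare/textsize.txt. It assumes the entries have already been parsed into (k1, text) pairs; the function name `largest_entries` and the sample data are hypothetical.

```python
def largest_entries(entries, fraction=0.05):
    """Return (k1, size) for the largest entries, size in hundreds of characters.

    entries:  list of (k1, text) pairs; parsing the digitization is assumed done.
    fraction: share of entries to report (0.05 means the top 5%).
    """
    sizes = [(k1, len(text)) for k1, text in entries]
    sizes.sort(key=lambda pair: pair[1], reverse=True)  # biggest first
    count = max(1, round(len(sizes) * fraction))        # always report at least one
    return [(k1, size // 100) for k1, size in sizes[:count]]

# Hypothetical sample data, not taken from the BHS digitization.
sample = [("a", "x" * 250), ("b", "x" * 1600), ("c", "x" * 50)]
print(largest_entries(sample, fraction=0.34))
```

A list of pairs is used rather than a dictionary so that repeated headwords are not merged.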
semantic breaks.
The entries in BHS could be made more useful by adding semantically-meaningful line breaks, along with <div n="X"/> type markup. A side effect of adding semantic division markup would be removal of big text blobs.
This may be considered an enhancement idea for BHS digitization.
It has a lower priority than work on the Boehtlingk dictionaries.