Semantic line breaks - Githubissues

funderburkjim commented 1 year ago

big lines undesirable

Of the 17000+ entries, roughly 5% have text blobs of 1000+ characters. The biggest entry (lakzaRa) has 16K characters. A list of the headwords (k1) for this top 5% is in the file compare/textsize.txt; the sizes there are given in hundreds of characters.
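
A size report of this kind could be produced roughly as follows. This is only a sketch, not the script that generated compare/textsize.txt: the function name `big_entries`, the (k1, text) pair input, the use of integer division for "hundreds of characters", and the sample data are all assumptions made for illustration.

```python
def big_entries(entries, threshold=1000):
    """Return (k1, size_in_hundreds) for entries whose text has at least
    `threshold` characters, largest first.

    `entries` is an iterable of (k1, text) pairs. How those pairs are read
    from the digitization is deliberately left out of this sketch.
    """
    report = []
    for k1, text in entries:
        size = len(text)
        if size >= threshold:
            # Size in hundreds of characters; rounding down is an assumption.
            report.append((k1, size // 100))
    # Largest entries first, then by headword for a stable listing.
    report.sort(key=lambda item: (-item[1], item[0]))
    return report


# Hypothetical sample data, not taken from the dictionary.
sample = [("short", "x" * 250), ("medium", "x" * 1000), ("long", "x" * 16400)]
for k1, hundreds in big_entries(sample):
    print(k1, hundreds)
```

The threshold is a parameter so that the same listing can be rerun after semantic breaks are added, to check how many big blobs remain.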

These big entries are hard to work with, e.g., to make a spelling correction, or just to read.

semantic breaks.

The entries in BHS could be made more useful by adding semantically meaningful line breaks, along with markup of the <div n="X"/> type.

A side effect of adding semantic division markup would be removal of big text blobs.

This may be considered an enhancement idea for BHS digitization.

It has a lower priority than work on the Boehtlingk dictionaries.

funderburkjim commented 1 year ago

The red dots indicate where line breaks in the Ananda entry would make the entry more comprehensible.

Andhrabharati commented 1 year ago

> The red dots indicate where line breaks in the Ananda entry would make the entry more comprehensible.

Yes, I start with such marking, before removing the line breaks altogether in my working. [see my recent GRA & pwk files, as examples.]

If it helps in some manner, I can "retain" this marking. [But I see that quite a few cdsl works have lost the line-breaks as per the print; and some have very haphazard breaks, about which I had already posted earlier.]

Andhrabharati commented 1 year ago

Another such marking that I chose to remove is the hyphenation, which also is in varied styles in different cdsl works. [I presumed that once the hyphenation is "resolved" (which I did in my working) at the line-breaks, there is no point retaining the markup.]
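
For illustration, resolving hyphenation at line ends might look like the sketch below. It is a naive version and not the procedure actually used on the GRA or pwk files: the function name `resolve_eol_hyphens` is hypothetical, and it treats every line-final hyphen as a print line-break hyphen, so a genuine compound hyphen that happens to fall at a line end would be merged wrongly.

```python
def resolve_eol_hyphens(lines):
    """Join a word split by a line-final hyphen with its continuation.

    The fragment before the hyphen is moved to the start of the next line,
    and the hyphen itself is dropped.
    """
    resolved = []
    pending = None  # word fragment carried over from a hyphenated line end
    for raw in lines:
        line = (pending or "") + raw.strip()
        pending = None
        if line.endswith("-"):
            # Split off the last token; it continues on the next line.
            head, _, fragment = line.rpartition(" ")
            pending = fragment[:-1]
            if head:
                resolved.append(head)
        else:
            resolved.append(line)
    if pending is not None:
        # Final line ended in a hyphen with nothing to join: restore it.
        resolved.append(pending + "-")
    return resolved


# Hypothetical sample lines, not taken from a cdsl file.
print(resolve_eol_hyphens(["the Bodhi-", "sattva path"]))
```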

funderburkjim commented 1 year ago

Resolution of hyphens at end of line is good.

Hope that in PWK, PWG, PWKVN you are retaining the line breaks of xxx.txt when there is <div> markup.

Indeed most of the 'old' cdsl digitizations lost line breaks.

The 'newer' digitizations (those with funding from DFG-NEH Project 2010-2013) typically retained line breaks.

In a very few dictionaries, e.g. ap90.txt, I have tried to retain line-break information while resolving end-of-line hyphenation. Thomas did something similar with Burnouf. There is also the lbinfo tag in ccs.txt, md.txt and stc.txt.

However, this 'lbinfo' approach is quite awkward, and I think it is now appropriate for cdsl versions to drop this requirement.

Thus, I think that line-break retention should now be dropped in favor of what I am terming 'semantic' markup. The <div> element is the current markup tool that we have for such semantic markup. The n attribute of the div element (e.g. <div n="X"/>) can be coordinated with the display to generate useful line breaks or indentation in entry displays.
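
As a rough sketch of how a display might use the n attribute, the function below turns each <div n="X"/> into a line break and indents the following text according to X. It is not the cdsl display code: the name `render_divs`, the n values "1" and "2", the indentation widths, and the sample entry text are all assumptions chosen for the example.

```python
import re

# Matches the self-closing form <div n="X"/> and captures the value X.
DIV_RE = re.compile(r'<div n="([^"]*)"/>')


def render_divs(text, indent_by_n=None):
    """Break `text` at each <div n="X"/> and indent by the width mapped to X.

    `indent_by_n` maps an n value to a number of spaces; unmapped values
    get no indentation.
    """
    indent_by_n = indent_by_n or {}
    # With one capture group, split yields [before, n1, seg1, n2, seg2, ...].
    parts = DIV_RE.split(text)
    lines = []
    first = parts[0].strip()
    if first:
        lines.append(first)
    for n, segment in zip(parts[1::2], parts[2::2]):
        lines.append(" " * indent_by_n.get(n, 0) + segment.strip())
    return "\n".join(lines)


# Hypothetical entry text and n values.
entry = 'head text <div n="1"/>first sense <div n="2"/>sub sense <div n="1"/>second sense'
print(render_divs(entry, {"1": 2, "2": 4}))
```

Keeping the mapping from n values to indentation outside the entry text means the display can change without touching the digitization.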

large text blobs are problematic.

Andhrabharati commented 1 year ago

> Hope that in PWK, PWG, PWKVN you are retaining the line breaks of xxx.txt when there is <div> markup.

Yes, I retained them as is for now; but I intend to drop some of these and introduce some new ones (something similar to what was done in GRA recently).

I shall post once I have concluded my thinking.

sanskrit-lexicon / BHS

Semantic line breaks #2
