drdhaval2785 commented 5 months ago

Dear all, this issue has been on my mind for a long time. In many CDSL dictionaries, we have line breaks as per the printed dictionaries. In many, we don't.

This issue is devoted to deciding the usefulness or otherwise of line breaks

Pros

Helps locate a sentence in printed dictionary, because text looks almost the same in digitized form.
May help in highlighting the blocks in PDF for given Lnum.

Cons

Unnecessary hyphenation.
Unable to search for hyphenated words which lie on the edge of lines, without some hack.
In many applications / frontends, the display breaks at line breaks where a continuous display would have been better. Tablets, mobiles and browsers have different screen sizes. On smaller screens such as mobile, one line of CDSL data may wrap onto a second line and then break abruptly partway through it. This was particularly true of stardict apps.

Should we change from line breaks to sans line breaks?

The question deserves attention, because @Andhrabharati submits his major corrections in the latter format. If we agree in principle to go with that format, we won't have the hassle of analysing diffs and spending a lot of time.

What about invertibility?

We can have a json with lnum as key. It will hold "old" and "new" text blobs. So, it will be possible to go back and forth.

In case we made some change to our new data, diff can be found out and the same can be carried to old one, if wished.

Only the changes at line ends will not be carried back computationally. It will have to be handled manually.
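A minimal sketch of the lnum-keyed JSON idea, assuming the "new" text is obtained simply by joining lines with a single space. The function names, the sample entry, and the joining rule are illustrative assumptions, not an agreed format; end-of-line hyphens are deliberately left untouched, since those would need manual handling.

```python
import json

def build_record(old_text):
    """Pair the line-broken text with a joined version.

    Assumed rule: every newline becomes one space. Hyphenated
    words at line ends are NOT repaired here.
    """
    new_text = " ".join(line.strip() for line in old_text.split("\n"))
    return {"old": old_text, "new": new_text}

def build_map(entries):
    """entries: dict of lnum (as string) -> text with line breaks."""
    return {lnum: build_record(text) for lnum, text in entries.items()}

# Hypothetical sample entry, keyed by its lnum.
entries = {"578": "petere, postulare dissoluto\nin vocalem"}
mapping = build_map(entries)

# Serialise and read back: the "old" blob survives unchanged,
# so one can move between the two forms.
blob = json.dumps(mapping, ensure_ascii=False, indent=2)
restored_old = json.loads(blob)["578"]["old"]
```

Keeping both blobs side by side is what makes the conversion reversible, at the cost of storing each entry twice.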

Historical experience

We had made a quantum jump when we moved from Anglicised Sanskrit to IAST / metaline. We also had invertibility principle then. But in practice, no one has ever shown any interest to carry back changes made to IAST version to AS version. Same may happen here. Much ado about nothing.

View

My view is that we should do away with line breaks.

What do others say?

drdhaval2785 commented 5 months ago

Example of ugly line breaks in the frontend of stardicts.

Andhrabharati commented 5 months ago

@drdhaval2785

I am glad that your very first 'keen' attempt of looking into my file(s) 'prompted' you to think of changing the 'stand' (that stood for many years now).

I know (for sure) that Jim could add few more Cons to your list and I have many more (but that is not worth spending my time at).

And apart from 'leaving away' the line-breaks, I make several other 'important' structural changes in my files. [Probably you would be noticing them and bring onto board for discussion/voting, as you spend some more time looking at various files that I had posted.]

Coming to retaining the line-breaks, I think they should be retained at "verse blocks" in VCP and SKD, which often span multiple columns (and even multiple pages). Reading such long unbroken matter would be a bad experience, as the reader's mind now, more or less, is 'tuned' to the "semantic breaks" introduced in printing. But within the 'prose' paragraphs, they can be done away with.

Now coming to the Pros that you had listed--

As I understand, the need for looking at the print dictionary comes up, to compare the digital text, mostly for correcting the errors being reported by the users (or otherwise). How is it being dealt with in the case of MW, that is the mostly reported work [I would roughly estimate it to be 90-95% in the user feedback], whose digital text does not contain the line-breaks and has also deviated (a bit too) much from the print (except for having the page-column info intact)?

Andhrabharati commented 5 months ago

Speaking of the MW digital text format, I thought I should 'leak' that my current working 'prompted' me to make some major structural changes in it, some of them moving closer to print matter.

I am sure that this would create some hiccups, if (and when) I post my MW work.

drdhaval2785 commented 5 months ago

Any thoughts @funderburkjim?

gasyoun commented 5 months ago

@drdhaval2785

> We can have a json with lnum as key. It will hold "old" and "new" text blobs. So, it will be possible to go back and forth.

And double the size of each dictionary?

> My view is that we should do away with line breaks.

I'm for it. But that would take years for just this one task and stop all the others, is it worth now?

@Andhrabharati

> need for looking at the print dictionary comes up, to compare the digital text, mostly for correcting the errors being reported by the users (or otherwise).

exactly

> MW, that is the mostly reported work [I would roughly estimate it to be 90-95% in the user feedback]

right

Andhrabharati commented 5 months ago

> But that would take years for just this one task and stop all the others, is it worth now?

On what basis did you arrive at this, Marcis?

I have been doing this (removing or alt. marking the line-breaks) in just a few minutes for each CDSL dictionary that I work on!

funderburkjim commented 5 months ago

> removing or alt. marking the line-breaks

In AB's [revision to MD](https://github.com/sanskrit-lexicon/csl-orig/commit/2dffafb599d0b4f5d2015a87d3ed469b8dd32e02) dictionary, he introduced the convention of using a special character to indicate line breaks: 🞄 = U+1F784

{#a#}¦ <hom>1.</hom> a, <ab>pn.</ab> {%root used in the inflexion of%} 
idam 🞄{%and in some particles%}: a-tra, a-tha.

make_xml.py can 'ignore' this character, so it doesn't get in the way of displays.

This seems like a good solution.
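To illustrate what 'ignoring' the character could look like, here is a small sketch. It is not taken from make_xml.py; the function names are hypothetical, and the only thing borrowed from the thread is the marker U+1F784 and the MD sample line.

```python
# U+1F784, the line-break marker described above.
LINE_BREAK_MARK = "\U0001F784"

def for_display(text):
    """Drop the marker so the text displays as one continuous run."""
    return text.replace(LINE_BREAK_MARK, "")

def for_print_layout(text):
    """Turn each marker back into a newline, restoring the printed line layout."""
    return text.replace(LINE_BREAK_MARK, "\n")

# Part of the MD sample quoted above.
sample = "idam " + LINE_BREAK_MARK + "{%and in some particles%}: a-tra, a-tha."

display_text = for_display(sample)
print_text = for_print_layout(sample)
```

Because the marker is a single character that appears nowhere else in the data, both directions are a plain string replacement.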

As a general point, I think that preservation of line breaks has served its purpose. The original 'later' digitizations provided by @maltenth honored line breaks -- this was in part to help in the internal double-entry error detection process of the output by the Sanskrit typists.

Then, when I came to make displays for these later dictionaries, I thought it was best to preserve line breaks in the displays to aid in correction investigation.

We are now in the process of making major revisions to these original forms -- adding markup, tooltips, links, etc. to make the dictionary displays more useful. These changes also provide the basis for future NLP-type work with the dictionary corpora (e.g. DAtu extraction).

So line-break preservation is no longer as useful as it once was. For some dictionaries (Burnouf and Apte90 come to mind), I used a <lbinfo> tag to preserve line-break info. But I think AB's use of a special character is better -- it doesn't get in the way as much.

Current opinion: For CDSL dictionaries where line-breaks are currently preserved, use the special character. But feel free to use multiline forms in the xxx.txt (e.g. at the <div n="pfx">) for semantically meaningful breaks. For example:


OLD (CURRENT)
<L>578<pc>018-b<k1>arT<k2>arT
{#arT#}¦ 10. {%P.%} (v. {#arTa#}) petere, postulare (gr. <lang n="greek">αἰτέω</lang> dissoluto
{%r%} in vocalem {%i%}, cf. {#arTa#}).
<div n="pfx">c. {#pra#} petere, appetere, desiderare, concupiscere. BR. 2. 11.
12. 13. 16. IN. 5. 33. SU. 1. 26. 3. 11.
<div n="pfx">c. {#sam#} cogitare, putare, existimare. UR. 18. 9. 18. 5. infr.
<LEND>

NEW? 
<L>578<pc>018-b<k1>arT<k2>arT
{#arT#}¦ 10. {%P.%} (v. {#arTa#}) petere, postulare (gr. <lang n="greek">αἰτέω</lang> dissoluto 🞄{%r%} in vocalem {%i%}, cf. {#arTa#}).🞄
<div n="pfx">c. {#pra#} petere, appetere, desiderare, concupiscere. BR. 2. 11.🞄 12. 13. 16. IN. 5. 33. SU. 1. 26. 3. 11.🞄
<div n="pfx">c. {#sam#} cogitare, putare, existimare. UR. 18. 9. 18. 5. infr.
<LEND>
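The OLD-to-NEW step can be sketched roughly as follows. This assumes one fixed convention (a space before the marker when two lines are joined, and the marker alone before a kept break), whereas the NEW sample above mixes both spacings; the function name and the abridged sample lines are illustrative only.

```python
MARK = "\U0001F784"  # the line-break marker (U+1F784) discussed in this thread

def mark_line_breaks(body_lines):
    """Join the body lines of one entry, recording each original line break.

    Assumed rules (not a settled convention):
    - a break before a line starting with "<div" is kept, with MARK appended;
    - any other break is removed and replaced by " " + MARK;
    - the last body line is left untouched.
    """
    pieces = []
    for i, line in enumerate(body_lines):
        nxt = body_lines[i + 1] if i + 1 < len(body_lines) else None
        if nxt is None:
            pieces.append(line)
        elif nxt.startswith("<div"):
            pieces.append(line + MARK + "\n")
        else:
            pieces.append(line + " " + MARK)
    return "".join(pieces).split("\n")

# Abridged body lines of entry 578 (between the <L> metaline and <LEND>).
old_body = [
    "{#arT#}¦ 10. petere, postulare dissoluto",
    "{%r%} in vocalem {%i%}, cf. {#arTa#}).",
    '<div n="pfx">c. {#pra#} petere. BR. 2. 11.',
    "12. 13. 16.",
    '<div n="pfx">c. {#sam#} cogitare. infr.',
]
new_body = mark_line_breaks(old_body)
```

With these rules the five printed lines collapse to three, one per semantic unit, while every original break position stays recoverable from the marker.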

funderburkjim commented 5 months ago

@Andhrabharati Do you convert line-breaks (`\n`) to

`🞄` (one character)
` 🞄` (two characters, space before 🞄)
`🞄 ` (two characters, space after 🞄)

Also, how do you handle end-of-line hyphens?

vvasuki commented 5 months ago

> make_xml.py can 'ignore' this character, so it doesn't get in the way of displays.
>
> This seems like a good solution.

No - if you're using xml, use a (specially defined) xml tag and not some ad hoc special-meaning symbol.
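The two positions are not mutually exclusive: the marker could stay in the text file and be turned into a tag when the XML is generated. A rough sketch follows, assuming an empty `<lb/>` element as the tag; that name is only an assumption here (TEI uses `lb` for line beginnings), not a tag defined in CDSL, and the `<body>`/`<i>` wrapper is likewise made up for the example.

```python
import xml.etree.ElementTree as ET

MARK = "\U0001F784"  # the line-break marker (U+1F784) discussed in this thread

def marker_to_tag(text, tag="lb"):
    """Replace each marker with an empty XML element such as <lb/>."""
    return text.replace(MARK, "<" + tag + "/>")

# Hypothetical XML fragment that still carries the marker.
body = "<body>idam " + MARK + "<i>and in some particles</i>: a-tra, a-tha.</body>"
tagged = marker_to_tag(body)

# The result parses as XML; the breaks are ordinary elements, and
# plain-text extraction skips them without any special handling.
root = ET.fromstring(tagged)
breaks = root.findall("lb")
plain_text = "".join(root.itertext())
```

A display that does not care about print layout can then ignore the element, while one that does can render it as a break.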

sanskrit-lexicon / COLOGNE

With or without linebreaks #419
