Allow for common-sense manual improvements to punctuations and formatting

vvasuki commented 2 years ago

I observed in a few threads some insistence on sticking to "what's in the printed text" - even with regard to punctuation and formatting!

Opening this thread so that it may be considered more fully. Some pertinent notes:

Given that the git + dict system allows:

version control so as to retrieve "pristine" versions of files
distributed correction effort
checking of proposed changes with diffs
easy comparison with text images

why not let manual formatting improvements come through at whatever rate they do - as long as they don't affect future programmatic corrections?

gasyoun commented 2 years ago

why not let manual formatting improvements come through at whatever rate they do - as long as they don't affect future programmatic corrections?

You want to go away from the original dictionary format?

vvasuki commented 2 years ago

You want to go away from the original dictionary format?

Where it makes sense - yes! One has to use "common sense" and see things from the perspective of dict users. Not so hard. The constraints of printing on 2-column paper 100+ years ago don't apply to computer screens. And users have come to adopt new equivalent notations and routinely use more punctuation.

vvasuki commented 2 years ago

Also, in today's scenario, where users easily and routinely refer to dozens of dicts side by side, consistency in notation becomes a matter of concern (e.g. https://github.com/sanskrit-lexicon/csl-ldev/issues/7 ). That too motivates harmless deviations from the original.

vvasuki commented 2 years ago

Everyone read this please (via @drdhaval2785 at https://github.com/sanskrit-lexicon/csl-ldev/issues/7#issuecomment-1044433948 ):

The creation of a TEI version of the Cologne Sanskrit Lexicon is part of the Lazarus Project and aims for long-time preservation of the data. It is based on the original digitisations and mark-up versions of the CSL and uses the TEI Guidelines, especially the dictionary module. The objective of the TEI Cologne Sanskrit Lexicon is to preserve all information contained in the original prints, as far as it was preserved in the digitisation process (Kapp and Malten, 1997, as described in), while using a well documented and standardised XML. The second objective is to display the information as consistent and faithfully as possible to the original prints, while allowing the user to choose the writing system in which the Sanskrit words are displayed.

So, no one needs to obsess over "keeping it close to original" here. Others have that aspect well under control. This project can move along to the objective of best serving today's users.

vvasuki commented 2 years ago

Case in point - https://github.com/indic-dict/stardict-sanskrit/issues/139

vvasuki commented 1 year ago

The same dissatisfaction bothers me. Do I feel like reading the mess below?

It could be presented so much better. I hope this changes either here or in some project which will render all this obsolete.

funderburkjim commented 1 year ago

Link for TEI Sanskrit Lexicon: http://c-salt.uni-koeln.de/

There is no ongoing collaboration between the 'Github/sanskrit-lexicon' (CDSL) project at Cologne and the 'C-SALT' project at Cologne.

Maybe @fxru could provide a description of the relation between CDSL and C-SALT.

funderburkjim commented 1 year ago

... this mess could be so much better

Would you provide a mock-up of a better presentation? This would help others understand what is in your mind.

vvasuki commented 1 year ago

There is no ongoing collaboration between the 'Github/sanskrit-lexicon' (CDSL) project at Cologne and the 'C-SALT' project at Cologne.

I didn't say there was; and that's a good thing too! That leaves both projects free to pursue their distinct goals without compromise. The goal of CDSL should be to present what the dict maker intended in the best possible way given the current non-paper media and tech.

... this mess could be so much better

Would you provide a mock-up of a better presentation? This would help others understand what is in your mind.

विकल्पः, पुं, (विरुद्धं कल्पनमिति । वि + कृप + घञ् ।) 

भ्रान्तिः ।
 (यथा, देवीभाग-वते । १ । १९ । ३२ ।
“विकल्पोपहतस्त्वं वै दूरदेशमुपागतः ।
न मे विकल्पसन्देहो निर्व्विकल्पोऽस्मि सर्व्वथा ॥”)

कल्पनम् । इति मेदिनी । पे, ॥
(यथा, भागवते । ५ । १६ । २ ।
“तत्रापि प्रितव्रतरथचरणपरिखातैः सप्तभिः सप्त सिन्धवः उपकॢप्ताः ।   
यत एतस्याः सप्तद्वीपविशेषविकल्पस्त्वया भगवन् खलु सूचितः ॥”)

संशयः । यथा, रघुः । १७ । ४९ ।
(“रात्रिन्दिवविभागेषु यथादिष्टं महीक्षिताम् ।
तत्सिषेवे नियोगेन स विकल्पपराङ्मुखः ॥”)

नानाविधः । यथा, मनुः । ९ । २२८ ।
(“प्रच्छन्नं वा प्रकाशं वा तन्निषेवेत यो नरः ।
तस्य दण्डविकल्पः स्याद्तथेष्टं नृपतेस्तथा ॥”)

विविधकल्पः । स च द्विविधः । व्यवस्थितः । एच्छिकश्च । सोऽप्याकाङ्क्षाविरहे युक्तः ।
 तथा च भविष्ये -

See how much more pleasant and readable that is?

funderburkjim commented 1 year ago

Certainly the format you show is pleasant.

From my naive perspective, I do not see how it derives from the vacaspatyam text -- there is almost no overlap between the two texts.

What am I missing?

vvasuki commented 1 year ago

What am I missing?

That was kalpadruma. Compare with:

Also, please refer to https://github.com/sanskrit-lexicon/csl-ldev/pull/3#issuecomment-1043240375 linked in the first post above - there was even an objection to the addition of quotation marks around quotes because "Not traceable in the printed text"! Such robotic fidelity should be dropped.

funderburkjim commented 1 year ago

Markup can generate the nicer format.

funderburkjim commented 1 year ago

Here is the bit of the vikalpa digitization corresponding to sample display:

OLD
<L>32332<pc>4-371-b<k1>vikalpaH<k2>vikalpaH
vikalpaH¦, puM, (virudDaM kalpanamiti . vi +
kfpa + GaY .) BrAntiH . (yaTA, devIBAga-
vate . 1 . 19 . 32 .
“vikalpopahatastvaM vE dUradeSamupAgataH .
na me vikalpasandeho nirvvikalpo'smi sarvvaTA ..”)
kalpanam . iti medinI . pe, .. (yaTA, BAga-
vate . 5 . 16 . 2 .
“tatrApi pritavrataraTacaraRapariKAtEH saptaBiH
sapta sinDavaH upakxptAH . yata etasyAH sapta-
dvIpaviSezavikalpastvayA Bagavan Kalu sUcitaH ..”

And the changes which generate the above:

NEW
vikalpaH¦, puM, (virudDaM kalpanamiti . vi +
kfpa + GaY .) <lb/><lb/>BrAntiH . <lb/>(yaTA, devIBAgavate <lbinfo n="devIBAga+vate"/>
. 1 . 19 . 32 .
<lb/>“vikalpopahatastvaM vE dUradeSamupAgataH .
<lb/>na me vikalpasandeho nirvvikalpo'smi sarvvaTA ..”)
<lb/><lb/>kalpanam . iti medinI . pe, .. <lb/>(yaTA, BAgavate <lbinfo n="BAga+vate"/>
. 5 . 16 . 2 .
<lb/>“tatrApi pritavrataraTacaraRapariKAtEH saptaBiH
sapta sinDavaH upakxptAH . <lb/>yata etasyAH saptadvIpaviSezavikalpastvayA <lbinfo n="sapta+dvIpaviSezavikalpastvayA"/>
Bagavan Kalu sUcitaH ..”

funderburkjim commented 1 year ago

As you see, there are only two pieces of markup:

<lb/> to generate a line break
<lbinfo n="X+Y"/> to record where the printed text hyphenated a word (X-Y) across a line break, so that the word can be shown joined.
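To make the role of the two tags concrete, here is a minimal Python sketch of how a display layer might consume them. It is only an illustration, not the actual CDSL display code: the function name `render_display` is hypothetical, and it assumes that a newline in the digitization is a soft wrap that the display can treat as a space. Transliteration from SLP1 to Devanagari is left out.

```python
import re

LB = re.compile(r"<lb\s*/>")
LBINFO = re.compile(r"\s*<lbinfo\b[^>]*/>")


def render_display(marked_up: str) -> str:
    """Turn marked-up entry text into display text (illustrative only)."""
    # Assumption: newlines in the digitization are soft wraps, shown as spaces.
    text = marked_up.replace("\n", " ")
    # <lbinfo .../> only records where the printed hyphen was; hide it.
    text = LBINFO.sub("", text)
    # Each <lb/> becomes a real line break in the display.
    text = LB.sub("\n", text)
    # Tidy the spaces left around the generated breaks.
    lines = [re.sub(r" {2,}", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)


sample = (
    'kfpa + GaY .) <lb/><lb/>BrAntiH . <lb/>(yaTA, devIBAgavate '
    '<lbinfo n="devIBAga+vate"/>\n'
    '. 1 . 19 . 32 .'
)
print(render_display(sample))
```

With this sketch, the doubled `<lb/><lb/>` yields the blank line that separates senses, and the citation starts on its own line.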

The lbinfo is awkward to write, but it could be simplified, for example:

kfpa + GaY .) <lb/><lb/>BrAntiH . <lb/>(yaTA, devIBAgavate <lbinfo n="devIBAga+vate"/>
SIMPLER, using a special character (such as '@')
kfpa + GaY .) <lb/><lb/>BrAntiH . <lb/>(yaTA, devIBAga@vate
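If the '@' shorthand were adopted, a small preprocessing step could expand it back into the longer lbinfo form. The following Python sketch shows one way this might work. The function name `expand_shorthand` is hypothetical, and the sketch assumes that '@' never occurs in the SLP1 text itself and that the two halves of the word consist only of letters and the avagraha apostrophe.

```python
import re

# Assumption: '@' is not used anywhere else in the digitization.
SHORTHAND = re.compile(r"([A-Za-z']+)@([A-Za-z']+)")


def expand_shorthand(text: str) -> str:
    """Rewrite X@Y as 'XY <lbinfo n="X+Y"/>' (illustrative only)."""
    def expand(match: re.Match) -> str:
        first, second = match.group(1), match.group(2)
        return f'{first}{second} <lbinfo n="{first}+{second}"/>'

    return SHORTHAND.sub(expand, text)


short = "kfpa + GaY .) <lb/><lb/>BrAntiH . <lb/>(yaTA, devIBAga@vate"
print(expand_shorthand(short))
```

Contributors would then type only the '@', and the verbose form would exist only in generated output.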

Thus, at least for skd, the digitization could be changed so that

the display is considerably easier to read, and
the 'sanctity' of the original digitization is maintained.
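One way to check the second point mechanically is to strip the markup again and compare the result with the OLD text. The Python sketch below does this for the lines shown in the OLD/NEW sample. The function name `restore_original` is hypothetical, and the sketch assumes the layout used in that sample: the joined word, one space, the lbinfo tag, then the end of the line.

```python
import re

# Matches: joined word, one space, <lbinfo n="X+Y"/>, end of line.
HYPHENATED = re.compile(r'\S+ <lbinfo n="([^"+]+)\+([^"]+)"/>\n')


def restore_original(marked_up: str) -> str:
    """Recover the line-by-line digitization from the marked-up text."""
    # Put the printed hyphen and line break back where lbinfo says they were.
    text = HYPHENATED.sub(
        lambda m: f"{m.group(1)}-\n{m.group(2)} ", marked_up
    )
    # <lb/> carries presentation only, so removing it loses nothing.
    return text.replace("<lb/>", "")


NEW = (
    'kfpa + GaY .) <lb/><lb/>BrAntiH . <lb/>(yaTA, devIBAgavate '
    '<lbinfo n="devIBAga+vate"/>\n'
    '. 1 . 19 . 32 .\n'
)
OLD = (
    'kfpa + GaY .) BrAntiH . (yaTA, devIBAga-\n'
    'vate . 1 . 19 . 32 .\n'
)
assert restore_original(NEW) == OLD
```

A check like this could run on proposed changes, so that manual formatting improvements are accepted as long as the round trip still reproduces the original lines.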

funderburkjim commented 1 year ago

For comparison to the skd-dev example above, here is the current display of vikalpaH in skd:

vvasuki commented 1 year ago

the 'sanctity' of the original digitization is maintained.

Why put that burden on yourself? As mentioned, there is a separate project focused on "sanctity"-preservation.
Sure - I suppose that @drdhaval2785's scripts can insert such extra new-lines or quotes using your markup based on what users update (at csl-dev?) - it's just more (unnecessary) trouble; and is furthermore a cause for delay.

funderburkjim commented 1 year ago

it's just more (unnecessary) trouble; and is furthermore a cause for delay.

What is your proposed remedy? What is your proposed path ending in a better display of skd?

funderburkjim commented 1 year ago

Why put that burden on yourself? As mentioned there is a separate project focused on "sanctity"-preservation.

[Here is link to 'lazarus project' : https://cceh.uni-koeln.de/portfolio/lazarus/]

I think this ('sanctity ...') remains a responsibility of CDSL. We can't just say 'Oh, someone else is taking care of this aspect.'

However, we are not restricted to only this task.
We are free to create better displays, for instance better displays for skd.
@vvasuki Are you interested in leading an effort for a better skd?

vvasuki commented 1 year ago

We are free to create better displays, for instance better displays for skd. @vvasuki Are you interested in leading an effort for a better skd?

No. All I want is for users (myself included) to be able to add superior presentation markup wherever they care to while referring to the dict, and for maintainers not to reject such improvements out of hand. So, it should be written down in some contribution policy somewhere.

And, @drdhaval2785 - please clear backlog at https://github.com/sanskrit-lexicon/csl-ldev/pulls - I recently thought of editing some typo, but gave up upon seeing it.

I think this ('sanctity ...') remains a responsibility of CDSL. We can't just say 'Oh, someone else is taking care of this aspect.'

CDSL is free to burden itself of course, but I am curious why you think you can't just say 'Oh, someone else is taking care of this aspect.'

drdhaval2785 commented 1 year ago

Will clear backlog soon.

vvasuki commented 9 months ago

Related, but insufficient - https://github.com/sanskrit-lexicon/COLOGNE/issues/419

sanskrit-lexicon / csl-orig

Allow for common-sense manual improvements to punctuations and formatting #747