sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

ellipsis character in pw.txt #52

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

I came across an interesting observation. There are 1576 occurrences of the ellipsis character (http://www.codetable.net/decimal/8230) in pw.txt. Most of them are invisible to the naked eye.
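For anyone who wants to reproduce a count like this, here is a minimal sketch. It assumes pw.txt is UTF-8 encoded; the function names are hypothetical and not part of any existing script in this repository.

```python
ELLIPSIS = "\u2026"  # horizontal ellipsis, decimal code point 8230


def count_ellipsis(lines):
    """Return (total ellipsis characters, number of lines containing one)."""
    total = 0
    lines_hit = 0
    for line in lines:
        n = line.count(ELLIPSIS)
        total += n
        if n:
            lines_hit += 1
    return total, lines_hit


def count_ellipsis_in_file(path):
    """Count ellipsis characters in a file, assuming UTF-8 encoding."""
    with open(path, encoding="utf-8") as f:
        return count_ellipsis(f)
```

Calling `count_ellipsis_in_file("pw.txt")` would then give the total and the number of affected lines, which helps when deciding which entries to inspect by hand.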

@funderburkjim Please have a look at them.

gasyoun commented 8 years ago

Representing book content or added meta data, Jim?

funderburkjim commented 8 years ago

As I've mentioned elsewhere, Thomas used the ellipsis as a 'soft space' character.

I think he thought of it as a sort of 'glue'. He used this widely in the original MONIER.ALL (part of the mwtxt.zip download, renamed as mw_orig.txt) that we (Peter, Malcolm, and I) started with in 2006. In mw.xml, the frequent <c>...</c> coding is based on interpreting the ellipsis as glue that connects some 'chunk' (that's what the c stands for) of content.

Thomas continued this usage of the ellipsis in PW and PWG. For instance, in the original pw.txt there were 681351 instances of this ellipsis character, according to pw-meta.txt, which also mentions that on Nov 7, 2015, I decided it was safe to remove these within groupings such as the representation of devanagari #{xyz} or of italics {%xyz%}. In these cases, the opening and closing curly braces already provide grouping, so the use of the ellipsis as a 'sticky space character' was unneeded.
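The kind of removal described above could be sketched roughly as follows. This is not the script that was actually run in 2015. It assumes that "removing" an ellipsis means replacing it with a normal space, that groups never nest, and that a group never spans more than one line; the name `unglue_groups` is hypothetical.

```python
import re

ELLIPSIS = "\u2026"

# Matches a devanagari group #{...} or an italics group {%...%}.
# Assumption: groups do not nest and do not cross line boundaries.
GROUP = re.compile(r"#\{[^}]*\}|\{%.*?%\}")


def unglue_groups(line, replacement=" "):
    """Replace ellipsis characters only inside #{...} and {%...%} groups.

    Ellipses outside such groups (for example in literary source
    references) are left untouched.
    """
    return GROUP.sub(
        lambda m: m.group(0).replace(ELLIPSIS, replacement), line
    )
```

Restricting the substitution to the matched groups is what keeps the remaining ellipses, such as those in the literary sources, in place.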

This removed the bulk of the ellipsis characters. I think I did something similar in PWG, possibly in a couple of other dictionaries of that era (ca. 2005).

However, there were still some additional ellipsis characters. Because of the way Thomas coded the literary sources, with a beginning character (the macron) but with no curly-brace-like beginning and ending characters, the ellipsis serves a useful purpose. However, in our current work, I am essentially taking the point of view that we eliminate the gluing space (ellipsis). This introduces some awkwardness and infidelity of representation, but my judgment is that these drawbacks of the removal of the remaining ellipses are not material.

In the displays (the disp.php program), the ellipses are uniformly changed to normal space characters, which is why they are invisible to the naked eye.

funderburkjim commented 8 years ago

@thomasincambodia Thomas, if you see this, maybe you would comment on how you thought of the ellipsis character.

Also, I am curious about how the markup of the literary sources came to be. It is my understanding that the initial digitization for PW was done by the Sanskrit typists. But, I suspect that you did various kinds of 'post-processing' or 'adjustment' to the initial digitization, as it seems unlikely that the Sanskrit typists would have entered by hand 600,000+ ellipsis characters as they were typing!

Also, there are so many variances in the literary sources (such as capitalization) between the digitization and the printed page that it seems unlikely to me that they were all introduced by the original typing. Instead, they may be artifacts of post-processing done to identify the literary sources.

I hope you'll see this and share with us a bit of this history.

drdhaval2785 commented 8 years ago

From past experience, it seems that @thomasincambodia replies very promptly when his name is mentioned. So I think he is simply not being notified of the other threads. I have not checked whether he has been made a collaborator in the sanskrit-lexicon organisation. @funderburkjim may add him so that he is notified of all developments in the various issues. Or Thomas may click the follow and watch buttons to follow all discussions if he wishes.

funderburkjim commented 8 years ago

@drdhaval2785 / @gasyoun I was not sure of the details when I added Thomas in sanskrit-lexicon organization.

Maybe you should review the settings at https://github.com/orgs/sanskrit-lexicon/people for @thomasincambodia and adjust if needed.

I'm also not sure whether Thomas has bandwidth issues, or how frequently he checks the notifications in his email. Also, due to the high volume of such notifications recently, he may be (like me) swamped by the notification flood.

gasyoun commented 8 years ago

@funderburkjim https://github.com/orgs/sanskrit-lexicon/people/thomasincambodia thomasincambodia has access to 4 repositories

sanskrit-lexicon / PWK
sanskrit-lexicon / Cologne
sanskrit-lexicon / CORRECTIONS
sanskrit-lexicon / GreekInSanskrit

That was before. I have now made Thomas an Admin on each of these repositories:

sanskrit-lexicon / PWG
sanskrit-lexicon / PWK
sanskrit-lexicon / Cologne
sanskrit-lexicon / SCH
sanskrit-lexicon / MWS
sanskrit-lexicon / DCS
sanskrit-lexicon / VCP
sanskrit-lexicon / ApteES
sanskrit-lexicon / SKD
sanskrit-lexicon / MW72
sanskrit-lexicon / MCI
sanskrit-lexicon / GRA
sanskrit-lexicon / CORRECTIONS
sanskrit-lexicon / WIL
sanskrit-lexicon / ArabicInSanskrit
sanskrit-lexicon / Wil-YAT
sanskrit-lexicon / GreekInSanskrit
sanskrit-lexicon / Cologne-Sanskrit-Tamil
sanskrit-lexicon / hwnorm1
sanskrit-lexicon / sanskrit-lexicon.github.io
sanskrit-lexicon / BHS
sanskrit-lexicon / VEI

funderburkjim commented 8 years ago

@gasyoun Thanks, Marcis!

funderburkjim commented 8 years ago

@drdhaval2785 I am inclined to do nothing 'globally' with the remaining ellipsis instances now.

If they turn out to cause a specific problem, we can take action then.

funderburkjim commented 8 years ago

@gasyoun

From an email communication with Thomas, I got the impression that the number of postings is a problem.

Is there a way in GitHub for Thomas to find all the messages containing an @thomasincambodia mention directed to him?

gasyoun commented 8 years ago

"From an email communication with Thomas, I got the impression that the number of postings is a problem." Should we unmake him an admin? I have emails turned off. I check issues myself.

"Is there a way in GitHub for Thomas to find all the messages containing an @thomasincambodia mention directed to him?" Not that I'm aware of.

funderburkjim commented 8 years ago

After some fiddling, this global GitHub search looks promising:

https://github.com/search?q=thomasincambodia+&type=Issues&utf8=%E2%9C%93

This slightly simpler one seems to do the same thing:

https://github.com/search?q=thomasincambodia+&type=Issues
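As a possible refinement, GitHub's issue search also accepts a `mentions:` qualifier, and an `org:` qualifier to restrict the search to one organization. The sketch below only builds such a search URL; the helper name `mention_search_url` is hypothetical, and whether the results match what Thomas needs is untested here.

```python
from urllib.parse import urlencode


def mention_search_url(username, org=None):
    """Build a GitHub issue-search URL for issues mentioning `username`.

    `mentions:` and `org:` are GitHub search qualifiers; restricting
    by organization is optional.
    """
    terms = [f"mentions:{username}"]
    if org:
        terms.append(f"org:{org}")
    query = urlencode({"q": " ".join(terms), "type": "Issues"})
    return "https://github.com/search?" + query


print(mention_search_url("thomasincambodia", org="sanskrit-lexicon"))
```

Unlike a plain text search for the name, the `mentions:` qualifier is meant to match only issues where the user is @-mentioned.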

gasyoun commented 8 years ago

Indeed, please advise Thomas. Let him turn email notifications off in this case, as I have.

drdhaval2785 commented 8 years ago

So the ellipses are going to stay. I am closing this issue. I will ignore the ellipses when they interfere with my work.