sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Integrating printchange data in displays #250

Open funderburkjim opened 5 years ago

funderburkjim commented 5 years ago

This issue is devoted to preliminary discussions starting from comments here.

print changes

The digitizations of the various Cologne dictionaries have undergone numerous corrections from the original digitization form. In most of these, the changes reflect an error in the original digitization (e.g., the typist originally typed 'abc' but the scanned image clearly shows 'def'; so we change the original digitization to 'def' thereby improving the accuracy of the digitization.) We informally call such changes 'typos'.

However, in some cases, the original digitization was 'X' and this is in agreement with the scanned image; but for one reason or another, it was decided that the printed text should be changed to 'Y'.

These 'print change' corrections are much rarer, and we have endeavored over the years to keep a record of them. They appear in the CORRECTIONS repository; for instance the record of such print changes for the MW dictionary is in file mw_printchange.txt.

funderburkjim commented 5 years ago

display of print changes

The current issue is for exploring in more detail how to integrate the print change records in the displays. Take the first example from mw_printchange.txt:

1. L=251911 (DONE)
  tIya should be wIya
  p. 1245,1  under 'sfgAl/a--vAwI'

image

The displays now show: image

The fact that the 'w' (slp1) is a print change from the 't' of the printed text is not visible to the user.

The suggestion is that there should be some trace of this print change correction visible within the display.

This issue is devoted to fleshing out this idea.

If the ideas become adequately specific, a separate repository may be created for the implementation details.

ghost commented 5 years ago

For the printchange corrections, I don't think the justification for the correction (i.e., the corresponding text from mw_printchange.txt) needs to be displayed at once. I think just a link to the page (as NietzscheSource.org does) would be enough. Also, it would be far simpler to implement.

If I understood correctly, Jim suggests editing mw.txt to make it read like this (where the tag names and format may be different, of course):

yada yada yada <OriginalWord>xxz</OriginalWord> <CorrectedWord>xyz</CorrectedWord> <CorrectionID>MWnnn</CorrectionID> yada yada

I suggest simply,

yada yada yada <OriginalWord>xxz</OriginalWord> <CorrectedWord>xyz</CorrectedWord> <CorrectionURL>.../mw_printchange.txt</CorrectionURL> yada yada

Or, if the PHP can pick up and insert the URL from elsewhere (some sort of environment variable?), just

yada yada yada <OriginalWord>xxz</OriginalWord> <CorrectedWord>xyz</CorrectedWord> yada yada
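A minimal sketch of how the display side might handle the last two variants, assuming the hypothetical tag names above appear as an adjacent pair. The function name `render_printchange`, the CSS class `printchange`, and the placeholder URL are illustrative assumptions, not part of the existing Cologne code (which is PHP); Python is used here only to show the idea.

```python
import html
import re

# Hypothetical tag names taken from the suggestion above; the real markup may differ.
PRINTCHANGE_RE = re.compile(
    r"<OriginalWord>(.*?)</OriginalWord>\s*<CorrectedWord>(.*?)</CorrectedWord>"
)

# Placeholder only: the actual location of mw_printchange.txt is not given here.
CORRECTION_URL = "mw_printchange.txt"


def render_printchange(text, url=CORRECTION_URL):
    """Show the corrected word as a link; keep the printed form in a tooltip."""

    def repl(match):
        original = html.escape(match.group(1), quote=True)
        corrected = html.escape(match.group(2), quote=True)
        href = html.escape(url, quote=True)
        return (
            f'<a class="printchange" href="{href}" '
            f'title="print has: {original}">{corrected}</a>'
        )

    # Text outside the tag pair is passed through unchanged.
    return PRINTCHANGE_RE.sub(repl, text)


print(render_printchange(
    "yada <OriginalWord>xxz</OriginalWord> <CorrectedWord>xyz</CorrectedWord> yada"
))
```

With this approach the URL lives in one place on the display side, so the dictionary text itself only needs the two word tags.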

Benefits of the latter two: (i) As I said, displaying the corrections right away is not really needed. In most cases, the justification for the correction would be obvious. And even when it isn't obvious, the visitors won't really wish to know the justification. Anyone who does wish to know more can visit the mw_printchange.txt page itself and search there for the term or the ID. All that visitors like @Sonnetag (and me!) would like is that the term be clearly marked.

(ii) The mw_printchange.txt file clearly belongs to the backend. There are abbreviations, the names of people who suggested the change, etc. And you don't wish to add markup (for one, to convert SLP1 to Devanagari), do you?

Besides, in a unified page, the other entries provide a lot of context.

When one visits the printchange.txt file, it is obvious that it belongs to the backend, and so it doesn't need to be pretty, and every entry need not make sense by itself.

(iii) As of now, the mw_printchange.txt page is essentially free-form text (though there is some order). If you want it to be automatically parsed by a script to convert it into a sqlite database, a lot of changes would be needed, and you'd have to maintain the order when editing it in the future.

That would complicate Jim's life! With my suggestion above, every time a new correction is to be made, all that would be needed is (i) adding a few tags and one word (i.e., the new word) in mw.txt, and (ii) documenting the change in mw_printchange.txt file.

Adding the tags and the word is easy. The new word would be essentially the same as the old, so you don't need great care in making sure you are making no mistake. (There is little chance of making a mistake.)

Adding the CorrectionID markup to mw.txt would be a different matter altogether. (Especially for the first time, when a few hundred CorrectionIDs would have to be added by hand.)
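To make the parsing concern concrete, here is a rough sketch of what a script would have to assume about mw_printchange.txt. The regular expression only covers the three-line shape of the one record quoted earlier in this thread; whether other records follow that shape is not known here, which is exactly the fragility described above. The names `RECORD_RE` and `parse_records` are hypothetical.

```python
import re

# Matches only the three-line record shape quoted in this thread:
#   1. L=251911 (DONE)
#     tIya should be wIya
#     p. 1245,1  under 'sfgAl/a--vAwI'
# Records written in any other way would be silently skipped.
RECORD_RE = re.compile(
    r"^(?P<num>\d+)\.\s+L=(?P<L>[\d.]+)\s+\((?P<status>[^)]*)\)\s*\n"
    r"\s+(?P<old>\S+) should be (?P<new>\S+)\s*\n"
    r"\s+p\.\s*(?P<page>[\d,]+)\s+under '(?P<headword>[^']*)'",
    re.MULTILINE,
)


def parse_records(text):
    """Return one dict per record that fits the assumed shape."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]


sample = """1. L=251911 (DONE)
  tIya should be wIya
  p. 1245,1  under 'sfgAl/a--vAwI'
"""

for record in parse_records(sample):
    print(record["L"], record["old"], "->", record["new"])
```

Any entry that adds a free-form remark or reorders the lines would need either a looser pattern or manual cleanup of the file.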

gasyoun commented 5 years ago

> numerous corrections from the original digitization form

Even the original was not 1 to 1.

> Jim suggests editing mw.txt

No, he does not. It's only about display, so web related, .txt is left unchanged.

> displaying the corrections right away is not really needed

No. If I search, it will be missed, if incorrect. It is not only for the eyes.

> (for one, to convert SLP1 to Devanagari), do you?

Yes, it is as it is.

> in a unified page, the other entries provide a lot of context.

With that I must agree. A pity we still can't have a page mode, where all the other entries on the same page can be seen at once. That is the reason most people still open the scanned page: to see the other words on the same page.

> With my suggestion above, every time a new correction is to be made, all that would be needed is (i) adding a few tags and one word (i.e., the new word) in mw.txt, and (ii) documenting the change in mw_printchange.txt file.

Not sure I understand how it is different from what we have now. mw.txt is never changed; only additional files, or lines in them, are made.

> few hundred CorrectionIDs would have to be added by hand

If Jim would do such monkey-work, he would do no work at all.

drdhaval2785 commented 5 years ago

https://github.com/dsindex/blog/wiki/%5Bpython%5D-difflib,-show-differences-between-two-strings seems like a small script for what we want. I don't feel that it would warrant a separate repository.
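A minimal sketch of the difflib idea, assuming the goal is to locate which characters differ between the printed form and the corrected form so that only those can be highlighted. The helper name `changed_spans` is hypothetical; `difflib.SequenceMatcher` and `get_opcodes` are from the Python standard library.

```python
import difflib


def changed_spans(old, new):
    """Return (tag, old_part, new_part) for each span where the strings differ."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The first example from mw_printchange.txt: printed 'tIya', corrected 'wIya' (SLP1).
print(changed_spans("tIya", "wIya"))  # [('replace', 't', 'w')]
```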

ghost commented 5 years ago

What I understood from Jim's comments (which may be wrong):

(i) mw.txt changed to add some tags.

(ii) mw_printchange.txt changed to give each entry its own CorrectionID.

(iii) A sqlite database generated from mw_printchange.txt, which would just have two fields: a CorrectionID, and a text blob.

(iv) The PHP engine picking up the processed mw.txt file (mw.sqlite), and converting the fields and tags into HTML, getting the required text blob from printchange.sqlite database, and adding CSS and JavaScript as required.

My above comment was about why (ii) and (iii) are not needed.
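For reference, the two-field database in step (iii) could look roughly like the sketch below. The table name `printchange`, the column names, and the sample ID `MW1` are assumptions for illustration; the thread only specifies "a CorrectionID, and a text blob".

```python
import sqlite3


def build_printchange_db(records, path=":memory:"):
    """Create a two-column table: correction ID and free-form note text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS printchange "
        "(correction_id TEXT PRIMARY KEY, note TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO printchange VALUES (?, ?)", records
    )
    conn.commit()
    return conn


def lookup(conn, correction_id):
    """Return the note for one correction ID, or None if it is unknown."""
    row = conn.execute(
        "SELECT note FROM printchange WHERE correction_id = ?",
        (correction_id,),
    ).fetchone()
    return row[0] if row else None


# 'MW1' is a made-up ID in the style of the MWnnn placeholder used earlier.
conn = build_printchange_db([("MW1", "tIya should be wIya")])
print(lookup(conn, "MW1"))
```

The database itself is small; the cost is in step (ii), since every entry in mw_printchange.txt would first need a stable ID.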