sanskrit-lexicon / csl-devanagari

Convert SLP1 data from csl-orig into Devanagari for easy proofreading.

BUR Study #37

Open Andhrabharati opened 2 years ago

Andhrabharati commented 2 years ago

@drdhaval2785, @funderburkjim

Just thought of filling up the Greek strings in BUR, and had a quick look at the file & book contents.

  1. There are almost 15000 <P> entries which either do not "appear" in the CDSL online searching as of now, or are part of the prev. <L> entry, though present in the text file. Seems most of these (if not all) have to be "promoted" to <L> status, being alternate HWs or derived HWs etc. wrt the prev. entry. I would suggest marking them all with <L>xxx.n numbering, as separate entries.

  2. There are two good lists of Anubandhas (5pp.) & Dhatus (15pp.) in the book, after p.759 (where the text file ended), which could also be digitized and added to the search. I do not know if this is already done and lying somewhere "inaccessible". (I could not see them even in bur_orig.txt.)

Andhrabharati commented 2 years ago

  3. There are almost 9900 <lbinfo> tags, which are used to mark hyphenated words at the line-crossovers. But as the book "lines" are NOT maintained in the text file, these have no use at all and could be simply removed.

Andhrabharati commented 2 years ago

  4. The punctuation marks (. , ; : ?), before the %} mostly need to be kept after the %}.
  5. The {%(...)%} and {%[...]%} markings to be changed to ({%...%}) and [{%...%}].
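
The global clean-ups proposed above (dropping the <lbinfo> tags, moving punctuation out of the italic spans, moving brackets out of the italic spans) could be prototyped along the following lines. This is only a sketch: the function names are hypothetical, the <lbinfo> tags are assumed to be self-closing (`<lbinfo .../>`), which this thread does not confirm, and `{%...%}` spans are assumed not to nest. Since the punctuation move applies only "mostly", the result would be a set of candidate changes to review, not a final text.

```python
import re

# Assumption: <lbinfo> tags are self-closing, e.g. <lbinfo n="..."/>.
LBINFO = re.compile(r"<lbinfo\b[^>]*/>")
# {%(...)%} and {%[...]%}: brackets sitting just inside an italic span.
PAREN_SPAN = re.compile(r"\{%\(([^%]*?)\)%\}")
BRACKET_SPAN = re.compile(r"\{%\[([^%]*?)\]%\}")
# One or more of . , ; : ? immediately before the closing %}.
TRAILING_PUNCT = re.compile(r"([.,;:?]+)%\}")


def strip_lbinfo(line):
    """Remove every <lbinfo .../> tag; the surrounding text is untouched."""
    return LBINFO.sub("", line)


def move_brackets_out(line):
    """Rewrite {%(...)%} as ({%...%}) and {%[...]%} as [{%...%}]."""
    line = PAREN_SPAN.sub(r"({%\1%})", line)
    return BRACKET_SPAN.sub(r"[{%\1%}]", line)


def move_punctuation_out(line):
    """Move punctuation that precedes %} to just after it."""
    return TRAILING_PUNCT.sub(r"%}\1", line)
```

Each function works on one line of text, so the three steps can be run and diffed separately before anything is committed.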
gasyoun commented 2 years ago

could also be digitized and added to the search.

Here is my scan: https://vk.com/samskrtamru?w=wall-88831040_13310


as the book "lines" are NOT maintained in the text file

Sounds like a pity.

The punctuation marks (. , ; : ?), before the %} mostly need to be kept after the %}.

Guess is something you can do yourself @Andhrabharati with the pull github function?

Andhrabharati commented 2 years ago

The punctuation marks (. , ; : ?), before the %} mostly need to be kept after the %}.

Guess is something you can do yourself @Andhrabharati with the pull github function?

Just fyi, I have been doing all such stuff myself, and (unfortunately!) I do much more than what cologne team can 'accept' (when I feel it leads to a better 'presentation' of the text, I leave no stone unturned).

funderburkjim commented 2 years ago

@Andhrabharati In the current Cologne system, a given dictionary xxx exists in three related forms:

There is also a stardict dictionary form created in https://github.com/sanskrit-lexicon/cologne-stardict repository (which Dhaval maintains completely).

So, if you create a 'better' form of some xxx.txt, then that may be incompatible with the make_xml.py or with the php display code. I think you find this incompatibility frustrating.

On the other hand, many changes can be made to xxx.txt that ARE compatible with the Cologne system.

In regard to your particular suggestions re Burnouf dictionary, I suggest you fill in the Greek text in csl-orig/v02/bur/bur.txt. This is a kind of change which should cause no compatibility problems.

Once this is done, let's discuss further the idea of 'promoting' the <P> subheadwords to full status.

funderburkjim commented 2 years ago

@Andhrabharati Just realized you will likely be starting with bur.txt as it exists in this csl-devanagari repository.

@drdhaval2785 Suppose AB adds greek text to this devanagari version of bur.txt. Do you have a script that generates the slp1 version from the devanagari version? And if so, have you checked invertibility?

drdhaval2785 commented 2 years ago

@funderburkjim

Invertibility is taken care of. See the script redo.sh

echo "Convert to Devanagari."
mkdir -p ../v02/$1
python3 to_devanagari.py $1
echo "Convert back to SLP1."
python3 to_slp1.py $1
echo "Store differences in ../diff/$1.txt."
diff ../slp1/$1.txt ../../csl-orig/v02/$1/$1.txt > ../diff/$1.txt
echo "Complete."

  1. Convert to Devanagari
  2. Convert back to SLP1 and store in SLP1 folder (untracked by github, as it is supposed to be identical with csl-orig data).
  3. Compare the data in SLP1 folder with csl-orig data.
  4. If there is any difference, store it in diff folder.

Once the script is run, manually check that every file in the diff folder is 0 bytes, i.e. that there is no difference. This way invertibility is ensured.
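
The manual check of the diff folder could also be automated. The sketch below is not part of the repository; the function name is hypothetical, and it assumes the diff files are the `<dict>.txt` files that redo.sh writes into `../diff`.

```python
from pathlib import Path


def nonempty_diffs(diff_dir):
    """Return the names of diff files that are larger than 0 bytes."""
    return sorted(
        path.name
        for path in Path(diff_dir).glob("*.txt")
        if path.stat().st_size > 0
    )
```

Calling `nonempty_diffs("../diff")` after redo.sh and getting an empty list back would mean that every dictionary survived the SLP1 to Devanagari to SLP1 round trip unchanged; any name in the list points at a dictionary whose diff needs a look.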

When a change is made in csl-devanagari files

See carry_changes_to_cslorig.sh

dicts=(wil yat gst ben mw72 lan cae md mw shs ap90 mwe bor ae bur stc pwg gra pw ccs sch bop armh vcp skd inm vei pui bhs acc krm ieg snp pe pgn mci)
echo "STARTED TAKING CORRECTIONS FROM CSL-DEVANAGARI TO CSL-ORIG";
for dict in ${dicts[@]};
do
    echo $dict
    python3 to_slp1.py $dict
    cp ../slp1/$dict.txt ../../csl-orig/v02/$dict/$dict.txt
    echo "";
done

  1. Convert the changes to SLP1 transliteration and store in SLP1 folder of csl-devanagari repository.
  2. Copy the changes from SLP1 folder to csl-orig folder.
  3. See the git diff in csl-orig folder to ensure that it is as per corrections made in csl-devanagari, if need arises.
  4. add, commit and push changes in csl-orig repository.

Hope this takes care of your concerns about invertibility, Jim.

funderburkjim commented 2 years ago

Thanks for docs. Looks eminently usable. Will give it a trial run if AB uploads a version of devanagari bur.txt with Greek text.

drdhaval2785 commented 2 years ago

Dear @Andhrabharati Please update the csl-devanagari repository and use the latest file. It would minimize the differences.

Andhrabharati commented 2 years ago

latest file? is this repo being updated?

Andhrabharati commented 2 years ago

I think you find this incompatibility frustrating.

Not at all, I keep on doing what I feel better; it's just that CDSL is 'not willing' to 'accept' to undertake the changes, if they seem different to the 'style' adopted there-- having no scope for 'real improvements'.

Andhrabharati commented 2 years ago

Dear @Andhrabharati Please update the csl-devanagari repository and use the latest file. It would minimize the differences.

I could as well just use the latest (SLP1) file from csl-orig itself, if it is just filling the Greek stuff.

But that's too little a portion of the work; I point to my recent INM work in this context, wherein I did quite some changes, apart from filling the Greek stuff all in one go. (of course, it did not attract the FULL attention of Jim.)

Andhrabharati commented 2 years ago

could also be digitized and added to the search.

Here is my scan: https://vk.com/samskrtamru?w=wall-88831040_13310

These pages are even at csldoc, as 'Dictionary front matter'; a misnomer for these particular pages!!

drdhaval2785 commented 2 years ago

After many years of association with CDSL, I would like to paraphrase your viewpoint so that it correctly reflects the status of collective wisdom at CDSL.

CDSL is 'not willing' to adopt major changes which do not allow programmatic conversion between the current version and the suggested version.

Andhrabharati commented 2 years ago

I do understand the point well, @drdhaval2785.

What I fail to understand is-- while programs are being modified or even developed for small changes, why the same is NOT being done for major changes. It's just beyond my comprehension!

Anyway, let's not spend more time on this, but continue the efforts in bringing the texts to "correct form" first and fill the gaps (if any).

("Presentation" can be taken up by someone sometime, if it deserves!)

gasyoun commented 2 years ago

having no scope for 'real improvements'.

It it is not in book - we can't accept such and improvement. Even if we like it.

These pages are even at csldoc

My scan quality is higher.

why the same is NOT being done for major changes. It's just beyond my comprehension!

Are you ready to code it? Jim is busy with things only he can do. We do not have enough coders on board.

bringing the texts to "correct form" first and fill the gaps (if any).

Exactly, thanks.

Andhrabharati commented 2 years ago

"If" it is not in book - we can't accept such "an" improvement.

I can show innumerable instances contradicting this, that are already present in the CDSL texts! (But I do not want to drag the issue any further.)

My scan quality is higher.

Yes, noticed this. How many such others do you have?

Are you ready to code it?

Yes, I can; but I won't (at least for time-being)!

funderburkjim commented 2 years ago

I could as well just use the latest (SLP1) file from csl-orig itself, if it is just filling the Greek stuff.

Yes, that is so.

Andhrabharati commented 2 years ago

I am already halfway through my file, with many more changes already done.

And I presumed giving just the ref. line (<L> number or the <P> string whichever is applicable) and the greek strings in it would ease CDSL work.

funderburkjim commented 2 years ago

From my perspective, the best form would be a copy of bur.txt with all the Greek text filled in.

As a second choice, a file of changes to the lines of bur.txt. For example, the first Greek text appears on line 19 of bur.txt, so a file 'bur-change.txt' would have:

19 old <lang n="greek"></lang>; <ab>lat.</ab> {%in;%} <ab>germ.</ab> {%un.%}
19 new <lang n="greek">GREEK TEXT</lang>; <ab>lat.</ab> {%in;%} <ab>germ.</ab> {%un.%}

and a similar pair of 'old/new' lines for each of the other 667 lines with greek text.

As a third choice, a file of the lines changed. For example, the first Greek text appears on line 19 of bur.txt, so a file 'bur-greek.txt' would have as its first line

19 <lang n="greek">GREEK TEXT</lang>; <ab>lat.</ab> {%in;%} <ab>germ.</ab> {%un.%}

and similarly for the other 667 lines with greek text.
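
For illustration, a change file in the second ('old/new') form could be applied roughly as follows. This is a sketch under assumptions, not the script actually used at Cologne: the function name is hypothetical, line numbers are taken to be 1-based, and each 'old' line is assumed to hold the complete current text of that line.

```python
def apply_changes(lines, change_lines):
    """Apply 'N old TEXT' / 'N new TEXT' pairs to a list of lines.

    Raises ValueError if an 'old' text does not match the current line,
    or if an 'old' entry is never followed by its 'new' entry.
    """
    result = list(lines)
    pending = {}  # line number -> expected old text
    for raw in change_lines:
        raw = raw.rstrip("\n")
        if not raw:
            continue  # skip blank lines between pairs
        num, _, rest = raw.partition(" ")
        kind, _, text = rest.partition(" ")
        n = int(num)
        if kind == "old":
            pending[n] = text
        elif kind == "new":
            if n not in pending:
                raise ValueError(f"line {n}: 'new' without a preceding 'old'")
            if result[n - 1] != pending.pop(n):
                raise ValueError(f"line {n}: 'old' text does not match the file")
            result[n - 1] = text
        else:
            raise ValueError(f"unrecognized change line: {raw!r}")
    if pending:
        raise ValueError(f"'old' without 'new' for lines {sorted(pending)}")
    return result
```

Checking the 'old' text before writing the 'new' text is what makes this form safer than the third choice: a change file prepared against a stale copy of bur.txt fails loudly instead of overwriting the wrong line.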

Andhrabharati commented 2 years ago

My file has no line breaks now; all entries are in a single line.

But, I prefer making the second form (but slightly different):

19 old <lang n="greek"></lang>;
19 new <lang n="greek">GREEK TEXT</lang>;

[limiting only to the Greek portion and the resp. ending punctuation].

And a few of them would be followed by ; comment lines.

Would this suit you?

funderburkjim commented 2 years ago

What about the few lines where there is more than one <lang n="greek"></lang> ?

Andhrabharati commented 2 years ago

They would all be in the resp. line, unless a comment line mentions some merger (if any); otherwise all diff. strings would be present individually.

funderburkjim commented 2 years ago

Likely I can reliably convert your form to my second form.
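
One way such a conversion might work, sketched under assumptions: each fragment pair (the empty Greek tag plus its ending punctuation, and the filled-in replacement) is expanded to a full-line 'old/new' pair by substituting the first matching fragment in the line. The function name and the tuple layout are hypothetical; several pairs for the same line are applied in the order given, which covers lines with more than one <lang n="greek"></lang>.

```python
def expand_fragment_changes(lines, fragment_changes):
    """Turn (line_no, old_fragment, new_fragment) triples into
    (line_no, old_full_line, new_full_line) triples.

    line_no is 1-based. Fragments for the same line are applied in order,
    each replacing the first remaining occurrence of its old fragment.
    """
    updated = {}  # line number -> line text after the fragments seen so far
    for n, old, new in fragment_changes:
        text = updated.get(n, lines[n - 1])
        if old not in text:
            raise ValueError(f"line {n}: fragment {old!r} not found")
        updated[n] = text.replace(old, new, 1)
    return [(n, lines[n - 1], updated[n]) for n in sorted(updated)]
```

Where two empty tags on one line are followed by the same punctuation, the order of the pairs decides which tag gets which string, so those lines would still deserve a manual look.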

Andhrabharati commented 2 years ago

If you are interested, I can give the full etym. lines (all languages) as well, as many had undergone changes, like tagging or correcting.

But probably sticking to Greek alone in the first step is preferable.

funderburkjim commented 2 years ago

sticking to Greek alone in the first step

Agree

Andhrabharati commented 2 years ago

  6. There are some places where the <L> entry itself has a few other <L> candidates.

    For example, <L>4388 ({%kāyastha%} <ab>m.</ab>) has -- {%kāyasthā%} <ab>f.</ab> and -- {%kāyasthī%} <ab>f.</ab> inside.

Andhrabharati commented 2 years ago

The front pages matter (p.3) clearly mentioned the points 1 and 6.

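[6] La barre horizontale -- sépare les mots dans un même article. ... ... [1] Après un mot principal écrit en dêvanâgari, nous rangeons ceux de ses dérivés et de ses composés qui se trouveraient placés immédiatement après lui dans l'ordre alphabétique. Les autres dérivés ou composés, que cet ordre écarterait du voisinage immédiat du mot principal, sont rangés à leur place naturelle. De sorte que l'ordre alphabétique est partout suivi.

[In English: [6] The horizontal bar -- separates the words within a single article. ... ... [1] After a main word written in Devanagari, we place those of its derivatives and compounds that would fall immediately after it in alphabetical order. The other derivatives or compounds, which that order would move away from the immediate vicinity of the main word, are placed in their natural position. Thus alphabetical order is followed throughout.]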

This indicates that making the digital text of all the dictionaries' "Front matters" (with Google OCRing) and probably translating into English (with DeepL) would be beneficial to understand the dictionaries well, and plan to work on them properly.

Any takers for this simple task from your 'new team', @gasyoun?

gasyoun commented 2 years ago

I can show innumerable instances contradicting this, that are already present in the CDSL texts!

Indeed there are. I guess it would be a good idea to document them as we know them.

How many such others do you have?

Not sure, not all volumes required, but will show in 2022 what I have.

I can give the full etym. lines

Would love to see them myself.

Any takers for this simple task from your 'new team', @gasyoun?

Can you document the steps for them to be done, please? One by one.

Andhrabharati commented 2 years ago

Any takers for this simple task from your 'new team', @gasyoun?

Can you document the steps for them to be done, please? One by one.

Hope @drdhaval2785 or @funderburkjim would be willing to give the steps.

Andhrabharati commented 2 years ago

Just recalled that you also worked with Abbyy OCR, @gasyoun.

So probably you yourself could get the first step done, by explaining to the team.

Once a quick proofing for obvious errors in the OCRed text is done, translation (as and when required) could be taken up.

funderburkjim commented 2 years ago

@Andhrabharati Regarding Burnouf Front matter. Are you aware of https://www.sanskrit-lexicon.uni-koeln.de/scans/csldev/csldoc/build/dictionaries/prefaces/burpref.html ?

funderburkjim commented 2 years ago

@Andhrabharati Is your main point regarding Burnouf Front matter to make an English Translation of the front matter?

Andhrabharati commented 2 years ago

@Andhrabharati Regarding Burnouf Front matter. Are you aware of https://www.sanskrit-lexicon.uni-koeln.de/scans/csldev/csldoc/build/dictionaries/prefaces/burpref.html ?

Yes, I am. In fact, I had already commented previously that even the "end matter" is lying here under the header of "front matter"!

but these are just the images; and I am talking about searchable digital text.

Andhrabharati commented 2 years ago

@Andhrabharati Is your main point regarding Burnouf Front matter to make an English Translation of the front matter?

not really, my main intention is to have a digital text first.

of course, having english text suits some people-- but there would be many people who might like to have the native language text as is.

gasyoun commented 2 years ago

Just recalled that you also worked with Abbyy OCR, @gasyoun.

Yes, since 2002.

https://www.youtube.com/watch?v=oXH65ISgZRo and https://www.youtube.com/c/MarcisGasuns/search?query=abbyy

of course, having english text suits some people-- but there would be many people who might like to have the native language text as is.

Agree

Andhrabharati commented 2 years ago

@funderburkjim

I finished filling in the Greek strings in BUR a few days back and am just waiting for you to be free from the MBh. linking task.

Here are the lines (wrt csl-orig file) as we discussed earlier (above), and I hope you won't face many issues in using this data.

BUR greek string lines (csl-org) filled.txt

I just like to suggest that you handle the ; commented ones first.

Andhrabharati commented 2 years ago

There are NO ls candidates in BUR, but quite many abbr candidates are there.

Here is the list that covers most of them. BUR abbr. list.txt [The count is more than double the existing CDSL file markings.]

And here are the language abbr. items that could be tagged first, and expanded. BUR language tags.txt

Andhrabharati commented 2 years ago

As I doubt you would be interested in making any further changes, I am not posting my full observations, but only giving some global corrections (just in case you would like to apply them) below- BUR corrections.txt

Andhrabharati commented 2 years ago

One final comment before I move on to some other work-

There are quite many grouped entries in this work as well (marked with et, au, ',' or otherwise), and these could be handled as done in MW. [I had earlier suggested doing the same in few other works also, but nothing has happened in that front so far.]

funderburkjim commented 2 years ago

@Andhrabharati From first look at your greek text lines, the form should be readily useable. Will let you know when this is incorporated into bur.txt.

Andhrabharati commented 2 years ago

Here is my full BUR file, for whatever use/worth it has to the cdsl team-- bur (AB ver.) -v2.txt

gasyoun commented 2 years ago

full BUR file

@funderburkjim did you have a chance to take a look at it since then?