Closed funderburkjim closed 3 years ago
Recently I received an email from Felix Rau, who is interested in helping with our work on the Cologne Sanskrit Lexicon. With his permission, I am inserting the body of that email here.
I’m writing to you, because I wanted to contribute some of my time to the Cologne Digital Sanskrit
Lexicon. However, the Cologne Digital Sanskrit Lexicon is so active on Github that it is difficult
to see what would be most helpful to you. Any pointers to where I can contribute something
constructive are more than welcome.
(My relevant skills are: XML, Sanskrit, basic XSLT, more read than write Python.)
Just to give you some context: I’m working in the linguistics department at the University of Cologne.
I’m a linguist and Indologist by training, but despite having studied Sanskrit, it is more of a
hobby of mine. Back then, I studied Tamil under Thomas Malten and worked with him in Orissa.
Now, I work on Munda and Austroasiatic languages and on archiving audio-visual data from
language documentation.
Because of my background, I have been peripherally involved in the Lazarus Projekt of the CCeH.
Basically, I gave the CCeH guys some feedback and in particular looked into a few Sanskrit-specific
things for them. The actual work is over a year old, but I recently formatted it into a sort of
report so it wouldn’t go to waste, well at least not without a digital trace.
You can find the two texts here:
https://github.com/fxru/vedic-accent-in-lexicography
https://github.com/fxru/Sanskrit-lf-x-report
I would love to contribute and if you could point me to the parts where I can be the most help,
I would be happy.
I mentioned this dtd review task as a concrete way to get started (since Felix is comfortable with xml). And he's begun the review of the various dtds.
He's a native of Germany, so can help us when we have issues involving German language.
I also added Felix as a member of the sanskrit-lexicon Github organization.
Welcome, Felix!
Felix, Sie sind willkommen! @funderburkjim what's his nickname? If he has XML expertise then DTD is great for him. I have a long love story with XML, but there are no children from that love, so let Felix do what he can do best. Sanskrit NLP will want all of his spare time and even more!
Welcome Felix. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/181 Here we noted some things which we thought important about a year ago. Not much has progressed in these issues since then. Maybe you can pick and choose something out of it. Best Wishes, Dr. Dhaval Patel
Thanks, धन्यवाद, and спаси́бо you all!
As Jim already said, I started looking into the different DTDs. I made it through all but mw.dtd so far, but I only have a superficial understanding of the issues. As a first impression, I can report that no two DTDs are identical in regard to the inventory of elements (or their definition).
Some things seem rather obvious (for now). For example, 12 dictionaries use <br/> to indicate line breaks, while 18 use <lb/> (while 4 seem to use neither). Right now, I can’t see any good reason why there are two different elements encoding line breaks.
Another example is that skd and vcp use <C> and a number attribute to mark columns, while pe, inm, and vei use different elements – <C1>, <C2>, <C3> ... – to achieve the same.
I will report about the comparison of the 35 DTDs here, but add additional issues in particular repositories, when I stumble upon lexicon specific issues, so that these things are documented somewhere. (For starters, I opened one issue in pwg: https://github.com/sanskrit-lexicon/PWG/issues/21 – if you think there is a better way to deal with things I come across, let me know.)
Re <lb/> and <br/>: Probably no good reason to have both. There were two places where rather arbitrary choices crept in. First, in the digitizations that Thomas did, and second in the xml conversions that I did. Both were done with no specific eye to uniformity across dictionaries.
I hope that when these arbitrary choices are resolved, there will be a much smaller number of xml forms that will handle all the cases. I am imagining some kind of template-based framework into which all the dtds will fit, with an allied framework to generate the programs that create and display the xml forms from the .txt digitizations. My current candidate for a templating program to use in this is the Python mako templating system. But so far, this is thoughtware, not software.
When you document the instances involving multiple dictionaries, I suggest you do so in a separate issue in this repository. Also, for dictionary-specific issues, you can follow your pwg model if such a repository exists for the dictionary; if no such repository exists, we'll have to decide if a new one should be opened.
धन्यवाद might want to have a singular or plural ending, be it धन्यवादः!
report that no two DTDs are identical in regard to the inventory of elements
It's good that at least they are UTF8 now, so you can easily compare. Just two years ago nobody but Jim even cared about the DTDs, because one could not even download a lot (only MW was there). But thanks to GitHub things have changed.
Let's replace it.
Another example is that skd and vcp use <C> and a number attribute to mark columns, while pe, inm, and vei use different elements – <C1>, <C2>, <C3> ... – to achieve the same.
What do you think would work best?
template-based framework into which all the dtds will fit
Yeah, 35 DTDs is 34 too many, agree.
I surveyed the different DTDs (minus mw for now), I tried to document it and put it here:
https://github.com/fxru/CDSL-DTD-comparison/blob/master/comparison_CDSL_DTDs.csv
I will keep updating it, but now you can take a look for yourself.
<br/> vs. <lb/>
The two are defined in the following DTDs:
br: acc, ap, bhs, bor, bur, inm, pd, pe, pui, pwg, vei, wil
lb: ae, ap, ap90, ben, bop, gst, ieg, krm, mci, md, mw72, mwe, pgn, shs, skd, snp, vcp, yat
However, the element might not occur in the dictionaries themselves. (E.g. no occurrence of <br/> in bur.)
The crucial dictionary is ap.xml, because ap.dtd defines br as well as lb. Since ap.xml isn’t available online, I couldn’t check whether both elements actually occur. The question here is: Do both occur in ap? If yes, do both simply encode line breaks or do they encode different things?
If the situation in ap is not an obstacle, it would make sense to decide on either br or lb and homogenize the 29 DTDs for line breaks.
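The counts behind these two lists can be double-checked with a few lines of Python. This is only a sketch over the dictionary codes listed above; the variable names are illustrative.

```python
# Dictionary codes whose DTD defines <br/> and <lb/>, copied from the lists above.
br = set("acc ap bhs bor bur inm pd pe pui pwg vei wil".split())
lb = set("ae ap ap90 ben bop gst ieg krm mci md mw72 mwe pgn shs skd snp vcp yat".split())

both = br & lb      # DTDs that define both elements
either = br | lb    # DTDs that define at least one of them

print(len(br), len(lb))   # 12 18
print(sorted(both))       # ['ap']
print(len(either))        # 29
```

The single overlap is ap, which is why ap.xml decides whether the two elements can be merged.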
pic vs. Picture
pic occurs in ben and wil, while Picture occurs in vcp.
ben: 1 occurrence in the entry vatsa, empty tag, references a png file (<pic name='vatsa.png'/>)
wil: 1 occurrence in the entry svastika, non-empty (<pic>svastica</pic>). This occurrence could actually be replaced by a unicode character, e.g. 卐 U+5350 wàn (Chinese block) or ࿕ U+0FD5 RIGHT-FACING SVASTI SIGN (Tibetan block)
vcp: 71 occurrences in 4 entries, all are empty (<Picture/>) and don’t provide any further information.
C1 vs. C+@n
The case of C1 up to C11 in inm, pe, and vei vs. C plus a number attribute @n in skd and vcp is slightly more complex, as inm also has C2H and C3H.
However, the bigger problem is that those actually encode tables and are some of the most complex parts I have seen in the dictionaries so far.
A vs. Arabic
A in pw and wil as well as Arabic in mw, all marking Arabic text spans, seem to be another candidate for harmonization of the different DTDs, but I haven’t looked into it further.
The other low-hanging fruit are elements defined in the DTDs but not used in the XML files. I would go hunting for these elements. I saw that Jim already removed UL from pwg. There seem to be several more instances of unused elements around.
This is helpful.
I found your big dtd table hard to read, mainly because the rows are so long.
I wonder if you could extract from your big table another table.
It might be sort of like a transpose (matrix transpose) of the given big table.
The columns would be dictionary codes.
The rows would be tag names (perhaps with some brief annotation for short names like 'A',
indicating the purpose, if known, such as 'Arabic').
The entry would be some binary indicator (Y/N, YES/NO, +/-).
There might be an extra 'count' column, right after the tag name, which would be the total number of dictionaries where the tag is found.
This might further help us see the whole picture of the dtds.
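A minimal sketch of the transposed layout, assuming the survey is available as a mapping from dictionary code to the set of tags its DTD defines. The sample data is a small subset based on elements mentioned in this thread, not the full survey, and the function name `transpose` is hypothetical.

```python
def transpose(tags_by_dict):
    """Turn {dictionary: {tags}} into rows of [tag, count, Y/N per dictionary]."""
    dicts = sorted(tags_by_dict)
    all_tags = sorted(set().union(*tags_by_dict.values()))
    rows = []
    for tag in all_tags:
        flags = ["Y" if tag in tags_by_dict[d] else "N" for d in dicts]
        # The count column comes right after the tag name.
        rows.append([tag, str(flags.count("Y"))] + flags)
    return ["tag", "count"] + dicts, rows

# Illustrative subset only, based on the elements mentioned in this thread.
sample = {
    "ap": {"br", "lb"},
    "ben": {"lb", "pic"},
    "vcp": {"lb", "Picture"},
    "wil": {"br", "pic"},
}

header, rows = transpose(sample)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

Writing the rows tab-separated keeps each line short, since there is one column per dictionary rather than one per tag.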
I like the idea of dealing with the 'low-hanging fruit'.
Removing unused tags seems like the lowest-hanging fruit. Do you have any tool that reads an xml file and produces some kind of tag analysis? This would help in identifying the unused tags (i.e., the tags that appear in the dtd for a dictionary, but that do not occur within the xml file for the dictionary).
Seems like there might be an off-the-shelf tool for this, but if not, a simple form of such a tool probably could be written using the lxml Python library.
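A simple form of such a tool might look like the sketch below. It uses the standard-library `xml.etree.ElementTree` instead of lxml so that it runs without extra installs, and it finds declared elements with a regular expression over `<!ELEMENT ...>` declarations rather than a real DTD parser. The function names and the tiny in-memory sample are assumptions for illustration; real input would be a dictionary's .dtd and .xml files.

```python
import io
import re
import xml.etree.ElementTree as ET

def declared_tags(dtd_text):
    """Element names declared in a DTD (regex scan, not a full DTD parser)."""
    return set(re.findall(r"<!ELEMENT\s+([^\s>]+)", dtd_text))

def used_tags(xml_source):
    """Element names that actually occur in an XML file or file-like object."""
    seen = set()
    for _event, elem in ET.iterparse(xml_source, events=("start",)):
        seen.add(elem.tag)
    return seen

def unused_tags(dtd_text, xml_source):
    """Tags declared in the DTD that never occur in the XML."""
    return declared_tags(dtd_text) - used_tags(xml_source)

# Tiny made-up sample standing in for a real dictionary's files.
sample_dtd = """
<!ELEMENT root (H1)*>
<!ELEMENT H1 (#PCDATA | lb | UL)*>
<!ELEMENT lb EMPTY>
<!ELEMENT UL (#PCDATA)>
"""
sample_xml = io.BytesIO(b"<root><H1>text<lb/>more</H1></root>")

print(sorted(unused_tags(sample_dtd, sample_xml)))  # ['UL']
```

For a real dictionary, `xml_source` could be a file path, since `iterparse` accepts a filename as well as a file object.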
This occurrence could actually be replaced by a unicode character, e.g. 卐
Agree.
But as
ben: 1 occurrence in the entry vatsa, empty tag, references a png file (<pic name='vatsa.png'/>)
has a graphic picture, the swastika is not the only one, so I would have both as pictures.
Since ap.xml isn’t available online, I couldn’t check whether both elements actually occur.
Do you use skype? I'm gasyoun there. I can send you the link.
A in pw and wil as well as Arabic in mw
I would go for Arabic, because there are too many possibilities to read what A is.
The other low hanging fruit are elements defined in the DTDs, but not used in the XML files. I would go hunting for these elements.
Fully agree.
There might be an extra 'count' column, right after the tag name, which would be the total number of dictionaries where the tag is found.
If ever made, count should be there, agree.
Regarding the display suggestion made above:
one line per tag element. Instead of the 'binary indicator', make a list of the dictionary codes that contain that tag element. So the suggested format might look like:
key1:ap,ap90,<AND ALL THE REST>,wil,yat
br: acc, ap, bhs, bor, bur, inm, pd, pe, pui, pwg, vei, wil
etc.
Such a format would be easier for programs to work with than the 'binary indicator' idea. This tag-occurrence file would be derived from the DTDs.
A program can compare the two tag-dictionary files (one derived from the DTDs, one from the XML files), and find any cases where there is a tag in the DTD that does not occur in the corresponding XML. Such DTDs can be altered first.
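A sketch of that comparison, assuming both files use the one-line-per-tag format shown above (`tag: code, code, ...`). The function names and the sample contents are hypothetical. Only the bur entry mirrors an observation from this thread (<br/> is declared for bur but does not occur in it); the rest is invented for the example.

```python
def parse_tag_file(text):
    """Parse lines like 'br: acc, ap, bhs' into {tag: {dictionary codes}}."""
    table = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        tag, codes = line.split(":", 1)
        table[tag.strip()] = {c.strip() for c in codes.split(",") if c.strip()}
    return table

def declared_but_unused(dtd_table, xml_table):
    """For each tag, the dictionaries whose DTD declares it but whose XML lacks it."""
    result = {}
    for tag, dicts in dtd_table.items():
        missing = dicts - xml_table.get(tag, set())
        if missing:
            result[tag] = sorted(missing)
    return result

# Made-up sample contents for the two files.
from_dtds = "br: bur, pwg\nlb: mw72, skd"
from_xml = "br: pwg\nlb: mw72, skd"

print(declared_but_unused(parse_tag_file(from_dtds), parse_tag_file(from_xml)))
# {'br': ['bur']}
```

The result lists, per tag, the DTDs that are candidates for having that declaration removed.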
Re the svastika character. Agree it should be replaced by one of the Unicode characters. We might still want to keep some tag whose text would be this character.
We don't have a picture, and it would be awkward to introduce pictures, I think. So let's stick with unicode characters.
Probably the name of the tag should not be <pic>. There are some special characters (metric long-shorts) that have been used in some dictionaries (can't remember which at the moment), and I think these are currently untagged. Maybe some tag name could identify these as well as the svastika. Perhaps this could be something like a <special> tag, maybe with an attribute indicating the type of special character, such as <special type="svastika"> or <special type="meter">. Another tag name possibility would be <symbol>. Maybe there is some TEI-approved tag name for such situations?
Maybe there is some TEI-approved tag name for such situations?
http://cikitsa.blogspot.ru/ might know.
Thanks Marcis, I got the xml from Jim.
Just some short remarks to the pic/Picture elements.
I think the case in wil is special, because it is an inline character in the original.
In TEI, the appropriate element should be c (character, http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-c.html). So, that would be <c n="swastika">࿕</c>
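The replacement in wil could be done mechanically. The following is a rough sketch using the standard-library `xml.etree.ElementTree`; the function name `pic_to_c` and the surrounding entry fragment are made up, and only the <pic>svastica</pic> element and the proposed <c n="swastika"> form come from this thread. Empty <pic> elements with attributes, as in ben, are deliberately left alone.

```python
import xml.etree.ElementTree as ET

SVASTI_SIGN = "\u0fd5"  # U+0FD5 RIGHT-FACING SVASTI SIGN

def pic_to_c(entry):
    """Rewrite non-empty, attribute-free <pic> elements in place as <c n="swastika">."""
    for elem in list(entry.iter("pic")):
        if elem.text and not elem.attrib:
            elem.tag = "c"
            elem.set("n", "swastika")
            elem.text = SVASTI_SIGN
    return entry

# Hypothetical entry fragment; only the <pic> element is taken from wil.
entry = ET.fromstring("<body>text before <pic>svastica</pic> text after</body>")
print(ET.tostring(pic_to_c(entry), encoding="unicode"))
```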
The <Picture/> elements in vcp are stand-ins for figures.
There is no good way to represent this in xml, except by embedding pictures (or drawing svgs or something along those lines).
In ben the case is more similar to wil. The picture is more or less an inline character, but there is no unicode code point for this symbol (as far as I know). So the solution that is currently in place (referencing a png file via <pic name='vatsa.png'/>) seems appropriate.
(As for the other issues, I will try to follow Jim’s suggestions about the DTD table and the unused elements as soon as possible, but I’m heading to China on Saturday to teach language documentation. I don’t know whether I can get it done before.)
drawing svgs
Oh that would be a good solution. Or Corel trace.
no unicode code point for this symbol (as far as I know).
Indeed none.
I don’t know whether I can get it done before
Work stuck?
Only once this is done can #117 and #98 take off fully. Please give us your XML and DTD skills.
Let us have only one DTD please.
Now that the meta-iast conversion work is winding down, we can fruitfully revisit the one-dtd topic.
As a first contribution, here is a gist summary of the xml tags that occur in the various dictionaries.
This gist has two files:
<hwtype n="X">
<hwtype ref="Y">
In this example the tags are probably like <hwtype n="X" ref="Y">
Things to look for are where we can revise some tag choices.
For instance, <P/> occurs in only ap90 and ben, and might be replaced by the <div> tag with an appropriate attribute, maybe <div n="P"/>. <br/> and <lb/> also might be better coded with <div>.
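A tag revision of this kind could be scripted along the following lines. This is only a sketch with the standard-library `xml.etree.ElementTree`; the `RENAMES` mapping, the function name, and the sample fragment are hypothetical, and only the <P/> to <div n="P"/> pair is taken from the suggestion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: old tag -> (new tag, attributes to set).
RENAMES = {"P": ("div", {"n": "P"})}

def rename_tags(root, renames=RENAMES):
    """Rename elements in place according to `renames`; text and tail are kept."""
    for elem in list(root.iter()):
        if elem.tag in renames:
            new_tag, attrs = renames[elem.tag]
            elem.tag = new_tag
            elem.attrib.update(attrs)
    return root

# Made-up fragment with a paragraph marker between two runs of text.
root = ET.fromstring("<body>first paragraph<P/>second paragraph</body>")
print(ET.tostring(rename_tags(root), encoding="unicode"))
```

Further pairs could be added to the mapping once the attribute values for them are agreed on.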
Some of the tags, such as <div>, are always used with attributes, and we need to examine the attributes, and aim to make the attribute names have similar meanings across dictionaries.
<lang n="X">), use similarly spelled values of X.
<lang>, <div>, maybe a couple of others.
@funderburkjim, one DTD to rule them all please.
@fxru There have been major changes in the XMLs for some time now. Can you post your code and regenerate the summary of the various tags, please?
https://github.com/sanskrit-lexicon/csl-pywork/issues/9 tracks the latest development. Closing this. The survey by fxru was the best outcome of this issue.
We are beginning to think about enhancements to the various Cologne Sanskrit Lexicon dictionaries, such as in the alternate head words repository.
It is likely that we will want to add markup to the xml form of dictionaries, since this is a good way to make the enhancements accessible.
There are also enhancements, such as the identification of foreign language words (such as Arabic, Russian, Greek) in various dictionaries. Currently there is no well-thought-out system for markup of these. In fact, there are various choices of markup that the different dictionaries use, and in some cases there is even no markup.
It seems that a good preliminary step is to review the existing markup, as specified in the DTD files for the xml form of the various dictionaries. It is likely that some simple changes in markup will be readily apparent, that will resolve some differences. Of course, there will also likely be some differences that may have to remain.