funderburkjim commented 8 years ago

We are beginning to think about enhancements to the various Cologne Sanskrit Lexicon dictionaries, such as in the alternate head words repository.

It is likely that we will want to add markup to the xml form of dictionaries, since this is a good way to make the enhancements accessible.

There are also enhancements, such as the identification of foreign language words (such as Arabic, Russian, Greek) in various dictionaries. Currently there is no well-thought-out system for markup of these. In fact, there are various choices of markup that the different dictionaries use, and in some cases there is even no markup.

It seems that a good preliminary step is to review the existing markup, as specified in the DTD files for the xml form of the various dictionaries. It is likely that some simple changes in markup will be readily apparent, that will resolve some differences. Of course, there will also likely be some differences that may have to remain.

funderburkjim commented 8 years ago

Recently I received an email from Felix Rau, who is interested in helping with our work on the Cologne Sanskrit Lexicon. With his permission, I am inserting the body of that email here.

I’m writing to you because I wanted to contribute some of my time to the Cologne Digital Sanskrit 
Lexicon. However, the Cologne Digital Sanskrit Lexicon is so active on Github that it is difficult 
to see what would be most helpful to you. Any pointers to where I can contribute something
 constructive are more than welcome.
 (My relevant skills are: XML, Sanskrit, basic XSLT, more read than write Python.)

Just to give you some context: I’m working in the linguistics department at the University of Cologne. 
I’m a linguist and Indologist by training, but despite having studied Sanskrit, it is more of a 
hobby of mine. Back then, I studied Tamil under Thomas Malten and worked with him in Orissa. 
Now, I work on Munda and Austroasiatic languages and on archiving audio-visual data from
language documentation.

Because of my background, I have been peripherally involved in the Lazarus Projekt of the CCeH. 
Basically, I gave the CCeH guys some feedback and in particular looked into a few Sanskrit-specific 
things for them. The actual work is over a year old, but I recently formatted it into a sort of
 report so it wouldn’t go to waste, well at least not without a digital trace. 
You can find the two texts here:

https://github.com/fxru/vedic-accent-in-lexicography
https://github.com/fxru/Sanskrit-lf-x-report

I would love to contribute and if you could point me to the parts where I can be the most help,
 I would be happy.

funderburkjim commented 8 years ago

I mentioned this dtd review task as a concrete way to get started (since Felix is comfortable with xml). And he's begun the review of the various dtds.

He's a native of Germany, so he can help us when we have issues involving the German language.

I also added Felix as a member of the sanskrit-lexicon Github organization.

Welcome, Felix!

gasyoun commented 8 years ago

Felix, Sie sind willkommen! @funderburkjim what's his nickname? If he has XML expertise, then DTD is great for him. I have a long love story with XML, but there are no children from that love, so let Felix do what he can do best. Sanskrit NLP will want all of his spare time and even more!

drdhaval2785 commented 8 years ago

Welcome Felix. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/181 Here we noted some things which we thought important about a year ago. Not much has progressed on these issues since then. Maybe you can pick something out of it. Best Wishes, Dr. Dhaval Patel

fxru commented 8 years ago

Thanks, धन्यवाद, and спаси́бо you all!

As Jim already said, I started looking into the different DTDs. I have made it through all but mw.dtd, but I only have a superficial understanding of the issues so far. As a first impression, I can report that no two DTDs are identical in regard to the inventory of elements (or their definition).

Some things seem rather obvious (for now). For example, 12 dictionaries use <br/> to indicate line breaks, while 18 use <lb/> (and 4 seem to use neither). Right now, I can’t see any good reason why there are two different elements encoding line breaks.

Another example is that skd and vcp use <C> and a number attribute to mark columns, while pe, inm, and vei use different elements – <C1>, <C2>, <C3>... – to achieve the same.

I will report on the comparison of the 35 DTDs here, but add additional issues in particular repositories when I stumble upon lexicon-specific issues, so that these things are documented somewhere. (For starters, I opened one issue in pwg: https://github.com/sanskrit-lexicon/PWG/issues/21 – if you think there is a better way to deal with things I come across, let me know.)

funderburkjim commented 8 years ago

Re <lb/> and <br/>
Probably no good reason to have both. There were two places where rather arbitrary choices crept in. First, in the digitizations that Thomas did, and second in the xml conversions that I did. Both were done with no specific eye to uniformity across dictionaries.

I hope that when these arbitrary choices are resolved, there will be a much smaller number of xml forms that will handle all the cases. I am imagining some kind of template-based framework into which all the dtds will fit, with an allied framework to generate the programs that create and display the xml forms from the .txt digitizations. My current candidate for a templating program to use in this is the Python mako templating system. But so far, this is thoughtware, not software.

When you document the instances involving multiple dictionaries, I suggest you do so in a separate issue in this repository. Also, for dictionary-specific issues, you can follow your pwg model if such a repository exists for the dictionary; if no such repository exists, we'll have to decide if a new one should be opened.

gasyoun commented 8 years ago

धन्यवाद might want to have a singular or plural ending, be it धन्यवादः!

report that not two DTDs are identical in regard to the inventory of element

It's good that at least they are UTF8 now, so you can easily compare. Just two years ago nobody but Jim even cared about the DTDs, because one could not even download a lot (only MW was there). But thanks to GitHub things have changed.

Let's replace it.

Another example is that skd and vcp use <C> and a number attribute to mark columns, while pe, inm, and vei use different elements – <C1>, <C2>, <C3>... – to achieve the same.

What would you think work best?

template-based framework into which all the dtds will fit

Yeah, 35 DTDs is 34 too many, agree.

fxru commented 8 years ago

I surveyed the different DTDs (minus mw for now) and documented the results here:

https://github.com/fxru/CDSL-DTD-comparison/blob/master/comparison_CDSL_DTDs.csv

I will keep updating it, but now you can take a look for yourself.

`<br/>` vs. `<lb/>`

The two are defined in the following DTDs:
- br: acc, ap, bhs, bor, bur, inm, pd, pe, pui, pwg, vei, wil
- lb: ae, ap, ap90, ben, bop, gst, ieg, krm, mci, md, mw72, mwe, pgn, shs, skd, snp, vcp, yat

However the element might not occur in the dictionaries themselves. (E.g. no occurrence of <br/> in bur.)

The crucial dictionary is ap.xml, because ap.dtd defines br as well as lb. Since ap.xml isn’t available online, I couldn’t check whether both elements actually occur. The question here is: Do both occur in ap? If yes, do both simply encode line breaks or do they encode different things?

If the two elements do not encode different things in ap, it would make sense to decide on either br or lb and homogenize the 29 DTDs for line breaks.

`pic` vs. `Picture`

pic occurs in ben and wil, while Picture occurs in vcp.

ben: 1 occurrence in the entry vatsa, empty tag, references a png file (<pic name='vatsa.png'/>)

wil: 1 occurrence in the entry svastika, non-empty (<pic>svastica</pic>). This occurrence could actually be replaced by a unicode character, e.g. 卐 U+5350 wàn (Chinese block) or ࿕ U+0FD5 RIGHT-FACING SVASTI SIGN (Tibetan block)

vcp: 71 occurrences in 4 entries, all are empty (<Picture/>) and don’t provide any further information.

`C1` vs. `C`+`@n`

The case of C1 up to C11 in inm, pe, and vei vs. C plus a number attribute @n in skd and vcp is slightly more complex, as inm also has C2H and C3H.

However, the bigger problem is that those actually encode tables and are some of the most complex parts I have seen in the dictionaries so far.

`A` vs. `Arabic`

A in pw and wil as well as Arabic in mw, all marking Arabic text spans, seems to be another candidate for harmonization of the different DTDs, but I haven’t looked into it further.

Unused elements

The other low-hanging fruit are elements defined in the DTDs but not used in the XML files. I would go hunting for these elements. I saw that Jim removed UL already from pwg. There seem to be several more instances of unused elements around.
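Hunting for unused elements starts with a list of the elements each DTD declares. A minimal sketch of that step, assuming a plain regular expression over the DTD text is good enough; the function name `declared_elements` and the sample DTD fragment are hypothetical and not copied from any of the dictionary DTDs:

```python
import re

# Matches the element name in declarations such as <!ELEMENT lb EMPTY>.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([^\s>]+)")


def declared_elements(dtd_text):
    """Return the set of element names declared in a DTD string."""
    return set(ELEMENT_DECL.findall(dtd_text))


# Hypothetical DTD fragment, for illustration only.
sample_dtd = """
<!ELEMENT body (#PCDATA | lb | br | pic)*>
<!ELEMENT lb EMPTY>
<!ELEMENT br EMPTY>
<!ELEMENT pic EMPTY>
"""

print(sorted(declared_elements(sample_dtd)))
```

This does not skip declarations inside DTD comments and does not expand parameter entities, so the result would need a manual check for DTDs that use either.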

funderburkjim commented 8 years ago

This is helpful.

I found your big dtd table hard to read, mainly because the rows are so long.

I wonder if you could extract from your big table another table.
It might be sort of like a transpose (matrix transpose) of the given big table. The columns would be dictionary codes. The rows would be tag names (perhaps with some brief annotation for short names like 'A', indicating the purpose, if known, such as 'Arabic'). The entry would be some binary indicator (Y/N, YES/NO, +/-).

There might be an extra 'count' column, right after the tag name, which would be the total number of dictionaries where the tag is found.

This might further help us see the whole picture of the dtds.
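The transposed layout described above can be sketched as follows. The function name `transpose_inventory` is hypothetical, and the sample input is only a small subset of the tags mentioned in this thread for ben, vcp, and wil, not a complete inventory:

```python
def transpose_inventory(tags_by_dict):
    """Build rows of [tag, count, Y/N per dictionary] from {dict_code: set_of_tags}."""
    dict_codes = sorted(tags_by_dict)
    all_tags = sorted(set().union(*tags_by_dict.values()))
    rows = [["tag", "count"] + dict_codes]
    for tag in all_tags:
        flags = ["Y" if tag in tags_by_dict[code] else "N" for code in dict_codes]
        # The count column is the number of dictionaries that define the tag.
        rows.append([tag, str(flags.count("Y"))] + flags)
    return rows


# Illustrative subset only.
sample = {
    "ben": {"lb", "pic"},
    "vcp": {"lb", "Picture"},
    "wil": {"br", "pic"},
}

for row in transpose_inventory(sample):
    print(",".join(row))
```

Each output row is one tag, so the rows stay short no matter how many elements a single DTD defines.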

I like the idea of dealing with the 'low-hanging fruit'.

Removing unused tags seems like the lowest-hanging fruit. Do you have any tool that reads an xml file and produces some kind of tag analysis? This would help in identifying the unused tags (i.e., the tags that appear in the dtd for a dictionary, but that do not occur within the xml file for the dictionary).

Seems like there might be such an off-the-shelf tool, but if not, a simple form of such a tool probably could be written using the lxml Python library.
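A minimal sketch of such a tag counter. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs; `lxml.etree` offers a similar `iterparse`. The function name `count_tags` and the sample XML are hypothetical:

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_source):
    """Count how often each element name occurs in an XML file or file object."""
    counts = Counter()
    # iterparse streams the document, so a large dictionary file
    # does not have to be held in memory as one tree.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop children and text that have already been counted
    return counts


# Hypothetical miniature document, for illustration only.
sample_xml = b"<root><H1><lb/><lb/><pic name='vatsa.png'/></H1></root>"

counts = count_tags(io.BytesIO(sample_xml))
for tag, n in sorted(counts.items()):
    print(tag, n)
```

Any tag declared in a dictionary's DTD that is missing from this count is a candidate for removal.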

gasyoun commented 8 years ago

This occurrence could actually be replaced by a unicode character, e.g. 卐

Agree.

But as

ben: 1 occurrence in the entry vatsa, empty tag, references a png file (<pic name='vatsa.png'/>)

Has a graphic picture, so swastika is not the only one, I would have both as pictures.

Since ap.xml isn’t available online, I couldn’t check whether both elements actually occur.

Do you use skype? I'm gasyoun there. I can send you the link.

A in pw and wil as well as Arabic in mw

I would go for Arabic, because there are too many possibilities to read what A is.

The other low hanging fruit are elements defined in the DTDs, but not used in the XML files. I would go hunting for these elements.

Fully agree.

There might be an extra 'count' column, right after the tag name, which would be the total number of dictionaries where the tag is found.

If ever made, count should be there, agree.

funderburkjim commented 8 years ago

Regarding the display suggestion made above:

one line per tag element. Instead of the 'binary indicator', make a list of the dictionary codes that contain that tag element. So the suggested format might look like:
```
key1:ap,ap90,<AND ALL THE REST>,wil,yat
br: acc, ap, bhs, bor, bur, inm, pd, pe, pui, pwg, vei, wil
etc.
```
Such a format would be easier for programs to work with than the 'binary indicator' idea. This tag-occurrence file would be derived from the DTDs.
It should probably be my task to make a similar file of tag-usage derived from the XMLs. So, skip the idea of looking for a tool to summarize the tags. The reason I should do it is that I have direct programmatic access to all the xml files on the Cologne server; otherwise, you'd have to first download all the xml files.

A program can compare these two tag-dictionary files, and find any cases where there is a tag in the DTD that does not occur in the corresponding XML. Such DTDs can be altered first.
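A minimal sketch of that comparison, assuming both files use the `tag: code, code, ...` line format suggested above. The function names and the sample contents are hypothetical; the sample only echoes two cases mentioned earlier in this thread (no br in bur, UL unused in pwg):

```python
def parse_tag_file(text):
    """Parse lines like 'br: acc, ap, bhs' into {tag: set_of_dictionary_codes}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        tag, codes = line.split(":", 1)
        result[tag.strip()] = {c.strip() for c in codes.split(",") if c.strip()}
    return result


def unused_tags(dtd_tags, xml_tags):
    """Return {dict_code: tags declared in its DTD but absent from its XML}."""
    unused = {}
    for tag, codes in dtd_tags.items():
        for code in codes - xml_tags.get(tag, set()):
            unused.setdefault(code, []).append(tag)
    return {code: sorted(tags) for code, tags in sorted(unused.items())}


# Hypothetical file contents, for illustration only.
dtd_text = "br: bur, pwg, wil\nUL: pwg"
xml_text = "br: pwg, wil"

print(unused_tags(parse_tag_file(dtd_text), parse_tag_file(xml_text)))
```

The result is grouped by dictionary code, so each entry is a to-do list for one DTD.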

funderburkjim commented 8 years ago

Re the svastika character. Agree it should be replaced by one of the Unicode characters. We might still want to keep some tag whose text would be this character.

We don't have a picture, and it would be awkward to introduce pictures, I think. So let's stick with unicode characters.

Probably the name of the tag should not be <pic>. There are some special characters (metric long-shorts) that have been used in some dictionaries (can't remember which at the moment), and I think these are currently untagged. Maybe some tag name could identify these as well as the svastika. Perhaps this could be something like a <special> tag, maybe with an attribute indicating the type of special character, such as <special type="svastika"> or <special type="meter">. Another tag name possibility would be <symbol>. Maybe there is some TEI-approved tag name for such situations?

gasyoun commented 8 years ago

Maybe there is some TEI-approved tag name for such situations?

http://cikitsa.blogspot.ru/ might know.

fxru commented 8 years ago

Thanks Marcis, I got the xml from Jim.

Just some short remarks on the pic/Picture elements.

I think the case in wil is special, because it is an inline character in the original.

In TEI, the appropriate element should be c (character, http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-c.html). So, that would be <c n="swastika">࿕</c>

The <Picture/> elements in vcp are stand-ins for figures:

There is no good way to represent this in xml, except by embedding pictures (or drawing svgs or something along that line).

In ben the case is more similar to wil. The character/picture is more or less an inline character, but there is no unicode code point for this symbol (as far as I know). So the solution that is currently in place, <pic name='vatsa.png'/>, seems appropriate.

(As for the other issues, I will try to follow Jim’s suggestions about the DTD table and the unused elements as soon as possible, but I’m heading to China on Saturday to teach language documentation. I don’t know whether I can get it done before.)

gasyoun commented 7 years ago

drawing svgs

Oh that would be a good solution. Or Corel trace.

no unicode code point for this symbol (as far as I know).

Indeed none.

I don’t know whether I can get it done before

Work stuck?

drdhaval2785 commented 7 years ago

#116 requires you to complete this job @fxru .

Only once this is done can #117 and #98 take off fully. Please give us your XML and DTD skills.

Let us have only one DTD please.

funderburkjim commented 6 years ago

Current status of xml markup

Now that the meta-iast conversion work is winding down, we can fruitfully revisit the one-dtd topic.

As a first contribution, here is a gist summary of the xml tags that occur in the various dictionaries.

This gist has two files:

check_tags_inventory.txt has a line for each tag which occurs in the xxx.xml form of each dictionary xxx. It is a csv file (':' separator), with fields:
- dictionary code
- tag-attribute : There may be no attribute. A tag occurring with different attributes will contribute more than one line. For instance:
  - acc:hwtype-n:1592 xml tag like <hwtype n="X">
  - acc:hwtype-ref:1592 xml tag like <hwtype ref="Y"> In this example the tags are probably like <hwtype n="X" ref="Y">
- count of occurrences of the tag-attribute for the given dictionary

summary.txt provides one useful summary of the information in the inventory file. It shows a 'by tag' summary. Here we ignore the attribute(s) and consider only the tag without attribute.
- tags that occur in all the xxx.xml files
- tags that occur in exactly one of xxx.xml files, and which dictionary, and how many times
- tags that occur in some intermediate number of dictionaries, with a list of those dictionaries; this section is sorted by the number of dictionaries.
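A minimal sketch of how a 'by tag' summary of this kind could be derived from inventory lines. It assumes that the first hyphen in the tag-attribute field separates tag from attribute; the function names are hypothetical, and apart from the two acc:hwtype lines the sample data is invented for illustration:

```python
from collections import defaultdict


def dictionaries_by_tag(inventory_text):
    """Map each tag (attribute ignored) to the set of dictionaries using it."""
    by_tag = defaultdict(set)
    for line in inventory_text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        dict_code, tag_attr, _count = parts
        tag = tag_attr.partition("-")[0]  # 'hwtype-n' -> 'hwtype'
        by_tag[tag].add(dict_code)
    return by_tag


def classify(by_tag, all_dicts):
    """Split tags into: in all dictionaries, in exactly one, in some."""
    in_all, in_one, in_some = [], [], []
    for tag, codes in sorted(by_tag.items()):
        if codes == set(all_dicts):
            in_all.append(tag)
        elif len(codes) == 1:
            in_one.append((tag, next(iter(codes))))
        else:
            in_some.append((tag, sorted(codes)))
    in_some.sort(key=lambda item: len(item[1]))  # fewest dictionaries first
    return in_all, in_one, in_some


# Only the two acc:hwtype lines come from the description above.
sample_inventory = """acc:hwtype-n:1592
acc:hwtype-ref:1592
acc:body:100
ben:body:200
wil:body:300
ben:pic-name:1
wil:pic:1
ben:P:5"""

summary = classify(dictionaries_by_tag(sample_inventory), {"acc", "ben", "wil"})
print(summary)
```

The sketch leaves out the occurrence counts that the real summary reports for single-dictionary tags.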

Some uses.

Things to look for are where we can revise some tag choices. For instance, <P/>, which occurs only in ap90 and ben, might be replaced by the <div> tag with an appropriate attribute -- maybe <div n="P"/>. <br/> and <lb/> also might be better coded with <div>.

Some of the tags, such as <div>, are always used with attributes, and we need to examine the attributes, and aim to make the attribute names have similar meanings across dictionaries.

General principles

- Reduce the number of different tag names.
- Use a given tag name consistently across dictionaries.
- Similarly, use a given tag-with-attribute consistently.
- Similarly, for tags with a small number of values for a given attribute name (e.g. <lang n="X">), use similarly spelled values of X.
  - A more detailed summary will be required for this step -- but it will only be relevant for a small number of tags, such as <lang>, <div>, maybe a couple of others.

drdhaval2785 commented 5 years ago

@funderburkjim, one DTD to rule them all please.

drdhaval2785 commented 5 years ago

@fxru There have been major changes in the XMLs recently. Can you post your code and regenerate the summary of the various tags please?

drdhaval2785 commented 3 years ago

https://github.com/sanskrit-lexicon/csl-pywork/issues/9 tracks the latest developments. Closing this. The survey by fxru was the best outcome of this issue.
