langdoc / FRechdoc

Technical documentation about tools used in Freiburg. ELAN, XeLaTeX and RMarkdown templates.

Are tier definitions for all transcription systems satisfactory? #6

Closed nikopartanen closed 7 years ago

nikopartanen commented 7 years ago

We now have at least the following linguistic types:

phonet-upaT phonol-cyrT translit-latT

However, this is maybe a bit inelegant, as UPA can be used at either phonetic or phonemic exactness, and the same goes for other systems like IPA. And transliteration can be done from anything to anything. Can we understand this as an abstract pattern?

[exactness/type/granularity][script]T

If it is a pattern like this, then it should probably be described as such in the wiki.

I think we discussed that we should also have a convention here for marking more clearly what the source of each text version is. Often these could also be generated on the fly, but then those would be transliterations in all cases, to differentiate them from real transcriptions that come from other sources? Maybe the tag gold could also be used here to mark what has been manually corrected?

jeutzsch commented 7 years ago

This is a good example of what ELAN-LingTypes were originally designed to cover, and it's tempting to use them in that way here. It seems so redundant to create so many individual types when the behavior of these tiers is always going to be the same, as far as I can tell. But anyway, perhaps the naming pattern can be something like this:

type-standardT

where:
type = phonet/phonol/translit (...)
standard = upa/cyr/lat/ipa (...)
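To make the proposed type-standardT pattern concrete, here is a minimal Python sketch that splits a type name into its two parts and looks them up in a small vocabulary. The function names, the regular expression and the glosses in the two dictionaries are my own assumptions for illustration; only the abbreviations themselves come from this thread.

```python
import re

# Hypothetical controlled vocabulary. The keys are the abbreviations used in
# this thread; the glosses are assumptions added for illustration.
KINDS = {
    "phonet": "phonetic transcription",
    "phonol": "phonological transcription",
    "translit": "transliteration",
}
STANDARDS = {
    "upa": "Uralic Phonetic Alphabet",
    "ipa": "International Phonetic Alphabet",
    "cyr": "Cyrillic",
    "lat": "Latin",
}

# Matches names such as "phonet-upaT": lowercase kind, hyphen,
# lowercase standard, and a final capital T marking a linguistic type.
TYPE_PATTERN = re.compile(r"^(?P<kind>[a-z]+)-(?P<standard>[a-z]+)T$")


def parse_type(name):
    """Return (kind, standard) for a type name, or None if it does not fit."""
    match = TYPE_PATTERN.match(name)
    if match is None:
        return None
    return match.group("kind"), match.group("standard")


def describe_type(name):
    """Return a readable description, or None if the name does not fit."""
    parsed = parse_type(name)
    if parsed is None:
        return None
    kind, standard = parsed
    return "{} ({})".format(
        KINDS.get(kind, "unknown kind"),
        STANDARDS.get(standard, "unknown standard"),
    )
```

Names that do not follow the pattern, such as wordT or orthT, simply return None here, so a script could report them for manual checking rather than guess.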

nikopartanen commented 7 years ago

I don't think all possible linguistic types need to be in the file. I'm more interested in how easy it is to deduce what a random type and tier are when someone encounters them in one file. Ideally there would be a controlled vocabulary or short explanatory entries somewhere for the terms used here.

There are maybe also different approaches to modifying the system so that it refers more explicitly to data sources, i.e. publications. I think @meehkal mentioned this at some point? If something like this were done, then the place where the source publication is described could carry an explanatory note on the transcription system used. Also, with things like UPA we don't really have one transcription system, but a chaotic collection of different conventions and characters used by different authors in specific ways. So there is Rédei-UPA and Vászolyi-UPA and many more. Cyrillic can also be one of many systems. But I think the name in the tier/type can be a bit vague, as long as the explanation can be retrieved somehow. In order to convert from one system to another, it has to be described quite clearly which system is which. If we think about linked data, then the only thing really needed is a link to somewhere where further explanation can be found in a machine-readable format. It can be quite simple in the end.

The practical relevance is mainly that using transcription mode in ELAN demands distinct types. Then in XPath one has to use something like:

//TIER[starts-with(@TIER_ID, 'translit-LAT')]

Instead of:

//TIER[@LINGUISTIC_TYPE_REF='translit-latT']

Which of course makes no real difference. In an ideal world there would not even be a capitalization difference here, as the same prefix would then map to both type and tier. But especially if we allowed the same type to be used for tiers with different purposes in the same file, only the TIER_ID prefix would make those tiers distinguishable anyway.

Maybe I will also change all my scripts that depend on linguistic types to rely more on tier name conventions, as those are maybe used more consistently across projects.
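For scripts, the two selections could look roughly like this in Python. This is only a sketch under assumptions: the tiny XML string stands in for a real .eaf file, and the tier IDs with an @speaker suffix are made up for the example. The standard library's ElementTree handles the attribute-equality predicate, but it does not implement starts-with(), so the prefix test is done in plain Python here (lxml's xpath() would accept the XPath expression directly).

```python
import xml.etree.ElementTree as ET

# Stand-in for an ELAN file. Only the attributes discussed in this thread are
# included, and the tier IDs are hypothetical.
EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="translit-LAT@S1" LINGUISTIC_TYPE_REF="translit-latT"/>
  <TIER TIER_ID="translit-LAT@S2" LINGUISTIC_TYPE_REF="translit-latT"/>
  <TIER TIER_ID="word@S1" LINGUISTIC_TYPE_REF="wordT" PARENT_REF="translit-LAT@S1"/>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(EAF)

# Selection by linguistic type: ElementTree supports [@attrib='value'].
by_type = root.findall(".//TIER[@LINGUISTIC_TYPE_REF='translit-latT']")

# Selection by tier ID prefix: filter in Python instead of using starts-with().
by_prefix = [
    tier for tier in root.iter("TIER")
    if tier.get("TIER_ID", "").startswith("translit-LAT")
]

ids_by_type = [tier.get("TIER_ID") for tier in by_type]
ids_by_prefix = [tier.get("TIER_ID") for tier in by_prefix]
```

In this sample both selections return the same two tiers, which is the "no real difference" case. They would start to diverge as soon as one type is reused for tiers with different prefixes.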

nikopartanen commented 7 years ago

Just to update: I'm now setting up a template to be used at SGU on my course, and I think the main transcription tier there will be called phonol-knt@ with type phonol-kntT, which refers to Коми научнӧй транскрипчия (Komi scientific transcription), the specific system used.

People on the course are not transcribing in orthography, so I will not have an orthT tier there at all. However, this results in a change where the Russian translation tier, if there were one, would be directly under the knt tier. Similarly, the word tier is also directly under it. So this is what it looks like:

[Image: SGU tier setup]

I kind of like this as it is very simple and very easy to extend, and still it doesn't conflict in any way with our current setup. Well, the translations would, but that also leads to more complex questions, such as whether the different word-level tiers should be distinguished by their own types, or whether they can all be just wordT, with the type of the parent being enough to resolve what is going on. The same goes for translations: are there cases where the translations of orthT and phonol-cyrT etc. would need to be differentiated with their own types? The whole scenario is rare, but of course it occurs. For example, with lots of older material we may have the old-fashioned original Finnish and the modern Finnish, etc.
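The "type of the parent is enough" option can be sketched as a small lookup that follows PARENT_REF one step up. The sample XML and its tier IDs are hypothetical, and parent_type is a helper name invented for this example, so this only shows the idea and is not a description of how any existing script works.

```python
import xml.etree.ElementTree as ET

# Hypothetical file with two word tiers that share the type wordT but hang
# under parents of different types.
EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="orth@S1" LINGUISTIC_TYPE_REF="orthT"/>
  <TIER TIER_ID="word-orth@S1" LINGUISTIC_TYPE_REF="wordT" PARENT_REF="orth@S1"/>
  <TIER TIER_ID="phonol-knt@S1" LINGUISTIC_TYPE_REF="phonol-kntT"/>
  <TIER TIER_ID="word-knt@S1" LINGUISTIC_TYPE_REF="wordT" PARENT_REF="phonol-knt@S1"/>
</ANNOTATION_DOCUMENT>"""


def parent_type(root, tier_id):
    """Return the linguistic type of a tier's parent, or None if there is none."""
    tiers = {tier.get("TIER_ID"): tier for tier in root.iter("TIER")}
    tier = tiers.get(tier_id)
    if tier is None:
        return None
    # Top-level tiers have no PARENT_REF, so this lookup then yields None.
    parent = tiers.get(tier.get("PARENT_REF"))
    if parent is None:
        return None
    return parent.get("LINGUISTIC_TYPE_REF")


root = ET.fromstring(EAF)
```

With this, parent_type(root, "word-knt@S1") and parent_type(root, "word-orth@S1") give different answers even though both tiers are wordT, so separate word-level types would only be needed when two such tiers share the same parent.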