Closed AnnaParla closed 1 year ago
Yes, it can happen more often than otherwise (uk utterances labeled as ru). We can use this issue to record it (please use permanent links)
https://github.com/ufal/ParlaMint-UA/blob/d58289b7f42621bd5a052ab8127dd17fece12911/ParlaMint-UA.TEI/2012/ParlaMint-UA_2012-12-04-m0.xml#L565-L569
should be xml:lang="ru"
<(please use permanent links)>
Trying to... What is the most effective way of extracting permanent links from this page (group of files)? https://github.com/ufal/ParlaMint-UA/commit/d58289b7f42621bd5a052ab8127dd17fece12911#diff-1e06b60ff00a76857f9a4f4a4aaf9a524d0773038051b38e35d7e6e87f506193R338 I do not see the three dots on the left here.
Or shall I do it from a different page?
browse data, not diff: https://github.com/ufal/ParlaMint-UA/tree/data/ParlaMint-UA.TEI then
Ok, got it
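For reference, a permanent link is just the file URL pinned to a full commit hash instead of a branch name, with an optional line anchor. A minimal sketch, assuming the usual GitHub `blob/<commit>/<path>#L<start>-L<end>` layout; the helper name `github_permalink` is made up for illustration:

```python
def github_permalink(owner, repo, commit, path, start, end=None):
    """Build a link to a line (or line range) of a file at a fixed commit."""
    url = f"https://github.com/{owner}/{repo}/blob/{commit}/{path}#L{start}"
    # A range is written as #L<start>-L<end>; a single line needs no suffix.
    if end is not None and end != start:
        url += f"-L{end}"
    return url


link = github_permalink(
    "ufal",
    "ParlaMint-UA",
    "d58289b7f42621bd5a052ab8127dd17fece12911",
    "ParlaMint-UA.TEI/2012/ParlaMint-UA_2012-12-04-m0.xml",
    565,
    569,
)
print(link)
```

Because the commit hash is fixed, the line numbers keep pointing at the same text even after the file changes on the branch.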
Checking some of the samples you sent, it seems that even the chairman uses ru. I set the chairman's speeches to uk when they are short; I thought the chairman uses uk...
https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L95-L97
You can stop manually listing wrong language identifications; this is probably enough for testing.
Can you list some common words that distinguish uk from ru? I can probably add them to the script.
Letter ы for ru, which is very frequent. Never used in ua.
Letter і for ua, which is very frequent. Never used in ru.
But there are some short utterances that do not have these distinctive letters. E.g. спасибо / Спасибо in ru for "thank you"
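To make the limitation concrete, here is a minimal sketch of the letter test. It is not the code in lang-detect.pl; the function name and the choice to return `None` for undecidable input are assumptions made for illustration.

```python
RU_LETTER = "\u044b"  # Cyrillic ы, frequent in ru, not used in uk
UK_LETTER = "\u0456"  # Cyrillic і, frequent in uk, not used in ru


def by_distinctive_letter(text):
    """Return "ru" or "uk" if exactly one marker letter occurs, else None."""
    lowered = text.lower()
    has_ru = RU_LETTER in lowered
    has_uk = UK_LETTER in lowered
    if has_ru != has_uk:
        return "ru" if has_ru else "uk"
    return None  # neither letter, or both (mixed text or typos)


for sample in ["Спасибо", "выплаты", "спасиб\u0456"]:
    print(sample, by_distinctive_letter(sample))
```

A short utterance such as Спасибо contains neither letter, so this test alone cannot decide it.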
< I thought chairman uses uk... >
Commonly he does. But in 2012-2014 Rada, Rybak could switch to ru for a sentence or two, or for even a few words in a sentence (we do not touch those cases).
Question: -- this communist leader switches multiple times and says most of the last two sentences in ru but over 50% of this speech is in ua. We shall ignore these cases, right?
And here is another example of horrible mixture:
The ы / і distinction is useful, but not when there are spelling mistakes like here. The whole utterance is in ru.
OK, so I suggest these changes:
remove these lines = allow short chairman speeches: https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L95-L97
count the number of significant characters and let the more frequent ones determine the language = expecting a small number of typos.
Currently it first expects uk and then looks for ru characters:
https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L168-L172
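The counting idea could look roughly like the sketch below. This is an illustration, not the Perl implementation: the character sets are the letters named in this thread, and falling back to uk on a tie is an assumption based on the script expecting uk first.

```python
UK_CHARS = set("\u0456\u0457\u0454\u0491")  # і ї є ґ
RU_CHARS = set("\u044b\u044d\u044a")        # ы э ъ


def guess_by_counts(text, default="uk"):
    """Pick the language whose marker letters occur more often."""
    lowered = text.lower()
    uk = sum(ch in UK_CHARS for ch in lowered)
    ru = sum(ch in RU_CHARS for ch in lowered)
    if uk == ru:
        return default  # no evidence either way (assumed fallback)
    return "uk" if uk > ru else "ru"


print(guess_by_counts("Это не внешнее, это облуживание долга. Это процентные выплаты."))
print(guess_by_counts("Дякую."))
```

With a majority vote, a single stray ы in an otherwise Ukrainian utterance no longer flips the label, which is the point of expecting a small number of typos.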
Ok, let's try this and see the result. Will you redo the whole corpus this way?
the result of removing these lines https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L95-L97 is here: 101be2e
It does not look like a good idea: too many changes, and a lot of recent chairman speeches are tagged with ru too...
Well, some problems were fixed, others were created. I don't mind fixing some portions manually, but I cannot edit this version https://github.com/ufal/ParlaMint-UA/commit/101be2ed72c4bbcf714ba7c51bed63bbd72aada6
Do you want me to edit a few sittings so that you can decide whether these changes have caused more harm or provided more solutions?
Is there a way to decline or accept changes in this format?
All changes should be declined in:
All changes should be accepted in:
Mixed results: 03t-tei-text-lang/2012/ParlaMint-UA_2012-12-13-m0.xml
Most short utterances by the chairmen in https://github.com/ufal/ParlaMint-UA/commit/101be2ed72c4bbcf714ba7c51bed63bbd72aada6 are in Ukrainian, maybe 5% are in ru (if I eyeball it accurately)
< count the number of significant characters and let the more frequent ones determine the language = expecting a small number of typos. Currently it first expects uk and then looks for ru characters >
implemented in cf92cf7
But there are still misclassifications.
list of changes done so far: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...cf92cf79fd5e982f2b05685017cd3a8ea0f960ac
The most common Russian paragraphs that don't contain ы/э/ъ:
find Data/tei-text-lang/_ISSUE-47/ -type f| xargs cat| grep -F 'seg'|grep -F 'lang="ru"'|sed 's/.*"ru">//;s@</seg>@@'|sort|uniq -c|sort -n|grep -v '[ыэъ]'
CNT PARAGRAPH:
16 Дайте 10 секунд.
17 30 секунд завершити дайте, будь ласка.
17 30 секунд. Завершуйте.
17 30 секунд, завершуйте, будь ласка.
17 Врахувати.
17 Дайте завершити 10 секунд.
18 15 секунд.
18 Будь ласка, 30 секунд додайте.
18 Кому?
19 10 секунд дайте завершити.
22 10 секунд, завершуйте, будь ласка.
22 Добре.
23 Будь ласка, завершуйте, 30 секунд.
23 Врахована.
24 10 секунд завершити.
32 Князевич Руслан Петрович.
35 10 секунд, будь ласка, завершуйте.
35 Поляков Антон Едуардович.
40 Будь ласка, 30 секунд. Завершуйте.
41 Цимбалюк Михайло Михайлович.
43 15 секунд завершити.
48 Дайте 10 секунд завершити.
49 Будь ласка, 10 секунд, завершуйте.
52 30 секунд, будь ласка, завершуйте.
53 Да.
61 Спасибо.
72 30 секунд, завершуйте.
74 Так.
85 10 секунд, завершуйте.
89 30 секунд завершити.
89 Дякую вам.
122 Прошу.
125 Будь ласка, 30 секунд, завершуйте.
152 Прошу передати.
188 Завершуйте.
269 10 секунд.
442 30 секунд.
753 Дякую.
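The same count can be approximated without the shell. The sketch below is an assumption-laden equivalent of the pipeline above: it expects each `seg` element on a single line, as the `grep`/`sed` version does, and the function name is invented for illustration.

```python
import re
from collections import Counter

# Text of <seg> elements whose xml:lang attribute is "ru".
SEG_RU = re.compile(r'<seg\b[^>]*\bxml:lang="ru"[^>]*>(.*?)</seg>')
MARKERS = "\u044b\u044d\u044a"  # ы э ъ (lowercase only, like the grep -v pattern)


def frequent_ru_without_markers(lines):
    """Count ru-labeled paragraphs that contain none of the marker letters."""
    counts = Counter()
    for line in lines:
        for text in SEG_RU.findall(line):
            if not any(ch in MARKERS for ch in text):
                counts[text] += 1
    # Ascending by count, like `sort -n` on the `uniq -c` output.
    return sorted(counts.items(), key=lambda item: item[1])


sample = [
    '<seg xml:id="a" xml:lang="ru">Дякую.</seg>',
    '<seg xml:id="b" xml:lang="ru">Дякую.</seg>',
    '<seg xml:id="c" xml:lang="ru">Это выплаты.</seg>',
    '<seg xml:id="d" xml:lang="uk">Прошу.</seg>',
]
result = frequent_ru_without_markers(sample)
print(result)
```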
In https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...cf92cf79fd5e982f2b05685017cd3a8ea0f960ac all long utterances look ok, but about 90%+ of short utterances by the Chairman should be in Ukrainian!
Words as follows are all in Ukrainian:
дякую дякую вам прошу прошу, [followed by "PER name"] прошу передати я вас прошу завершуйте дайте завершити хвилину прошу, народний депутат [followed by "PER name"] продовжуйте запрошую до слова побратим Сиротюк виступайте будь ласка добре врахована яку? яка ... пройшли вже оце оприлюдню це брехня я проводив ну,так, нема, ну, так який номер? давайте я сприймаю критику ... and names of MPs standing alone (so far I came across 1 case where chairman Rybak clearly gives the floor to a ru speaking MP in ru)
< the most common Russian paragraphs that don't contain ы/э/ъ >
Because most of them are in Ukrainian!!!
< Because most of them are in Ukrainian!!! >
Yes, but I am not able to mark them as Ukrainian because they also don't contain і/ї/є/ґ. The automatic language result is Russian: they are too short to determine the language, and there is no other context in the whole utterance.
< Words as follows are all in Ukrainian: [...] and names of MPs standing alone >
At the time of language identification I don't have information about named entities.
I don't want a list of phrases. I want a safe list of word forms that positively identify Ukrainian. In other words: if any word form from the list appears in the text, then it is Ukrainian. You can also give me words that identify Russian.
My idea is to do identification in this order:
If this is not enough, I can also remove all words that start with a capital letter and are not preceded by "." (= remove proper names).
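A rough sketch of both ideas together (word-form list plus dropping capitalized words that do not start a sentence). The word set below is a hypothetical sample, not the list in lang-detect.pl, and treating "!" and "?" like "." is an added assumption.

```python
import re

UK_WORDS = {"дякую", "завершуйте", "добре", "врахована"}  # hypothetical sample
SENTENCE_END = {".", "!", "?"}


def strip_proper_names(text):
    """Lowercased words, minus capitalized words that are not sentence-initial."""
    kept = []
    sentence_start = True
    for tok in re.findall(r"\w+|[.!?]", text):
        if tok in SENTENCE_END:
            sentence_start = True
            continue
        if sentence_start or not tok[0].isupper():
            kept.append(tok.lower())
        sentence_start = False
    return kept


def is_uk_by_wordlist(text):
    """True if any remaining word form is on the uk list."""
    return any(word in UK_WORDS for word in strip_proper_names(text))


print(strip_proper_names("Будь ласка, Артур Герасимов."))
print(is_uk_by_wordlist("Дякую вам."))
```

Note that a name at the very start of an utterance (Князевич Руслан Петрович.) survives the filter, because it cannot be told apart from an ordinary sentence-initial word.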
added uk word list: https://github.com/ufal/ParlaMint-UA/blob/e85e39f6edd83580cf1ccb9fc33f345a6190baaa/Scripts/lang-detect.pl#L23-L35
current state: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...e53df326295bd150f398df43f2969813b6e330c2
The most common Russian paragraphs that don't contain ы/э/ъ:
4 2 хвилини.
4 3-я.
4 Артур Герасимов.
4 Будь ласка, Артур Герасимов.
4 Все.
4 Доброго дня!
4 Кишкар Павло Миколайович.
4 Коментуйте.
4 Крулько.
4 Лерос Гео Багратович.
4 Номер?
4 Номер поправки?
4 Олег Березюк.
4 Передайте.
4 Передаю слово Олегу Ляшку.
4 Руслан Петрович Князевич.
4 Цимбалюк.
4 Я завершив.
4 Який номер?
4 Я не почув, кому?
5 Вибачте.
5 Вона врахована.
5 Враховано.
5 Геращенко.
5 Матвиенков, 57 округ, Мариуполь.
5 Михайло Головко.
5 Руслан Князевич.
6 Будь ласка, народний депутат Долженков.
6 Будь ласка, передайте.
6 Князевич.
6 Олена Сотник, будь ласка.
6 Правильно.
6 Так, так.
7 Героям слава!
7 Синютка Олег Михайлович.
8 8-а.
8 Гео Багратович Лерос.
8 Передайте, будь ласка.
8 Так-так.
9 Ще раз?
10 Лаба Михайло Михайлович.
10 Народний депутат Долженков.
11 Дайте договорити.
11 Княжицький.
11 Ще раз.
12 Величкович Микола Романович.
12 Гетманцев Данило Олександрович.
12 Ще раз, кому?
17 Врахувати.
18 Кому?
23 Врахована.
30 Поляков Антон Едуардович.
31 Князевич Руслан Петрович.
41 Цимбалюк Михайло Михайлович.
52 Да.
74 Так.
@AnnaParla I would like to regenerate the whole corpus, but I need to know if the language detection is ok
All of the short utterances under <the most common Russian paragraphs that dont contain ы/э/ъ> are in Ukrainian except for two:
5 Матвиенков, 57 округ, Мариуполь. --- ru
52 Да. --- ru
Maybe at this point it makes sense to create the following dependency:
All short utterances that do not contain ы/э/ъ are in Ukrainian unless they have one or more of the following words:
спасибо благодарю пожалуйста передать начать продолжать (продолжает) закончить (заканчивайте) настаивать подготовиться действовать (действую) добавить (добавте) занять подтвердить внимание коллега (коллеги, коллеге) вопрос Александр дальше согласно только большой (большое, большая)
Also treat the utterance as ru if it contains a standalone и: it means "and" in ru (although spelling mistakes are possible).
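Written out as code, the proposed rule might look like this. It is only a sketch: the word set is a small subset of the list above, and the function name is invented.

```python
import re

RU_CHARS = "\u044b\u044d\u044a"  # ы э ъ
RU_AND = "\u0438"                # Cyrillic и, "and" in ru when it stands alone
RU_WORDS = {"спасибо", "благодарю", "пожалуйста", "передать", "начать",
            "закончить", "внимание", "вопрос", "дальше", "согласно", "только"}


def classify_short(text):
    """uk by default; ru on a ru-only letter, a listed word, or standalone и."""
    lowered = text.lower()
    if any(ch in RU_CHARS for ch in lowered):
        return "ru"
    words = re.findall(r"\w+", lowered)
    if RU_AND in words or any(word in RU_WORDS for word in words):
        return "ru"
    return "uk"


for sample in ["Спасибо.", "Дякую.", "Прошу передать слово", "Прошу передати."]:
    print(sample, classify_short(sample))
```

Under this rule an utterance like Да. still comes out as uk, so the two exceptions noted above would need either extra list entries or manual correction.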
P.S. Sorry about this delay. Urgent health issues in the family...
A number of ru words in the list above are given in their dictionary form. Can you include them as lemmas?
Do you want me to send you all the word forms of those words?
I need word forms because I have not processed the texts with UDPipe. It is not necessary to list all possible forms, just the common/expected ones. So if you can create this list, one headword per line with all its forms separated by spaces is ideal.
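For illustration, a list in that format can be loaded into one flat, case-insensitive set with a few lines. This is a sketch with an invented function name, not the loader used in lang-detect.pl.

```python
def load_word_forms(lines):
    """Collect every space-separated word form from all lines, lowercased."""
    forms = set()
    for line in lines:
        # Each line holds one headword followed by its forms; blank lines add nothing.
        forms.update(line.lower().split())
    return forms


ru_forms = load_word_forms([
    "коллега коллеги коллеге",
    "внимание внимания",
    "",
])
print(sorted(ru_forms))
```

Since lookup is by exact form, grouping the forms of one headword on a line is only a convenience for whoever maintains the list.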
Frequent words in ru that do not inflect (decline, conjugate) and therefore have only one form: (capital letter insensitive)
и (which stands alone!)
с (which stands alone and is not followed by a full stop!)
или
спасибо
пожалуйста
хорошо
конечно
дальше
согласно
только
что (as conjunction and adverb)
как
когда
еще
также
сразу
вот
Frequent ru words in the corpus which inflect (some of them in some of their forms are homonymous with ua), therefore I will list only those which are specific for ru and which are likely used in the corpus:
благодарить благодарю благодарите здравствуйте здравствуй говорить дать давать действовать действую добавить добавте договариваться есть занять закончить закончу заканчивайте надеяться надеюсь надеемся надейтесь настаивать начать передать подать подавать продолжать продолжает подготовиться подготовились подготовтесь подтвердить поддержать поддержите поддерживаем поддерживать предлагать предлагаю поставить применять применяю работать сказать тратить внимание внимания вниманием понимание понимания прощение прощения уважение уважением вопрос вопроса диалог диалоге диалога замечание замечания замечаний замечаниями чтение чтении сессия сессии партия партии регионов регионам фракция фракции коллега коллеги коллеге коллегам коллектив коллективу коллектива работа работу деятельность деятельности администрация администрации господин госпожа Председательствующий меня мне тебя тебе всех всем их им большой большое большая большие большим больше политический политическая главное главная главного главному последний последнее последнего уверен уверена благодарен благодарна нужен нужна нужно Александр Александру Евгений Евгению Михаил Михаилу Николай Николаю Юрий Юрию Сергей Сергею Иван Ивану Татьяна Татьяне Инна Инне Ирина Ирине Раиса Раисе Наталья Наталье Матвиенков Матвиенкову Балицкий Мариуполь Мелитополь Украина Украине
The verb form прошу is homonymous in ua and ru. When it stands alone or with a proper name (which can also be homonymous in ua and ru), I would identify it as ua by default (it will be accurate in probably over 90% of cases). In the frequent phrase:
прошу передати --- ua
прошу передать --- ru
Probably the most difficult case of homonymy in this corpus is the word да "yes", esp. when it stands alone. It can be used in both languages but is more common and has more meanings in Russian. Also, repetitions like да, да or да-да are more likely to be used in Russian.
Are different word forms of the same headword ok on one line or do you want me to do them one per line?
it is ok.
changes have been done so far: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...1a627dc9d2f2e7890c4882b1ba330c70ccd326c6
I use the word-based decision for ru only for short paragraphs: https://github.com/ufal/ParlaMint-UA/blob/baebfe15ed25ca0f3f1ca12f649d54f09e43719b/Scripts/lang-detect.pl#L129
I can use it for longer ones too, but I am not sure whether these words might occasionally be used in Ukrainian as well.
It looks like some words from my list were not recognized.
The following should be ru:
передать Александру https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-04-m0.xml#L299
всем начать и (which stands alone) https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-12-m2.xml#L127
передать коллеге https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m1.xml#L142
спасибо и (standing alone) только https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-09-04-m0.xml#L155
внимание вопрос еще Николаю и (standing alone) https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-09-m0.xml#L1189
And the following utterances should be labeled as ua:
https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-21-m1.xml#L298 In theory, the phrase above is written the same in ua and ru, but it is pronounced in ua (the speaker used the utterance before in ua; that speaker was a famous Ukr writer and never used ru in public settings; that chairman was likely to use ua in response to ua, esp. if he linked two ua speakers). But this is something the software does not know :)
All these phrases above do not meet the formal criteria for ru that we agreed on: any of the unique ru characters or a word from the list for short utterances. Why were they labeled as ru anyway?
Multiple mistakes in one of the most frequent formulaic phrases, which looks similar in ua and ru:
Прошу передати слово --- always ua
Прошу передать слово --- always ru
Прошу (standing alone) --- let it always be ua in these contexts (the word is polysemous and can be used in both ua and ru, but when it stands alone, it is commonly used by a chairperson for encouragement or permission, which is typical of ua)
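The three cases above are simple enough to state as code. A minimal sketch, with an invented function name; returning `None` when none of the cases applies is an assumption.

```python
import re


def classify_proshu(text):
    """Apply the three прошу rules; None means the rules do not apply."""
    words = re.findall(r"\w+", text.lower())
    if "прошу" not in words:
        return None
    if "передать" in words:   # ru infinitive
        return "ru"
    if "передати" in words:   # ua infinitive
        return "uk"
    if words == ["прошу"]:    # standing alone: ua by default
        return "uk"
    return None


for sample in ["Прошу передати слово", "Прошу передать слово", "Прошу."]:
    print(sample, classify_proshu(sample))
```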
< All these phrases above do not meet the formal criteria for ru that we agreed on: any of the unique ru characters or a word from the list for short utterances. Why were they labeled as ru anyway? >
short (<=50) utterances are fixed https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-21-m1.xml#L298
current state: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...76fa47f7344b8f64a28947e01df0ce291141974a
Should be in ru (contain words from the list):
передать Александру https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-04-m0.xml#L299
и (standing alone) начать спасибо https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-12-m2.xml#L127
передать коллеге https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m1.xml#L142
Should be in ua (do not contain ru characters or words from the list):
Should be in ru (these words are not on the list yet)
спасибі --- always ua
спасибо --- always ru https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-22-m0.xml#L822
This needs to be fixed somehow. The utterance below is in ua. Chairman Stefanchuk is a ua native speaker and he sticks to ua all the time, but he may use some individual words that are the norm in ru and not in ua: дальше is a case in point.
What is technically more feasible: to remove дальше from the list of ru words or fix this utterance manually?
Term 9
The following are ua, no grounds to label them otherwise:
Spelling mistakes in the original file that cause identification problems:
Should be ru:
The algorithm is this:
uk/ru
uk
ru
uk
uk
or ru
then uk
is used)so:
uk
word: Прошу
https://github.com/ufal/ParlaMint-UA/blob/d3628c162176185dd52c2c4975293e765d87adc6/Scripts/lang-detect.pl#L29
https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-04-m0.xml#L299
probably the list of Ukrainian words needs to be reviewed too: https://github.com/ufal/ParlaMint-UA/blob/d3628c162176185dd52c2c4975293e765d87adc6/Scripts/lang-detect.pl#L22-L37
ru utterances are sometimes labeled as uk, e.g. <seg xml:id="ParlaMint-UA_2012-12-04-m0.u46.p1" xml:lang="uk">Это не внешнее, это облуживание долга. Это процентные выплаты....</seg>
Not sure if it is better to comment on them right in the tei files or list them separately.