ufal / ParlaMint-UA

Tools and samples of Ukrainian parliamentary proceedings encoded in ParlaMint format
https://ufal.github.io/ParlaMint-UA/

ua / ru language identification #47

Closed AnnaParla closed 1 year ago

AnnaParla commented 1 year ago

ru utterances are sometimes labeled as uk, e.g. seg xml:id="ParlaMint-UA_2012-12-04-m0.u46.p1" xml:lang="uk" Это не внешнее, это облуживание долга. Это процентные выплаты....

Not sure if it is better to comment on them right in the tei files or list them separately.

matyaskopp commented 1 year ago

Yes, and it can happen more often than the reverse (uk utterances labeled as ru). We can use this issue to record such cases (please use permanent links).


https://github.com/ufal/ParlaMint-UA/blob/d58289b7f42621bd5a052ab8127dd17fece12911/ParlaMint-UA.TEI/2012/ParlaMint-UA_2012-12-04-m0.xml#L565-L569 should be xml:lang="ru"

AnnaParla commented 1 year ago

<(please use permanent links)>

Trying to... What is the most effective way of extracting permanent links from this page (group of files)? https://github.com/ufal/ParlaMint-UA/commit/d58289b7f42621bd5a052ab8127dd17fece12911#diff-1e06b60ff00a76857f9a4f4a4aaf9a524d0773038051b38e35d7e6e87f506193R338 I do not see the three dots on the left here.

Or shall I do it from a different page?

matyaskopp commented 1 year ago

<(please use permanent links)>

Trying to... What is the most effective way of extracting permanent links from this page (group of files)? d58289b#diff-1e06b60ff00a76857f9a4f4a4aaf9a524d0773038051b38e35d7e6e87f506193R338

browse data, not diff: https://github.com/ufal/ParlaMint-UA/tree/data/ParlaMint-UA.TEI then

AnnaParla commented 1 year ago

Ok, got it

matyaskopp commented 1 year ago

Checking some of the samples you sent, it seems that even the chairman uses ru. I set the chairman's speeches to uk when they are short, because I thought the chairman uses uk... https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L95-L97

You can stop listing wrong language identifications manually; this is probably enough for testing.

AnnaParla commented 1 year ago

< Can you list some common words that distinguish uk from ru? I can probably add it into script >

The letter ы for ru: it is very frequent and never used in ua.
The letter і for ua: it is very frequent and never used in ru.

But there are some short utterances that do not have these distinctive letters. E.g. спасибо / Спасибо in ru for "thank you"
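
A minimal Python sketch of the letter check described above, assuming only the two letters named here (ы for ru, і for ua). The function name `guess_by_letters` is made up for illustration and is not taken from Scripts/lang-detect.pl. It also shows why a short utterance such as Спасибо stays undecided:

```python
# Distinctive letters named in this comment (written as escapes to avoid
# confusing Cyrillic and Latin look-alikes).
RU_LETTER = "\u044b"  # Cyrillic ы
UK_LETTER = "\u0456"  # Cyrillic і


def guess_by_letters(text):
    """Return 'ru', 'uk', or None when the letters alone cannot decide."""
    lowered = text.lower()
    has_ru = RU_LETTER in lowered
    has_uk = UK_LETTER in lowered
    if has_ru and not has_uk:
        return "ru"
    if has_uk and not has_ru:
        return "uk"
    return None  # no distinctive letter, or both present (mixed speech)


print(guess_by_letters("Это процентные выплаты"))  # ru
print(guess_by_letters("Шановні колеги"))
print(guess_by_letters("Спасибо"))                 # None
```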

AnnaParla commented 1 year ago

< I thought chairman uses uk... >

Commonly he does. But in 2012-2014 Rada, Rybak could switch to ru for a sentence or two, or for even a few words in a sentence (we do not touch those cases).

AnnaParla commented 1 year ago

Question: this communist leader switches languages multiple times and says most of the last two sentences in ru, but over 50% of the speech is in ua. We should ignore such cases, right?

https://github.com/ufal/ParlaMint-UA/blob/d58289b7f42621bd5a052ab8127dd17fece12911/ParlaMint-UA.TEI/2013/ParlaMint-UA_2013-03-21-m0.xml#L505

https://github.com/ufal/ParlaMint-UA/blob/d58289b7f42621bd5a052ab8127dd17fece12911/ParlaMint-UA.TEI/2013/ParlaMint-UA_2013-03-21-m0.xml#L508

And here is another example of horrible mixture:

https://github.com/ufal/ParlaMint-UA/blob/d58289b7f42621bd5a052ab8127dd17fece12911/ParlaMint-UA.TEI/2013/ParlaMint-UA_2013-03-21-m0.xml#L733-L735

AnnaParla commented 1 year ago

The ы / i distinction is useful, but not in cases when there are spelling mistakes like here. The whole utterance is in ru.

matyaskopp commented 1 year ago

OK, so I suggest making these changes:

Remove these lines, i.e. stop forcing short chairman speeches to uk: https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L95-L97

Count the language-specific characters and let the more frequent set determine the language, which tolerates a small number of typos. Currently the script first assumes uk and then looks for ru characters: https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L168-L172
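
The counting idea could look roughly like this in Python. This is only a sketch, not the code in Scripts/lang-detect.pl; the letter sets and the name `guess_by_majority` are assumptions made for illustration.

```python
# Assumed language-specific letters (illustrative, not copied from the script).
RU_CHARS = set("\u044b\u044d\u044a")        # ы э ъ
UK_CHARS = set("\u0456\u0457\u0454\u0491")  # і ї є ґ


def guess_by_majority(text):
    """Pick the language whose specific letters occur more often."""
    lowered = text.lower()
    ru = sum(ch in RU_CHARS for ch in lowered)
    uk = sum(ch in UK_CHARS for ch in lowered)
    if ru > uk:
        return "ru"
    if uk > ru:
        return "uk"
    return None  # tie or no specific letters: leave it to later steps


# One stray Ukrainian letter (a typo) no longer flips a Russian sentence.
print(guess_by_majority("Это процентные выплат\u0456"))  # ru
```

Compared with "assume uk, then look for ru characters", a single misspelled letter is outvoted instead of deciding the label.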

AnnaParla commented 1 year ago

Ok, let's try this and see the result. Will you redo the whole corpus this way?

matyaskopp commented 1 year ago

The result of removing these lines https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L95-L97 is here: 101be2e

It does not look like a good idea: there are too many changes, and a lot of recent chairman speeches are tagged as ru too...

AnnaParla commented 1 year ago

Well, some problems were fixed, others were created. I don't mind fixing some portions manually, but I cannot edit this version https://github.com/ufal/ParlaMint-UA/commit/101be2ed72c4bbcf714ba7c51bed63bbd72aada6

Do you want me to edit a few sittings so that you can decide whether these changes have caused more harm or provided more solutions?

AnnaParla commented 1 year ago

Is there a way to decline or accept changes in this format? All changes should be declined in:

All changes should be accepted in:

Mixed results: 03t-tei-text-lang/2012/ParlaMint-UA_2012-12-13-m0.xml

AnnaParla commented 1 year ago

Most short utterances by the chairmen in https://github.com/ufal/ParlaMint-UA/commit/101be2ed72c4bbcf714ba7c51bed63bbd72aada6 are in Ukrainian, maybe 5% are in ru (if I eyeball it accurately)

matyaskopp commented 1 year ago

Count the language-specific characters and let the more frequent set determine the language, which tolerates a small number of typos. Currently the script first assumes uk and then looks for ru characters:

https://github.com/ufal/ParlaMint-UA/blob/cc16eadb6c3b62b472f9640f018c218ee94ebf30/Scripts/lang-detect.pl#L168-L172

implemented in cf92cf7


But there are still misclassifications.

list of changes done so far: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...cf92cf79fd5e982f2b05685017cd3a8ea0f960ac

matyaskopp commented 1 year ago

The most common paragraphs labeled as Russian that don't contain ы/э/ъ:

find Data/tei-text-lang/_ISSUE-47/ -type f| xargs cat| grep -F 'seg'|grep -F 'lang="ru"'|sed 's/.*"ru">//;s@</seg>@@'|sort|uniq -c|sort -n|grep -v '[ыэъ]'

CNT PARAGRAPH:

     16 Дайте 10 секунд.
     17 30 секунд завершити дайте, будь ласка.
     17 30 секунд. Завершуйте.
     17 30 секунд, завершуйте, будь ласка.
     17 Врахувати.
     17 Дайте завершити 10 секунд.
     18 15 секунд.
     18 Будь ласка, 30 секунд додайте.
     18 Кому?
     19 10 секунд дайте завершити.
     22 10 секунд, завершуйте, будь ласка.
     22 Добре.
     23 Будь ласка, завершуйте, 30 секунд.
     23 Врахована.
     24 10 секунд завершити.
     32 Князевич Руслан Петрович.
     35 10 секунд, будь ласка, завершуйте.
     35 Поляков Антон Едуардович.
     40 Будь ласка, 30 секунд. Завершуйте.
     41 Цимбалюк Михайло Михайлович.
     43 15 секунд завершити.
     48 Дайте 10 секунд завершити.
     49 Будь ласка, 10 секунд, завершуйте.
     52 30 секунд, будь ласка, завершуйте.
     53 Да.
     61 Спасибо.
     72 30 секунд, завершуйте.
     74 Так.
     85 10 секунд, завершуйте.
     89 30 секунд завершити.
     89 Дякую вам.
    122 Прошу.
    125 Будь ласка, 30 секунд, завершуйте.
    152 Прошу передати.
    188 Завершуйте.
    269 10 секунд.
    442 30 секунд.
    753 Дякую.
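
For anyone who prefers not to chain grep/sed, the same count can be sketched in Python. The sample seg lines below are invented for illustration and `common_ru_paragraphs` is a hypothetical helper name. Like the shell pipeline, it works on raw lines rather than parsing the XML.

```python
import re
from collections import Counter


def common_ru_paragraphs(lines):
    """Count ru-labeled seg texts, dropping those that contain ы/э/ъ."""
    counts = Counter()
    for line in lines:
        if "seg" not in line or 'lang="ru"' not in line:
            continue
        # Same trimming as the sed expression: cut everything up to "ru">
        # and remove the first closing tag.
        text = re.sub(r'.*"ru">', "", line.rstrip("\n"))
        text = text.replace("</seg>", "", 1)
        counts[text] += 1
    rows = [(n, t) for t, n in counts.items() if not re.search("[ыэъ]", t)]
    return sorted(rows)  # ascending by count, like sort -n


sample = [
    '<seg xml:id="x.p1" xml:lang="ru">Дякую.</seg>',
    '<seg xml:id="x.p2" xml:lang="ru">Дякую.</seg>',
    '<seg xml:id="x.p3" xml:lang="ru">Спасибо.</seg>',
    '<seg xml:id="x.p4" xml:lang="ru">Вы правы.</seg>',
    '<seg xml:id="x.p5" xml:lang="uk">Прошу.</seg>',
]
for count, text in common_ru_paragraphs(sample):
    print(f"{count:7d} {text}")
```
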
AnnaParla commented 1 year ago

In https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...cf92cf79fd5e982f2b05685017cd3a8ea0f960ac all long utterances look ok, but about 90%+ of short utterances by the Chairman should be in Ukrainian!

Words as follows are all in Ukrainian:

дякую дякую вам прошу прошу, [followed by "PER name"] прошу передати я вас прошу завершуйте дайте завершити хвилину прошу, народний депутат [followed by "PER name"] продовжуйте запрошую до слова побратим Сиротюк виступайте будь ласка добре врахована яку? яка ... пройшли вже оце оприлюдню це брехня я проводив ну,так, нема, ну, так який номер? давайте я сприймаю критику ... and names of MPs standing alone (so far I came across 1 case where chairman Rybak clearly gives the floor to a ru speaking MP in ru)

AnnaParla commented 1 year ago

< the most common Russian paragraphs that dont contain ы/э/ъ >

Because most of them are in Ukrainian!!!

matyaskopp commented 1 year ago

Because most of them are in Ukrainian!!!

Yes, but I am not able to mark them as Ukrainian because they also don't contain і/ї/є/ґ. The automatic language identification result is Russian: they are too short to determine, and there is no other context in the whole utterance.

Words as follows are all in Ukrainian:

дякую дякую вам прошу прошу, [followed by "PER name"] прошу передати я вас прошу завершуйте дайте завершити хвилину прошу, народний депутат [followed by "PER name"] продовжуйте запрошую до слова побратим Сиротюк виступайте будь ласка and names of MPs standing alone (so far I came across 1 case where chairman Rybak clearly gives the floor to a ru speaking MP in ru)

At the time of language identification I don't have information about named entities.

I don't want to have a list of phrases. I want a safe list of word forms that positively identify the Ukrainian language. In other words, if any word form from the list appears in the text, then the text is Ukrainian. You can also provide me with words that identify Russian.

My idea is to do identification in this order:

  1. check characters
  2. check words
  3. language identification with Lingua::Identify::Any

If this is not enough, I can also remove all words that start with a capital letter and are not preceded by a full stop, i.e. remove proper names.
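
A rough Python sketch of that last idea, assuming whitespace tokenization and treating only a full stop as a sentence boundary, as stated above. `strip_probable_names` is a hypothetical name, not a function from the script.

```python
def strip_probable_names(text):
    """Drop capitalized words that do not start a sentence."""
    kept = []
    previous = None
    for token in text.split():
        capitalized = token[:1].isupper()
        # The first token, or a token right after a full stop, may be
        # capitalized only because it starts a sentence, so it is kept.
        sentence_start = previous is None or previous.endswith(".")
        if not capitalized or sentence_start:
            kept.append(token)
        previous = token
    return " ".join(kept)


print(strip_probable_names("Будь ласка, народний депутат Долженков."))
# Будь ласка, народний депутат
print(strip_probable_names("Передаю слово Олегу Ляшку."))
# Передаю слово
```

One limitation of this heuristic: a name that opens the utterance (e.g. a surname standing first) is kept, because it looks like a sentence start.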

matyaskopp commented 1 year ago

Added the uk word list: https://github.com/ufal/ParlaMint-UA/blob/e85e39f6edd83580cf1ccb9fc33f345a6190baaa/Scripts/lang-detect.pl#L23-L35

Current state: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...e53df326295bd150f398df43f2969813b6e330c2

The most common paragraphs labeled as Russian that don't contain ы/э/ъ:

      4 2 хвилини.
      4 3-я.
      4 Артур Герасимов.
      4 Будь ласка, Артур Герасимов.
      4 Все.
      4 Доброго дня!
      4 Кишкар Павло Миколайович.
      4 Коментуйте.
      4 Крулько.
      4 Лерос Гео Багратович.
      4 Номер?
      4 Номер поправки?
      4 Олег Березюк.
      4 Передайте.
      4 Передаю слово Олегу Ляшку.
      4 Руслан Петрович Князевич.
      4 Цимбалюк.
      4 Я завершив.
      4 Який номер?
      4 Я не почув, кому?
      5 Вибачте.
      5 Вона врахована.
      5 Враховано.
      5 Геращенко.
      5 Матвиенков, 57 округ, Мариуполь.
      5 Михайло Головко.
      5 Руслан Князевич.
      6 Будь ласка, народний депутат Долженков.
      6 Будь ласка, передайте.
      6 Князевич.
      6 Олена Сотник, будь ласка.
      6 Правильно.
      6 Так, так.
      7 Героям слава!
      7 Синютка Олег Михайлович.
      8 8-а.
      8 Гео Багратович Лерос.
      8 Передайте, будь ласка.
      8 Так-так.
      9 Ще раз?
     10 Лаба Михайло Михайлович.
     10 Народний депутат Долженков.
     11 Дайте договорити.
     11 Княжицький.
     11 Ще раз.
     12 Величкович Микола Романович.
     12 Гетманцев Данило Олександрович.
     12 Ще раз, кому?
     17 Врахувати.
     18 Кому?
     23 Врахована.
     30 Поляков Антон Едуардович.
     31 Князевич Руслан Петрович.
     41 Цимбалюк Михайло Михайлович.
     52 Да.
     74 Так.
matyaskopp commented 1 year ago

@AnnaParla I would like to regenerate the whole corpus, but I need to know if the language detection is ok

AnnaParla commented 1 year ago

All of the short utterances under <the most common Russian paragraphs that dont contain ы/э/ъ> are in Ukrainian except for two:

5 Матвиенков, 57 округ, Мариуполь. --- ru
52 Да. --- ru

AnnaParla commented 1 year ago

Maybe at this point it makes sense to create the following dependency:

All short utterances that do not contain ы/э/ъ are in Ukrainian unless they contain one or more of the following words:

спасибо
благодарю
пожалуйста
передать
начать
продолжать (продолжает)
закончить (заканчивайте)
настаивать
подготовиться
действовать (действую)
добавить (добавте)
занять
подтвердить
внимание
коллега (коллеги, коллеге)
вопрос
Александр
дальше
согласно
только
большой (большое, большая)

The same applies if и stands alone: и means "and" in ru (although spelling mistakes are possible).

P.S. Sorry about this delay. Urgent health issues in the family...

AnnaParla commented 1 year ago

A number of ru words in the list above are given in their dictionary form. Can you include them as lemmas?

Do you want me to send you all the word forms of those words?

matyaskopp commented 1 year ago

A number of ru words in the list above are given in their dictionary form. Can you include them as lemmas?

Do you want me to send you all the word forms of those words?

I need word forms because I haven't processed the texts with udpipe at this stage. It is not necessary to list all possible forms, just the common/expected ones. If you can create this list, one word per line with all its forms separated by spaces is ideal.
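
To illustrate the requested format, here is a small Python sketch that reads such a list (one headword per line, forms separated by spaces) into a flat set of word forms and checks an utterance against it. The helper names and the three sample lines are assumptions for illustration; the actual script is written in Perl and may store the list differently.

```python
import re


def load_word_forms(text):
    """Flatten 'one headword per line, forms separated by spaces'."""
    forms = set()
    for line in text.splitlines():
        forms.update(form.lower() for form in line.split())
    return forms


def contains_listed_form(utterance, forms):
    """True if any token of the utterance is one of the listed forms."""
    tokens = re.findall(r"\w+", utterance.lower())
    return any(token in forms for token in tokens)


RU_LIST = """\
спасибо
продолжать продолжает
коллега коллеги коллеге
"""

ru_forms = load_word_forms(RU_LIST)
print(contains_listed_form("Прошу передать слово коллеге.", ru_forms))  # True
print(contains_listed_form("Дякую вам.", ru_forms))                     # False
```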

AnnaParla commented 1 year ago

Frequent words in ru that do not inflect (decline, conjugate) and therefore have only one form: (capital letter insensitive)

и (which stands alone!)
с (which stands alone and is not followed by a full stop!)
или
спасибо
пожалуйста
хорошо
конечно
дальше
согласно
только
что (as conjunction and adverb)
как
когда
еще
также
сразу
вот

Frequent ru words in the corpus which inflect (some of them in some of their forms are homonymous with ua), therefore I will list only those which are specific for ru and which are likely used in the corpus:

благодарить благодарю благодарите
здравствуйте здравствуй
говорить
дать
давать
действовать действую
добавить добавте
договариваться
есть
занять
закончить закончу заканчивайте
надеяться надеюсь надеемся надейтесь
настаивать
начать
передать
подать
подавать
продолжать продолжает
подготовиться подготовились подготовтесь
подтвердить
поддержать поддержите поддерживаем поддерживать
предлагать предлагаю
поставить
применять применяю
работать
сказать
тратить
внимание внимания вниманием
понимание понимания
прощение прощения
уважение уважением
вопрос вопроса
диалог диалоге диалога
замечание замечания замечаний замечаниями
чтение чтении
сессия сессии
партия партии
регионов регионам
фракция фракции
коллега коллеги коллеге коллегам
коллектив коллективу коллектива
работа работу
деятельность деятельности
администрация администрации
господин
госпожа
Председательствующий
меня мне
тебя тебе
всех всем
их им
большой большое большая большие большим больше
политический политическая
главное главная главного главному
последний последнее последнего
уверен уверена
благодарен благодарна
нужен нужна нужно
Александр Александру
Евгений Евгению
Михаил Михаилу
Николай Николаю
Юрий Юрию
Сергей Сергею
Иван Ивану
Татьяна Татьяне
Инна Инне
Ирина Ирине
Раиса Раисе
Наталья Наталье
Матвиенков Матвиенкову
Балицкий
Мариуполь
Мелитополь
Украина Украине

The verb form прошу is homonymous in ua and ru. When it stands alone or with a proper name (which can also be homonymous in ua and ru), I would identify it as ua by default (this will probably be accurate in over 90% of cases). In the frequent phrase:

прошу передати --- ua
прошу передать --- ru

Probably the most difficult case of homonymy in this corpus is the word да "yes", esp. when it stands alone. It can be used in both languages but is more common and has more meanings in Russian. Also, repetitions like да, да or да-да are more likely to be used in Russian.
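
The two "stands alone" conditions from the list above (и as a separate word; с as a separate word that is not followed by a full stop) can be expressed with a regular expression. A hedged Python sketch, with `has_standalone_ru_marker` as a made-up name:

```python
import re

RU_I = "\u0438"  # Cyrillic и
RU_S = "\u0441"  # Cyrillic с

# и as a separate word: no word character directly before or after it.
STANDALONE_I = re.compile(rf"(?<!\w){RU_I}(?!\w)", re.IGNORECASE)
# с as a separate word that is not directly followed by a full stop.
STANDALONE_S = re.compile(rf"(?<!\w){RU_S}(?![\w.])", re.IGNORECASE)


def has_standalone_ru_marker(text):
    """True if the text contains a standalone и or a standalone с."""
    return bool(STANDALONE_I.search(text) or STANDALONE_S.search(text))


print(has_standalone_ru_marker("Спасибо и только"))  # True
print(has_standalone_ru_marker("Дякую, колеги"))     # False
print(has_standalone_ru_marker("вопрос с места"))    # True
```

A letter that is only part of a longer word (as in колеги) does not match, because the lookarounds require non-word characters on both sides.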

AnnaParla commented 1 year ago

Are different word forms of the same headword ok on one line or do you want me to do them one per line?

matyaskopp commented 1 year ago

Are different word forms of the same headword ok on one line or do you want me to do them one per line?

it is ok.


changes have been done so far: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...1a627dc9d2f2e7890c4882b1ba330c70ccd326c6

matyaskopp commented 1 year ago

I apply the word-based decision for ru only to short paragraphs: https://github.com/ufal/ParlaMint-UA/blob/baebfe15ed25ca0f3f1ca12f649d54f09e43719b/Scripts/lang-detect.pl#L129

I can use it for longer ones too, but I am not sure whether these words might occasionally be used in Ukrainian as well.

AnnaParla commented 1 year ago

Are different word forms of the same headword ok on one line or do you want me to do them one per line?

it is ok.

changes have been done so far: 44d52f5...1a627dc

It looks like some words from my list were not recognized.

The following should be ru:

передать Александру https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-04-m0.xml#L299

всем начать и (which stands alone) https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-12-m2.xml#L127

передать коллеге https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m1.xml#L142

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-24-m1.xml#L545

передать https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m1.xml#L405

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-22-m0.xml#L831

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-25-m0.xml#L665

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-25-m0.xml#L757

спасибо https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-22-m0.xml#L822

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-05-22-m0.xml#L256

спасибо и (standing alone) только https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-09-04-m0.xml#L155

вопрос спасибо https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-06-18-m0.xml#L920

внимание https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-05-14-m1.xml#L272

внимание вопрос еще Николаю и (standing alone) https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-09-m0.xml#L1189

чтении спасибо https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-10-m0.xml#L1258

прощения https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-10-25-m0.xml#L890

AnnaParla commented 1 year ago

And the following utterances should be labeled as ua:

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-12-m2.xml#L288

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-04-04-m0.xml#L1073

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-05-23-m1.xml#L160

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-05-14-m0.xml#L769

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-06-18-m0.xml#L1015

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-06-20-m0.xml#L996

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-07-04-m0.xml#L941

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-07-04-m0.xml#L1325

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-09-03-m0.xml#L569

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-11-06-m0.xml#L1318

https://github.com/ufal/ParlaMint-UA/blob/1a627dc9d2f2e7890c4882b1ba330c70ccd326c6/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-21-m1.xml#L298 In theory, the phrase above is written the same in ua and ru, but it is pronounced in ua (the speaker used the utterance before in ua; that speaker was a famous Ukr writer and never used ru in public settings; that chairman was likely to use ua in response to ua, esp. if he linked two ua speakers). But this is something the software does not know :)

All these phrases above do not meet the formal criteria for ru that we agreed on: any of the unique ru characters or a word from the list for short utterances. Why were they labeled as ru anyway?

AnnaParla commented 1 year ago

Multiple mistakes in one of the most frequent formulaic phrases, which looks similar in ua and ru:

Прошу передати слово --- always ua

Прошу передать слово --- always ru

Прошу (standing alone) --- let it always be ua in these contexts (the word is polysemous and can be used in both ua and ru, but when it stands alone, it is commonly used by a chairperson for encouragement or permission, which is typical of ua)

matyaskopp commented 1 year ago

All these phrases above do not meet the formal criteria for ru that we agreed on: any of the unique ru characters or a word from the list for short utterances. Why were they labeled as ru anyway?

short (<=50) utterances are fixed https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-21-m1.xml#L298

current state: https://github.com/ufal/ParlaMint-UA/compare/44d52f54af90df2cece0a73239e5fa8b80bc49df...76fa47f7344b8f64a28947e01df0ce291141974a

AnnaParla commented 1 year ago

Should be in ru (contain words from the list):

передать Александру https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-04-m0.xml#L299

и (standing alone) начать спасибо https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-12-m2.xml#L127

передать коллеге https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m1.xml#L142

передать https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-18-m0.xml#L402

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m1.xml#L405

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-04-16-m1.xml#L811

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-04-18-m0.xml#L619

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-01-11-m0.xml#L857

Contains ы https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-19-m0.xml#L461

AnnaParla commented 1 year ago

Should be in ua (do not contain ru characters or words from the list):

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-12-m2.xml#L288

AnnaParla commented 1 year ago

Should be in ru (these words are not on the list yet)

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-22-m0.xml#L524

AnnaParla commented 1 year ago

спасибі --- always ua

спасибо --- always ru https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2013/ParlaMint-UA_2013-03-22-m0.xml#L822

AnnaParla commented 1 year ago

This needs to be fixed somehow. The utterance below is in ua. Chairman Stefanchuk is a ua native speaker and he sticks to ua all the time, but he may use some individual words that are the norm in ru and not in ua: дальше is a case in point.

What is technically more feasible: to remove дальше from the list of ru words or fix this utterance manually?

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2022/ParlaMint-UA_2022-12-13-m0.xml#L1545

AnnaParla commented 1 year ago

Term 9

The following are ua, no grounds to label them otherwise:

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2020/ParlaMint-UA_2020-03-06-m0.xml#L1348

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2020/ParlaMint-UA_2020-05-20-m0.xml#L1051-L1052

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2020/ParlaMint-UA_2020-07-02-m0.xml#L1060

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2020/ParlaMint-UA_2020-07-02-m0.xml#L204

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2021/ParlaMint-UA_2021-04-27-m2.xml#L938

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2021/ParlaMint-UA_2021-06-16-m0.xml#L1433

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2021/ParlaMint-UA_2021-11-04-m0.xml#L2571

Spelling mistakes in the original file that cause identification problems:

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2020/ParlaMint-UA_2020-03-04-m1.xml#L1767

Should be ru:

https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2020/ParlaMint-UA_2020-03-03-m0.xml#L1129

matyaskopp commented 1 year ago

The algorithm is this:

  1. Count all language-specific characters; the more frequent set determines the language (uk/ru).
  2. If the text contains a word from the Ukrainian list, then uk.
  3. If the text contains a word from the Russian list, then ru.
  4. If the text is shorter than 50 characters, then uk.
  5. If none of the above applies, an external Perl library determines the language (if the result is neither uk nor ru, uk is used).
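
The five steps can be sketched in Python as follows. This is an illustration under assumptions, not a port of Scripts/lang-detect.pl: the letter sets and the tiny word lists are samples, a tie in step 1 is assumed to fall through to step 2, and the external identifier is passed in as a hypothetical `fallback` callable.

```python
import re

RU_CHARS = set("\u044b\u044d\u044a")        # ы э ъ
UK_CHARS = set("\u0456\u0457\u0454\u0491")  # і ї є ґ
UK_WORDS = {"прошу", "дякую"}               # sample entries only
RU_WORDS = {"спасибо", "передать"}          # sample entries only
SHORT_LIMIT = 50


def detect(text, fallback=None):
    """Return 'uk' or 'ru' following the five steps in order."""
    lowered = text.lower()
    # 1. The more frequent language-specific letters decide.
    ru = sum(ch in RU_CHARS for ch in lowered)
    uk = sum(ch in UK_CHARS for ch in lowered)
    if ru != uk:
        return "ru" if ru > uk else "uk"
    words = set(re.findall(r"\w+", lowered))
    # 2. and 3. Word lists, with the Ukrainian list checked first.
    if words & UK_WORDS:
        return "uk"
    if words & RU_WORDS:
        return "ru"
    # 4. Short texts default to uk.
    if len(text) < SHORT_LIMIT:
        return "uk"
    # 5. External identifier; anything other than uk/ru becomes uk.
    guess = fallback(text) if fallback else None
    return guess if guess in ("uk", "ru") else "uk"


print(detect("Спасибо."))                          # ru
print(detect("Прошу передать слово Александру."))  # uk
```

The second call shows the effect of the ordering: a made-up utterance containing both прошу (Ukrainian list) and передать (Russian list) is decided at step 2 and never reaches step 3.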

so:

uk word: Прошу https://github.com/ufal/ParlaMint-UA/blob/d3628c162176185dd52c2c4975293e765d87adc6/Scripts/lang-detect.pl#L29 https://github.com/ufal/ParlaMint-UA/blob/76fa47f7344b8f64a28947e01df0ce291141974a/03t-tei-text-lang/2012/ParlaMint-UA_2012-12-04-m0.xml#L299


probably the list of Ukrainian words needs to be reviewed too: https://github.com/ufal/ParlaMint-UA/blob/d3628c162176185dd52c2c4975293e765d87adc6/Scripts/lang-detect.pl#L22-L37