natalink / mwe_noske


should we make two separate attributes: type and head/child? #5

Closed natalink closed 6 years ago

natalink commented 6 years ago

From a reviewer:

Why not have head/child and MWE type in separate attributes? Is there any advantage to squishing them into one?

natalink commented 6 years ago

This raises the problem of how to encode, e.g., this:

['_', '1:LVC', '_', '1;2:LVC', '_', '_', '_', '_', '2', '_']
form lemma ....  LVC head
.........................................
form lemma ....  LVC;LVC child;head .....

This solution makes sense from a user's point of view -- no need to write a regex each time (vmwe="LVC.*") -- but it really sucks for cases like the one above.

Ansa211 commented 6 years ago

I'd go for type and head/child + additional attribute called lvc_id (taken from the parseme data). It would make no sense to search for a particular value of lvc_id; but it could be compared to lvc_id of other words in the same sentence: (meet -5 5 1:[lvc_dependency="head"] 2:[lvc_id=1.lvc_id]) within <s/>
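A rough Python sketch of what the `lvc_id` comparison is meant to express: within one sentence, pair every head token with the nearby tokens that share one of its ids. The attribute names `lvc_dependency` and `lvc_id` come from the comment above; the `word` key, the function name, the dict-based token representation, and the splitting of multi-valued attributes on `;` are assumptions made for illustration only. This is not how Manatee evaluates the query.

```python
def pair_heads_with_children(sentence, window=5):
    """Return (head_word, other_word, mwe_id) triples for one sentence.

    `sentence` is a list of dicts with the (hypothetical) keys
    "word", "lvc_dependency" and "lvc_id"; "_" marks unannotated tokens.
    """
    pairs = []
    for i, head in enumerate(sentence):
        # Ids for which this token plays the "head" role.
        head_ids = {
            mwe_id
            for mwe_id, role in zip(
                head["lvc_id"].split(";"), head["lvc_dependency"].split(";")
            )
            if role == "head"
        }
        if not head_ids:
            continue
        # Look at most `window` tokens to the left and to the right.
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            other_ids = set(sentence[j]["lvc_id"].split(";"))
            for mwe_id in sorted(head_ids & other_ids):
                pairs.append((head["word"], sentence[j]["word"], mwe_id))
    return pairs


def tok(word, dep="_", mwe_id="_"):
    """Small helper to build a token dict."""
    return {"word": word, "lvc_dependency": dep, "lvc_id": mwe_id}


sentence = [
    tok("They"), tok("were"), tok("letting", "head;head", "1;2"),
    tok("us"), tok("in", "child", "1"), tok("and"), tok("out", "child", "2"),
]
print(pair_heads_with_children(sentence))
```

With a separate id attribute, the head "letting" is linked to "in" and to "out" independently, which is the case the single squished attribute could not express.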

The left side is the original, on the right side is my suggestion of encoding:

1 They     they      _                 _         _           _    _
2 were     be        _                 _         _           _    _
3 letting  let       1:VPC;2:VPC       VPC;VPC  head;head    1;2  let in;let out
4 us       we        _                 _        _            _    _
5 in       in        1                 VPC      child        1    let in;let out
6 and      and       _                 _        _            _    _
7 out      out       2                 VPC      child        2    let in;let out

nešlo      jít       2:LVC             LVC      head         2    jít_o_vpadnutí_do_zad       
tedy       tedy      _                 _        _            _    _
o          o         2                 LVC      child        2    jít_o_vpadnutí_do_zad
žádné      žádný     _                 _        _            _    _
vpadnutí   vpadnutí  1:ID;2            ID;LVC   head;child   1;2  vpadnutí_do_zad;jít_o_vpadnutí_do_zad
do         do        1;2               ID;LVC   child;child  1;2  vpadnutí_do_zad;jít_o_vpadnutí_do_zad 
zad        záda      1;2               ID;LVC   child;child  1;2  vpadnutí_do_zad;jít_o_vpadnutí_do_zad 

I think this would solve the problem of nested or overlapping queries, wouldn't it?
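The conversion from the original column to the three proposed attributes could be sketched as follows. This is a minimal illustration, assuming (as in the tables above) that the token carrying `id:TYPE` is treated as the head and every token carrying a bare id as a child; the function name `split_mwe_codes` is made up for this example, and the lemma-form attribute is left out.

```python
def split_mwe_codes(codes):
    """Turn per-token codes such as "1:VPC;2:VPC", "1" or "_" into
    (type, head/child, id) triples, one triple per token."""
    # First pass: the category is written on only one token of each MWE,
    # so collect id -> category for the whole sentence.
    types = {}
    for code in codes:
        if code == "_":
            continue
        for part in code.split(";"):
            mwe_id, sep, mwe_type = part.partition(":")
            if sep:
                types[mwe_id] = mwe_type

    # Second pass: build the three attribute values for every token.
    rows = []
    for code in codes:
        if code == "_":
            rows.append(("_", "_", "_"))
            continue
        kinds, roles, ids = [], [], []
        for part in code.split(";"):
            mwe_id, sep, _ = part.partition(":")
            kinds.append(types.get(mwe_id, "_"))
            roles.append("head" if sep else "child")
            ids.append(mwe_id)
        rows.append((";".join(kinds), ";".join(roles), ";".join(ids)))
    return rows


english = ["_", "_", "1:VPC;2:VPC", "_", "1", "_", "2"]
for row in split_mwe_codes(english):
    print("\t".join(row))
```

Run on the codes of the Czech example (`2:LVC`, `_`, `2`, `_`, `1:ID;2`, `1;2`, `1;2`), the same function yields the type, head/child and id values shown in the right-hand columns above.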

languagerecipes commented 6 years ago

I think it is worth thinking about API-based usage, where Manatee is called to return results for software that uses the annotations. There is no best solution; what matters most is to have a use case and an example of how the annotations are used. Lastly, I prefer a simpler structure: if the schema is too complicated, it will lose its applicability and user-friendliness. We don't want to write two full lines of CQL just to make a concordance for a VMWE, do we?! :)

(Sent by email in reply to Anša Vernerová's comment of Friday, December 8, 2017, above.)

natalink commented 6 years ago

I started to change the code and added a Czech use case. Is it a mistake in CS/train.parsemetsv that the verb nešlo is not marked, or am I missing something?

16      nešlo   _       _
17      tedy    _       _
18      o       _       1:ID
19      žádné   _       _
20      vpadnutí        _       1;2:LVC
21      do      _       1;2
22      zad     _       1;2
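To see which tokens each annotated MWE in this excerpt covers, the rows can be grouped by id. A small sketch, assuming four whitespace-separated columns (rank, form, a third column that is not interpreted here, and the MWE code); `group_mwes` is a hypothetical helper name, not part of any existing tool.

```python
def group_mwes(lines):
    """Group token forms by MWE id for rows of a parsemetsv-like excerpt."""
    mwes = {}
    for line in lines:
        _rank, form, _third, code = line.split()
        if code == "_":
            continue
        for part in code.split(";"):
            mwe_id, sep, mwe_type = part.partition(":")
            entry = mwes.setdefault(mwe_id, {"type": None, "forms": []})
            if sep:
                # The category appears only on one token of the MWE.
                entry["type"] = mwe_type
            entry["forms"].append(form)
    return mwes


excerpt = [
    "16 nešlo _ _",
    "17 tedy _ _",
    "18 o _ 1:ID",
    "19 žádné _ _",
    "20 vpadnutí _ 1;2:LVC",
    "21 do _ 1;2",
    "22 zad _ 1;2",
]
for mwe_id, entry in group_mwes(excerpt).items():
    print(mwe_id, entry["type"], " ".join(entry["forms"]))
```

For this excerpt the grouping gives MWE 1 (ID) over "o vpadnutí do zad" and MWE 2 (LVC) over "vpadnutí do zad"; nešlo belongs to neither.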

@e-bej @Ansa211

natalink commented 6 years ago

Ok, I did it like this:

nešlo     jít       1  ...  LVC      head         1    jít o vpadnutí do záda
tedy      tedy      2  ...  _        _            _    _
o         o         3  ...  LVC      child        1    jít o vpadnutí do záda
žádné     žádný        ...  _        _            _    _
vpadnutí  vpadnutí     ...  LVC;LVC  child;head   1;2  jít o vpadnutí do záda;vpadnutí do záda
do        do           ...  LVC;LVC  child;child  1;2  jít o vpadnutí do záda;vpadnutí do záda
zad       záda         ...  LVC;LVC  child;child  1;2  jít o vpadnutí do záda;vpadnutí do záda

Ansa211 commented 6 years ago

I think it is a mistake; that's why I have it marked differently in my example above.

e-bej commented 6 years ago

Well, I don't think there is anything like "jít o vpadnutí do zad". I would say that there is only one MWE: "vpadnutí do zad". The other part, "jít o X", is purely a valency issue, not an MWE, and it can be combined with really anything (i.e. that "X" can be a word, an MWE, a sentence, or whatever).

That's my linguistic intuition. However, the data you've cited has even the word "o" marked as a part of something (something strange). Do you want me to find out how that happened?

Ansa211 commented 6 years ago

Well, the Czech data as a whole seem a bit odd to me. Besides "nešlo tedy o žádné vpadnutí do zad", which contains the LVC "vpadnutí do zad" (ok?) and the ID "o vpadnutí do zad" (?!), I am puzzled by, for example:

  • in the sentence "jaká je míra zamoření, kterou chceme nést a čeho se kdo musí vzdát", "se musí vzdát" is marked as IReflV (really including "musí", although e.g. in the sentence "musí v případě nouze jet vlastním autem" there is no MWE containing "musí")
  • in the sentence "každý stupeň nad 20 ° C znamená zvýšení spotřeby energie o 6 %", "zvýšení spotřeby" is marked as LVC, and similarly elsewhere "vyhlášení konkurzu", "zaplacení částky" (are these really verbal MWEs?)
  • in "by se daly spočítat na prstech jedné ruky", the word "se" is not part of the marked MWE, while all the other words are
  • "která by se stejně jako zvýšení či snížení platu řídila zhodnocením": why are two MWEs marked, namely "by se řídila" and "se řídila", both IReflV? This is just conjugation of the same unit; similarly e.g. "by se měl v letošním roce zvyšovat", but in "měli jsme možnost" the variant without "jsme" is not marked

Anša

(Sent by email in reply to e-bej's comment of 13 Dec, 2017, 15:48, above.)