Open vcvpaiva opened 7 years ago
We have 'waterski' as 'water_ski' in http://wnpt.brlcloud.com/wn/synset?id=01940248-v. It is an MWE. Google Translate renders 'The woman is waterskiing' as 'A mulher é esqui aquático' (literally, "The woman is water ski"), but our parser got it right here. My suggestion would be to add 'waterski' as a lemma in the synset and in the Freeling dictionary as well.
Wikipedia says https://en.wikipedia.org/wiki/Water_skiing (two words), while Google treats it as one word. What is the criterion for deciding whether something is a genuine new word or a typo in the corpus?
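One way to spot cases like 'waterski' vs. 'water_ski' automatically is to try every split point of the closed compound and check whether the underscore-joined form is already a known lemma. This is only a minimal sketch: `mwe_candidates` and `toy_lexicon` are hypothetical names, and the toy set stands in for the real list of wordnet lemmas.

```python
def mwe_candidates(word, lexicon):
    """Return lexicon entries obtained by splitting `word` in two with an underscore."""
    word = word.lower()
    found = []
    # Try every internal split point: "w_aterski", "wa_terski", ...
    for i in range(1, len(word)):
        candidate = word[:i] + "_" + word[i:]
        if candidate in lexicon:
            found.append(candidate)
    return found


# Toy lexicon (assumption); a real run would load the lemmas from the wordnet.
toy_lexicon = {"water_ski", "surf", "glide"}

print(mwe_candidates("waterski", toy_lexicon))
print(mwe_candidates("dirtbike", toy_lexicon))
```

A word that matches an existing MWE is a candidate for being added as an extra lemma of that synset; a word with no match still needs a manual decision between "new word" and "typo".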
'dirtbike' seems to be a specific type of bike. So, a new hyponym of http://wnpt.brlcloud.com/wn/synset?id=01935476-v? It also needs to go into the Freeling dictionary. We could also suggest a synset for the noun dirtbike with the gloss "a motorcycle designed for use on rough terrain, such as unsurfaced roads or tracks, and used especially in scrambling.".
The same goes for wakeboard: we have the noun http://wnpt.brlcloud.com/wn/synset?id=04544626-n, but the verb is missing, so this is another good synset suggestion. It would be a new hyponym of http://wnpt.brlcloud.com/wn/synset?id=01948077-v, a type of surf, or, higher up, a hyponym of http://wnpt.brlcloud.com/wn/synset?id=01887576-v (glide).
According to https://en.wikipedia.org/wiki/Kickboxing, kickboxing does not seem to be a particular type of https://en.wikipedia.org/wiki/Boxing; all they have in common is that both are martial arts. The noun deserves a new synset as a hyponym of http://wnpt.brlcloud.com/wn/synset?id=00825443-n or http://wnpt.brlcloud.com/wn/synset?id=00433458-n.
And the verb would be a new synset related (hyponym or classifiedBy) to http://wnpt.brlcloud.com/wn/synset?id=00523513-n
it may be a specific type of http://wnpt.brlcloud.com/wn/synset?id=01419982-v or
footbag also exists, see https://en.wikipedia.org/wiki/Footbag; it seems to mean both the object used and the name of the game. So we would have 2 possible synsets:
a hyponym of http://wnpt.brlcloud.com/wn/synset?id=00463246-n (game)
a hyponym of http://wnpt.brlcloud.com/wn/synset?id=02778669-n (ball)
@arademaker as I said in #7
there is some creative verb forming for sports:
a person dirtbikes along a muddy trail
two men are fistfighting in a ring
two men fistfight in a ring
A woman is wakeboarding on a lake
but footbag (a sock ball as an international sport, with a federation and Guinness records?) was news to me.
a different phenomenon is the verbs that we get only in the gerund:
30 standing
13 snowboarding
8 landing
7 biking
5 seasoning
5 kickboxing
4 waterskiing
4 leaning
4 burning
3 skateboarding
3 kayaking
3 deboning
2 wakeboarding
2 smoking
2 parking
2 painting
2 hunting
1 fistfighting
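A count like the one above could be produced along these lines. This is a rough sketch, not the script actually used: it assumes the parser output is available as (form, lemma, pos) triples with Penn-style verb tags, and `gerund_lemmas` is a hypothetical helper name.

```python
from collections import Counter


def gerund_lemmas(tokens):
    """Count verb tokens whose reported lemma still ends in -ing.

    `tokens` is an iterable of (form, lemma, pos) triples; this layout is
    an assumption about the parser output, not its real format.
    Note: true -ing lemmas such as "bring" or "sing" would also be counted.
    """
    counts = Counter()
    for form, lemma, pos in tokens:
        if pos.upper().startswith("V") and lemma.lower().endswith("ing"):
            counts[lemma.lower()] += 1
    return counts


# Made-up sample data for illustration only.
sample = [
    ("snowboarding", "snowboarding", "VBG"),
    ("snowboarding", "snowboarding", "VBG"),
    ("running", "run", "VBG"),
    ("kayaking", "kayaking", "VBG"),
    ("lake", "lake", "NN"),
]

for lemma, n in gerund_lemmas(sample).most_common():
    print(n, lemma)
```

Lemmas that survive this filter are either missing from the lemmatizer's dictionary or genuinely end in -ing, so the output still needs a manual pass.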
when it comes to only one occurrence we have some strange stuff: several typos (wlking, weas, somthing, sitring, shappens, retrives, persue, manuever, darked), but also more interesting compounds (bare-chested, brindle-colored, cross-legged), as well as bride, preteen, malnourished (not verbs in English) and unstitch, graffiti, graphite (not verbs in PWN).
Examining the non-verbs:
"preteen" in "Two twin preteen boys are playing with cards" has the right root, playing, and Freeling makes preteen an adjective; I don't know why Parsey says verb. (WSD gets twin as a bed, though.) In the other two sentences with preteen it gets adj and noun ("Two twin preteen kids are dueling with sticks").
the sentences where bride and malnourished were tagged as verbs were rewritten ("bride with a white veil looking down", "a dog appearing to be malnourished is standing on his hind legs about to jump")
unstitch(ing) is not a verb in PWN ("A woman is unstitching with a machine"); neither are graphitized or graffitied (after normalization: "A bicyclist is performing a trick over a heavily graphitized wall"; before: "a bicyclist performing a trick over a heavily graffitied wall")
In the commit c25931cccafd7c07d8baf076f137b19446bfc871 I added the files
https://github.com/own-pt/rte-sick/blob/master/expanded.words.txt
https://github.com/own-pt/rte-sick/blob/master/expanded.txt
We randomly selected a subset of sentence pairs from each of these sources and we applied a 3-step generation process: first, the original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were then expanded to obtain up to three new sentences with specific characteristics suitable to CDSM evaluation; as a last step, all the sentences generated in the expansion phase were paired with the normalized sentences in order to obtain the final data set.
The expanded sentences are the normalized/generated ones.
thanks @arademaker!!! this is very useful indeed! it confirms my heuristic that sentences that start with a capital letter were the rewritten ones, with one exception, a sentence that starts with a space " water from the faucet is being drunk by a yellow dog". also confirms two new words (meaning not in PWN) after the "normalization": Seadoo (personal seacraft) and ATV (all terrain vehicle)
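The capital-letter heuristic can be checked mechanically with something like the sketch below. `split_by_capital` is a hypothetical helper, and the example sentences are taken from this thread; it only groups sentences by their first character and leaves the interpretation of each group to the reader.

```python
def split_by_capital(sentences):
    """Group sentences by first character: capitalized, lowercase, leading space."""
    capitalized, lowercase, leading_space = [], [], []
    for s in sentences:
        first = s[:1]
        if first.isspace():
            # The exception noted above: a sentence that starts with a space.
            leading_space.append(s)
        elif first.isupper():
            capitalized.append(s)
        else:
            lowercase.append(s)
    return capitalized, lowercase, leading_space


examples = [
    "A woman is wakeboarding on a lake",
    "a person dirtbikes along a muddy trail",
    " water from the faucet is being drunk by a yellow dog",
]

caps, lows, odd = split_by_capital(examples)
print(len(caps), len(lows), len(odd))
```

Keeping the leading-space sentences in their own group makes exceptions like the one above easy to review by hand instead of silently misclassifying them.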
Some of the data above is superseded by the normalized sentences only (where some debatable lexical items were removed).
But we still have 10 cases of "snowboarding" as a verb infinitive, as well as the others below, which I guess are problems with Freeling lemmatization:
7 striped
6 biking
4 seasoning
4 breaded
3 waterskiing
3 skateboarding
3 pyramid-shaped
3 kickboxing
3 kayaking
3 burning
2 wakeboarding
2 upside-down
2 smoking
2 sleeved
2 mittened
2 leaning
2 landing
2 hunting
2 deboning
waterski: verb
dirtbike: verb
wakeboard: verb
kickbox: verb
footbag: noun
biker: noun
youngling: noun
shirtless: adjective