own-pt / sensetion.el

Emacs word-sense annotation interface
GNU General Public License v3.0

MWE annotation #149

Open alexandretessarollo opened 5 years ago

alexandretessarollo commented 5 years ago

There should be an easy way of targeting MWEs for globbing.

Currently, to glob, say, drilling mud, one must target either drilling or mud, manually search for co-occurrences, and glob them. However, drilling and mud are both very frequent words in the corpus, and they do not always occur together. Even so, there is a huge number of drilling mud occurrences to be globbed and annotated.

hmuniz commented 5 years ago

Could you be clearer about this issue? What exactly are you suggesting to improve the annotation of globs? Maybe KeyboardMacros would help you.

alexandretessarollo commented 5 years ago

It probably does the trick, but I'm not quite sure. Let me try to write some pseudo-code to help make my point.

1) call sensetion in targeted mode, looking for "drilling", PoS any
2) for each "drilling", check if the word on the right is "mud"
3) if 2 = true, then glob it with lemma "drilling mud", else do nothing
4) repeat 2 and 3 until the last occurrence of "drilling"
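
The steps above can be sketched in Python, purely as an illustration. The token list and the `find_bigram` helper are hypothetical and are not part of sensetion or its scripts; the sketch assumes the corpus is already tokenized and matches on lowercased surface forms.

```python
from typing import List, Tuple


def find_bigram(tokens: List[str], first: str, second: str) -> List[Tuple[int, int]]:
    """Return (i, i + 1) index pairs where `first` is immediately followed by `second`.

    Hypothetical helper: it only locates glob candidates, it does not glob anything.
    """
    hits = []
    for i in range(len(tokens) - 1):
        # Step 2 of the pseudo-code: check the word on the right.
        if tokens[i].lower() == first and tokens[i + 1].lower() == second:
            hits.append((i, i + 1))
    return hits


# Toy sentence (made up for illustration, not taken from the corpus).
tokens = "The drilling mud cools the bit while drilling ; mud pits store drilling mud".split()
print(find_bigram(tokens, "drilling", "mud"))
```

Only the adjacent pairs are reported; the "drilling" followed by ";" is skipped, which is the case that currently needs visual inspection.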

It is something that could be done manually, but "drilling" has 230 occurrences, "mud" has 95 occurrences and "drilling mud" shows up at least 16 times. Those are very high counts to rely on visual inspection. And that's just one MWE. With "drilling" alone we have "drilling bit", "drilling rig", "drilling column", etc.

Ideally, I should be able to provide a list of MWEs and Sensetion would automatically glob them within the corpus.
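
A minimal sketch of what working from a list could look like, assuming a tokenized corpus and a plain list of MWE lemmas. The `glob_mwes` function and its span format are hypothetical, not an existing sensetion feature; it prefers the longest MWE at each position and never produces overlapping spans.

```python
from typing import Dict, List, Sequence, Tuple


def glob_mwes(tokens: Sequence[str], mwes: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Return non-overlapping (start, end_exclusive, lemma) spans, longest MWE first."""
    # Index MWEs by their first word so each token needs a single lookup.
    by_first: Dict[str, List[List[str]]] = {}
    for mwe in mwes:
        words = mwe.lower().split()
        if not words:
            continue  # ignore empty entries in the list
        by_first.setdefault(words[0], []).append(words)
    for candidates in by_first.values():
        candidates.sort(key=len, reverse=True)  # try the longest candidate first

    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        for words in by_first.get(lowered[i], []):
            if lowered[i:i + len(words)] == words:
                spans.append((i, i + len(words), " ".join(words)))
                i += len(words) - 1  # skip the tokens that were just matched
                break
        i += 1
    return spans


# Toy sentence (made up for illustration).
tokens = "A drilling rig pumps drilling mud past the drilling bit".split()
print(glob_mwes(tokens, ["drilling mud", "drilling rig", "drilling bit"]))
```

The returned spans could then be reviewed by an annotator before any glob is actually written.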

arademaker commented 5 years ago

This is a very particular description. We need a more general approach for the tool. Moreover, I am not sure whether that should be a feature of sensetion itself or be implemented as scripts.

alexandretessarollo commented 5 years ago

Ok, maybe it should not work from a list. Still, when I create a glob that sensetion wasn't able to recognize in the first place, it should have a way to let me know and check the remaining occurrences of that glob in the text.
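
As a rough illustration of that check, here is a hedged Python sketch. The `unglobbed_occurrences` function and the set of already-globbed start positions are assumptions made for the example, not sensetion's data model.

```python
from typing import List, Sequence, Set


def unglobbed_occurrences(tokens: Sequence[str], lemma: str, globbed_starts: Set[int]) -> List[int]:
    """Return start indices where `lemma` occurs but no glob has been recorded yet."""
    words = lemma.lower().split()
    if not words:
        return []
    lowered = [t.lower() for t in tokens]
    return [
        i
        for i in range(len(lowered) - len(words) + 1)
        if lowered[i:i + len(words)] == words and i not in globbed_starts
    ]


# Toy text: the first occurrence (index 0) was already globbed by hand.
tokens = "drilling mud and more drilling mud".split()
print(unglobbed_occurrences(tokens, "drilling mud", {0}))
```

The remaining positions are only reported, so the annotator still decides whether each one should be globbed.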

arademaker commented 5 years ago

The @hmuniz suggestion of using Emacs macros is really cool. We can temporarily define a single key to execute a sequence of commands. So, in a buffer with K occurrences of drilling mud, we could record a sequence of commands such as:

  1. search for drilling mud
  2. mark both (m, m)
  3. press g
  4. write 'drilling mud'
  5. choose n

as a single key command. So instead of K*(approx. 13+2+1+12+1) keys, we would press only K keys. The important point is that this keeps the visual and manual inspection needed for quality control.
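
To make the arithmetic concrete, here is a small Python sketch. Mapping each number in the estimate to one of the five steps is an assumption, and K = 16 is the "at least 16" count of drilling mud reported earlier in this thread.

```python
# Approximate keystrokes per occurrence; the step labels are assumed, the counts are from the estimate.
MANUAL_KEYS = {
    "search for 'drilling mud'": 13,
    "mark both tokens (m, m)": 2,
    "press g": 1,
    "write 'drilling mud'": 12,
    "choose n": 1,
}


def keystrokes(k: int) -> tuple:
    """Return (manual, with_macro) keystroke totals for k occurrences."""
    manual = k * sum(MANUAL_KEYS.values())
    with_macro = k  # one key press per occurrence once the macro is recorded
    return manual, with_macro


print(keystrokes(16))  # (464, 16)
```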

odanoburu commented 5 years ago

Originally our idea was to have globbing be automatic, using the enrich.py script; manual input would be restricted to difficult cases not detected by the globbing mechanism, or to unglobbing wrong globs (which is much faster than globbing).

arademaker commented 5 years ago

Yep, that is the reason for my comment https://github.com/own-pt/sensetion.el/issues/149#issuecomment-520559820 above.