Open alexandretessarollo opened 5 years ago
Could you be more clear about this issue? What exactly are you suggesting to improve the annotation of globs? Maybe KeyboardMacros helps you.
It probably does the trick, but I'm not quite sure. Let me try to write some pseudo-code to help make my point.
1) call sensetion in targeted mode, looking for "drilling", any PoS
2) for each "drilling", check if the word on the right is "mud"
3) if step 2 is true, glob the pair with lemma "drilling mud"; otherwise do nothing
4) repeat steps 2 and 3 until the last occurrence of "drilling"
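The steps above can be sketched in Python. This is only an illustration of the scan, not sensetion code: the helper name `find_bigram` and the plain list-of-tokens representation are assumptions made for the example, and it matches surface forms rather than lemmas.

```python
def find_bigram(tokens, first, second):
    """Return every index i where tokens[i] is `first` and tokens[i + 1] is `second`."""
    hits = []
    # Stop one token early so that tokens[i + 1] always exists.
    for i, tok in enumerate(tokens[:-1]):
        if tok.lower() == first and tokens[i + 1].lower() == second:
            hits.append(i)
    return hits


# Hypothetical sentence, already split into tokens.
tokens = "The drilling mud cools the drilling bit while drilling mud circulates".split()
hits = find_bigram(tokens, "drilling", "mud")
print(hits)
```

Each returned index is a candidate position for a "drilling mud" glob; the "drilling" followed by "bit" is skipped because the word on its right is not "mud".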
It is something that could be done manually, but "drilling" has 230 occurrences, "mud" has 95 occurrences and "drilling mud" shows up at least 16 times. Those counts are too high to rely on visual inspection. And that's just one MWE. With "drilling" alone we have "drilling bit", "drilling rig", "drilling column", etc.
Ideally, I should be able to provide a list of MWEs and Sensetion would automatically glob them within the corpus.
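A rough sketch of what working from a list could look like, written as a standalone Python function. Everything here is hypothetical: `glob_mwes`, the token list, and the (start, end, lemma) span format are assumptions for illustration and do not reflect how sensetion or enrich.py store globs. Longer MWEs are tried first so that a longer expression wins over a shorter one that starts at the same token.

```python
def glob_mwes(tokens, mwes):
    """Return (start, end, lemma) spans for each MWE found, preferring longer matches."""
    # Turn each MWE into a tuple of lowercase words; skip empty entries.
    patterns = sorted(
        (tuple(m.lower().split()) for m in mwes if m.split()),
        key=len,
        reverse=True,
    )
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        for pat in patterns:
            if tuple(lowered[i:i + len(pat)]) == pat:
                spans.append((i, i + len(pat), " ".join(pat)))
                i += len(pat)  # continue after the glob, so spans never overlap
                break
        else:
            i += 1  # no MWE starts at this token
    return spans


# Hypothetical sentence and MWE list.
tokens = "The drilling rig pumps drilling mud past the drilling bit".split()
spans = glob_mwes(tokens, ["drilling mud", "drilling rig", "drilling bit"])
print(spans)
```

The spans could then be presented for review instead of being applied blindly, which keeps a manual check in the loop.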
This is a very particular description. We need a more general approach for the tool. Moreover, I am not sure whether that should be functionality in sensetion or implemented as scripts.
Ok, maybe not work from a list. Still, when I create a glob that sensetion wasn't able to recognize in the first place, it should have a way to let me know and check the remaining occurrences of that glob in the text.
The @hmuniz suggestion of using Emacs macros is really cool. We can temporarily define a single key to execute a sequence of commands. So, in a buffer with K occurrences of drilling mud, we could record a sequence of commands such as:
n
as a single key command. So instead of K*(approx. 13+2+1+12+1) keys, we would press only K keys. The important point is that this keeps visual inspection and manual review in place for quality control.
Originally our idea was to have globbing be automatic, using the enrich.py script; manual input would be restricted to difficult cases not detected by the globbing mechanism, or to unglobbing wrong globs (which is much faster than globbing).
Yep, that is the reason for my https://github.com/own-pt/sensetion.el/issues/149#issuecomment-520559820 above.
There should be an easy way of targeting MWEs for globbing.
Currently, to glob, say, "drilling mud", one must target either "drilling" or "mud", then manually search for co-occurrences and glob them. However, "drilling" and "mud" are both very frequent words in the corpus, and they do not always occur together. Even so, there is a huge amount of "drilling mud" to be globbed and annotated.