own-pt / sensetion.el

Emacs word-sense annotation interface
GNU General Public License v3.0

"two-career" is being shown as "two career" #183

Open arademaker opened 4 years ago

arademaker commented 4 years ago

Losing the original text? Is it the right thing to do?

((kind "wf")
 (form . "two")
 (lemmas "two")
 (tag . "ignore")
 (meta
  (sep . "-")
  (type . "num")))

We do have the sep to reproduce the original text. The questions are:

  1. is it easier to have the text tokenized in the buffer?
  2. should we not distinguish between spaces and other separators?

Remember that the default sep is a space, so when a token doesn't have a sep, sep=" " is assumed. See the confusing explanation in https://github.com/own-pt/glosstag/blob/princeton/dtd/glosstag.dtd#L158-L161 for the glosstag corpus!
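
A minimal sketch of that default rule, in Python rather than the Emacs Lisp sensetion itself uses. It assumes each token is a dict with a "form" and an optional "sep", that sep is the separator following the token (as the "two" example above suggests), and that the last token's separator is not rendered. The `render` function and the dict layout are hypothetical, not sensetion's API.

```python
DEFAULT_SEP = " "  # assumed default when a token carries no sep


def render(tokens):
    """Join token forms, using each token's sep (or a space) before the next token."""
    parts = []
    for i, tok in enumerate(tokens):
        parts.append(tok["form"])
        # Assumption: nothing is emitted after the final token.
        if i < len(tokens) - 1:
            parts.append(tok.get("sep", DEFAULT_SEP))
    return "".join(parts)


tokens = [
    {"form": "two", "sep": "-"},  # sep taken from the token shown above
    {"form": "career"},           # no sep, so a space is assumed
    {"form": "family"},           # illustrative token, not from the corpus
]
print(render(tokens))
```

With these tokens the hyphen survives between "two" and "career", while the missing sep on "career" falls back to a space.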

odanoburu commented 4 years ago

> Losing the original text?

not really losing, but not showing it properly indeed.

the proper tokenization (using sep as separator when available and a space as default) could be implemented, but then I'm not sure if any other corpus will have sep attributes to make it worthwhile… how is tokenization described by other tokenizers? could touch.py produce something akin to sep? would it be useful to do so?
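
One way a tokenizer could emit something akin to sep, sketched in Python under loud assumptions: the original text is still available, the token forms appear in it verbatim and in order, and a sep equal to the default space is simply omitted. The `attach_seps` function is hypothetical and says nothing about what touch.py actually does.

```python
def attach_seps(text, forms, default_sep=" "):
    """Return token dicts, recording whatever text separates each token from the next."""
    # Locate each form in the original text, scanning left to right.
    spans = []
    pos = 0
    for form in forms:
        start = text.find(form, pos)
        if start < 0:
            raise ValueError(f"token {form!r} not found after offset {pos}")
        spans.append((start, start + len(form)))
        pos = start + len(form)

    tokens = []
    for i, (form, (_, end)) in enumerate(zip(forms, spans)):
        token = {"form": form}
        if i + 1 < len(spans):
            sep = text[end:spans[i + 1][0]]
            # Only store sep when it differs from the assumed default.
            if sep != default_sep:
                token["sep"] = sep
        tokens.append(token)
    return tokens


print(attach_seps("two-career family", ["two", "career", "family"]))
```

Storing only non-default separators keeps the output compact for corpora where nearly every token is followed by a single space.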