jonorthwash / ud-annotatrix

GNU General Public License v3.0

Bring back space-to-tab conversion (for pasted input) #405

Open ftyers opened 4 years ago

ftyers commented 4 years ago

When copy/pasting from vim in a terminal, spaces are copied instead of tabs. Previously we had a heuristic for dealing with this: it replaced runs of two or more spaces with a tab. This worked 95% of the time, and it would be great to get it back.
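
For reference, the heuristic described here can be sketched in a few lines. This is only a minimal Python illustration, not the code that ud-annotatrix or notatrix actually uses; the function name `spaces_to_tabs` is made up for the example, and it assumes that comment lines (starting with `#`) and lines that already contain tabs should be left alone.

```python
import re

def spaces_to_tabs(text: str) -> str:
    """Replace each run of 2+ spaces with a single tab, line by line."""
    out = []
    for line in text.split("\n"):
        # Assumption: comments and lines that already have tabs are left untouched.
        if line.startswith("#") or "\t" in line:
            out.append(line)
        else:
            out.append(re.sub(r" {2,}", "\t", line))
    return "\n".join(out)
```

Single spaces are deliberately preserved, since fields such as the form or lemma may legitimately contain them.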

jonorthwash commented 4 years ago

> When copy/pasting from vim in a terminal

This is about 40% of the reason I use gvim.

ftyers commented 4 years ago

It also applies to copying from a bad pastebin that replaces tabs with spaces.

jonorthwash commented 4 years ago

> It also applies to copying from a bad pastebin that replaces tabs with spaces.

Yeah, we can do some sort of space-to-tab conversion through heuristics—and I do remember this working—but really everyone should be using tabs. Any software that converts tabs to spaces without asking you should be avoided.

keggsmurph21 commented 4 years ago

Hm, there is a multiple-space-to-tab conversion function built into notatrix, so I would expect this to work. I'm not immediately sure why it wouldn't.

Also, when I copy/paste from Vim, it preserves my <tab> characters, so I'm not sure I'll be able to reproduce this. I definitely agree that it should be a supported feature.

At the very least, we could do our own multiple-space-to-tab conversions before passing the input along to notatrix, just to make sure that it's behaving how we expect.

jonorthwash commented 4 years ago

> Hm, there is a multiple-space-to-tab conversion function built into notatrix, so I would expect this to work. I'm not immediately sure why it wouldn't.

Could you clarify how notatrix is used? E.g., if one clones ud-annotatrix and just serves the code (or if one hosts on github), how is notatrix leveraged? Is it a dependency that lives somewhere in the ud-annotatrix repo too? If so, might it need to be updated?

jonorthwash commented 4 years ago

> Hm, there is a multiple-space-to-tab conversion function built into notatrix, so I would expect this to work. I'm not immediately sure why it wouldn't.
>
> Also, when I copy/paste from Vim, it preserves my <tab> characters, so I'm not sure I'll be able to reproduce this. I definitely agree that it should be a supported feature.
>
> At the very least, we could do our own multiple-space-to-tab conversions before passing the input along to notatrix, just to make sure that it's behaving how we expect.

Try a different terminal. Most terminals suck at this, probably by design. I just tested xfce4-terminal, terminator, mlterm, and konsole (all of which were already on my laptop), and they all copied spaces from a tab in vim, both in select/middle-click copies and regular copy/paste copies (i.e., both standard copy-paste buffers had this issue). Pastes were tested into Firefox, but ime anywhere else is also a problem, especially back into vim :-P

keggsmurph21 commented 4 years ago

> Hm, there is a multiple-space-to-tab conversion function built into notatrix, so I would expect this to work. I'm not immediately sure why it wouldn't.
>
> Could you clarify how notatrix is used? E.g., if one clones ud-annotatrix and just serves the code (or if one hosts on github), how is notatrix leveraged? Is it a dependency that lives somewhere in the ud-annotatrix repo too? If so, might it need to be updated?

Replied to this question in https://github.com/jonorthwash/ud-annotatrix/issues/397#issuecomment-636520902. If you cloned the repo and are hosting locally, you may need to refresh dependencies (via npm install --save-dev).

ftyers commented 3 years ago

You can try: https://dpaste.com/H6X6ABMUC In: https://ftyers.github.io/ud-annotatrix/standalone/annotator.html and in: https://jonorthwash.github.io/ud-annotatrix/

jonorthwash commented 2 years ago

I'm having trouble reproducing this issue.

> You can try: https://dpaste.com/H6X6ABMUC

This dpaste is no longer available. Could you paste something where you're encountering this into this issue?

ftyers commented 2 years ago

https://dpaste.com/6HYV8MFMP

jonorthwash commented 2 years ago

> https://dpaste.com/6HYV8MFMP

Is this even valid CoNLL-U? It has only one space in yehuatl PRON. If you add another space there it works fine.

Regardless of validity, the algorithm (and what you stated the issue was) is that there have to be at least two spaces between each token. Otherwise, how would it know it's not meant to be a single column?
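
To make the ambiguity concrete, here is a small Python sketch. The sample line is invented for illustration (it is not the content of the dpaste); it only reproduces the single space between `yehuatl` and `PRON` mentioned above. Splitting on two or more spaces then gives 9 fields rather than the 10 columns CoNLL-U requires.

```python
import re

# Hypothetical line: every separator is two spaces, except the single
# space between the lemma ("yehuatl") and the UPOS tag ("PRON").
line = "1  yehuatl  yehuatl PRON  _  _  2  nsubj  _  _"

columns = re.split(r" {2,}", line)
print(len(columns))  # 9, one short of the 10 CoNLL-U columns
print(columns[2])    # "yehuatl PRON": lemma and UPOS fused into one field
```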

ftyers commented 2 years ago

It isn't valid CoNLL-U because it doesn't have tabs. I think a heuristic can be made for the case of single spaces. Only some of the columns can contain spaces, e.g. form/lemma/misc. So the fact that it's the third column and the following column contains a UPOS tag is pretty good evidence.

jonorthwash commented 2 years ago

So a stupidish algorithm that would get this case and some others could be:

   if spaces in line:
      columns = line.split(\s{2,})    # current behaviour
      if count(columns) not correct:  # new
         for column in certainColumns:
            if \s+ in column and count(intersection(set(certainTags), set(column.split()))) > 0:
               column.split()
   if count(columns) not correct:
      sentence = invalid              # current behaviour?
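
A runnable take on the pseudocode above might look like the following Python sketch. It is deliberately narrower than the pseudocode: it only repairs the one case discussed in this thread, where the third field ends in a UPOS tag after a single space. The names `split_line`, `N_COLUMNS`, and `UPOS_TAGS` are hypothetical and do not come from notatrix; returning `None` stands in for "sentence = invalid".

```python
import re

N_COLUMNS = 10  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def split_line(line):
    """Return a list of 10 columns, or None if the line cannot be repaired."""
    if "\t" in line:
        columns = line.split("\t")
    else:
        columns = re.split(r" {2,}", line.strip())  # current behaviour
    if len(columns) == N_COLUMNS - 1:
        # New: the lemma and the UPOS tag may be fused by a single space.
        # rpartition keeps any earlier spaces inside the lemma itself.
        head, sep, tail = columns[2].rpartition(" ")
        if sep and tail in UPOS_TAGS:
            columns[2:3] = [head, tail]
    return columns if len(columns) == N_COLUMNS else None
```

Splitting on the last space only means a multiword lemma such as `New York PROPN` still comes out as `New York` plus `PROPN`. Other fused columns (e.g. form and lemma) remain ambiguous and are left as invalid here.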