Source code completion are broken

fischerling commented 9 months ago

Since applying https://github.com/martanne/vis/pull/946/commits/711447afd4a7857d1ecb173893009924b07d4993 source code word completion is broken.

In source code the words I want to complete are not always seperated by whitespace.

For example in:

def foo(bar):
  ba

ba is not completeable to bar because bar is not surrounded by whitespace.

Furthermore, the completion now suggests useless artifacts, consider:

void foo(int bar);

fo

Now the completions suggests foo(int because it is surrounded by whitespace.

fischerling commented 9 months ago

I suggest using something like sed -E 's/([^[:alnum:]_])/\n/g' instead which seams to work in my quick experiments.

I will use this for a while and report back.

The sed version I use is: sed (GNU sed) 4.9.

rnpnr commented 9 months ago

Sorry about that, I should have noticed that -c [:alnum:] and [:blank:] are not even close to the same thing. I can probably revert that commit for now since vis is primarily a source code editor and it makes more sense to work in that context first.

I suggest using something like sed -E 's/([^[:alnum:]_])/\n/g' instead

That doesn't seem to work. Since the only non-latin scripts I'm familiar with are Chinese & Japanese characters which don't work in either version I found this polish text online:

Jest dostępnych wiele różnych wersji Lorem Ipsum, ale większość
zmieniła się pod wpływem dodanego humoru czy przypadkowych słów,
które nawet w najmniejszym stopniu nie przypominają istniejących.

If you try completing some words from there you should see that your method doesn't work.

fischerling commented 9 months ago

That doesn't seem to work. Since the only non-latin scripts I'm familiar with are Chinese & Japanese characters which don't work in either version I found this [polish text][] online:
Jest dostępnych wiele różnych wersji Lorem Ipsum, ale większość
zmieniła się pod wpływem dodanego humoru czy przypadkowych słów,
które nawet w najmniejszym stopniu nie przypominają istniejących.
If you try completing some words from there you should see that your method doesn't work.

What sed version are you using? Completing words in the polish text you provided works just fine on my system.

rnpnr commented 9 months ago

What sed version are you using?

sbase sed. I tried with GNU sed and it does work as you suggested. I checked and GNU implements their own regex library whereas sbase just uses the standard POSIX regex library.

Shugyousha commented 9 months ago

Randy Palamar @.***> wrote:

Sorry about that, I should have noticed that -c [:alnum:] and [:blank:] are not even close to the same thing. I can probably revert that commit for now since vis is primarily a source code editor and it makes more sense to work in that context first.

My bad! I wasn't thinking about the source code completion case.

I suggest using something like sed -E 's/([^[:alnum:]_])/\n/g' instead

How about something like this tr -s '[:blank:]{}()[]_' '\n' ? I think that should cover the programming use case as well (though I did test this only lightly).

Cheers, Silvan

fischerling commented 9 months ago

My bad! I wasn't thinking about the source code completion case.

But this does not even work for normal punctuation in English text. You can not complete the last word in a sentence ended by punctuation (e.g . or ?).

I suggest using something like sed -E 's/([^[:alnum:]])/\n/g' instead How about something like this `tr -s '[:blank:]{}()[]' '\n'` ? I think that should cover the programming use case as well (though I did test this only lightly). Cheers, Silvan

We would need a lot of special characters in our regex, for example ',' is definitely missing as well.

Or imagine writing XML or HTML code where we would need <and > symbols.

Using a multi-byte encoding negative match of characters allowed in identifiers, [:alnum:]_, is probably the only really universally working solution.

Maybe we can detect GNU sed or make the command somehow configurable.

Or in the worst case we could implement a simple portable binary, doing exactly what we need, splitting multi-byte encoded text by [:alnum:]_. Supposed we find some one willing to actually do it.

fischerling commented 9 months ago

Something as simple as the following C program is probably enough.

#include <err.h>
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

int main() {
    wint_t c;
    setlocale(LC_ALL, "");

    errno = 0;
    while((c=fgetwc(stdin)) != WEOF) {
        if (!(iswalnum(c) || c == btowc('_')))
            c = btowc('\n');

        wint_t res = fputwc(c, stdout);
        if (res == WEOF)
            err(EXIT_FAILURE, "%s", "Failed to write to stdout");
    }

        if (errno)
            err(EXIT_FAILURE, "%s", "Failed to read from stdin");
}

Shugyousha commented 9 months ago

fischerling @.***> wrote:

My bad! I wasn't thinking about the source code completion case.

But this does not even work for normal punctuation in English text. You can not complete the last word in a sentence ended by punctuation (e.g . or ?).

Yes, I cannot say that this was well tested. Not enough dog fooding :/

I suggest using something like sed -E 's/([^[:alnum:]])/\n/g' instead How about something like this `tr -s '[:blank:]{}()[]' '\n'` ? I think that should cover the programming use case as well (though I did test this only lightly). Cheers, Silvan

We would need a lot of special characters in our regex, for example ',' is definitely missing as well.

Or imagine writing XML or HTML code where we would need <and > symbols.

I don't think adding a lot of special characters to the pattern is an issue. If we think we can restore the functionality by adding a few more characters (like ',<>%^&.' and maybe some other ones) to that set, I think it would be an easy solution.

This wouldn't solve the issue of not being able to pass multibyte sequences to GNU tr which means we couldn't use Unicode code points (like Chinese/Japanese punctuation, i.e. 。，) there. Apparently GNU sed would support those though so we could use that if we decide to go that route.

Cheers, Silvan

rnpnr commented 9 months ago

I would like to avoid making vis-complete dependent on the GNU versions of tools. I think I prefer tr with some hand picked character set that is good enough. Its probably better to avoid the terrible {w,}ctype(3) classes altogether as well. Something like tr -s '\t {}()[],<>%^&.' '\n' is probably good enough for most uses.

Edit: Sorry, _ shouldn't be in the set.

Shugyousha commented 9 months ago

Randy Palamar @.***> wrote:

I would like to avoid making vis-complete dependent on the GNU versions of tools. I think I prefer tr with some hand picked character set that is good enough. Its probably better to avoid the terrible {w,}ctype(3) classes altogether as well. Something like tr -s '\t {}()[]_,<>%^&.' '\n' is probably good enough for most uses.

Yes, I agree. It doesn't cover the unicode use case but it should take us quite far for the programming one.

Should I send a patch?

rnpnr commented 9 months ago

Should I send a patch?

Sure, if you want. I was mostly just waiting to see if this fixed the issue for both of you. Just a small thing since you replied to the email; it should be:

tr -s '\t {}()[],<>%^&.' '\n'

without the _. This way it completes whole snake case function names as they are used in vis. Its also inline with the old behaviour.

fischerling commented 9 months ago

Should I send a patch?

Sure, if you want. I was mostly just waiting to see if this fixed the issue for both of you. Just a small thing since you replied to the email; it should be:
tr -s '\t {}()[],<>%^&.' '\n'
without the _. This way it completes whole snake case function names as they are used in vis. Its also inline with the old behaviour.

This is definitely way better than using [:blank:]. I used it for the last day[s] and it seams to work so far.

However I could still locally patch it to use GNU sed so feel free to update the regex.

martanne / vis

Source code completion are broken #1132