github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

tokenizer.rb:17: [BUG] Segmentation fault #5938

Closed lildude closed 2 years ago

lildude commented 2 years ago

Moving over from discussions as this is definitely a bug.

Originally discussed in https://github.com/github/linguist/discussions/5876

Originally posted by **ceshiyixiaba** April 25, 2022

git repository: https://github.com/ceshiyixiaba/languages

[docker environment](https://github.com/github/linguist#docker), error message:

```text
docker run --user=1000 --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist | more
/usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x0000004700000057
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-musl]

-- Control frame information -----------------------------------------------
c:0026 p:---- s:0154 e:000153 CFUNC  :extract_tokens
c:0025 p:0007 s:0149 e:000148 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17
c:0024 p:0043 s:0144 e:000143 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:106
c:0023 p:0031 s:0136 e:000135 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:84
c:0022 p:0043 s:0129 e:000128 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:22
c:0021 p:0011 s:0122 e:000121 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:32
c:0020 p:0026 s:0119 e:000118 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0019 p:0026 s:0113 e:000112 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:31 [FINISH]
c:0018 p:---- s:0108 e:000107 CFUNC  :each
c:0017 p:0020 s:0104 e:000103 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:29
c:0016 p:0026 s:0099 e:000098 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0015 p:0042 s:0093 e:000092 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:24
c:0014 p:0026 s:0086 e:000085 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:368
c:0013 p:0041 s:0082 e:000081 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/lazy_blob.rb:70
c:0012 p:0030 s:0077 e:000076 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:383
c:0011 p:0004 s:0073 e:000072 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:174
c:0010 p:0135 s:0066 e:000065 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:164 [FINISH]
c:0009 p:---- s:0057 e:000056 CFUNC  :each_delta
c:0008 p:0158 s:0053 e:000052 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:149
c:0007 p:0038 s:0044 e:000043 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:116
c:0006 p:0032 s:0040 E:000ea8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:68
c:0005 p:0176 s:0035 E:001bd8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:52
c:0004 p:0112 s:0016 e:000015 TOP    /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:142 [FINISH]
c:0003 p:---- s:0013 e:000012 CFUNC  :load
c:0002 p:0112 s:0008 E:002000 EVAL   /usr/local/bundle/bin/github-linguist:23 [FINISH]
c:0001 p:0000 s:0003 E:000810 (none) [FINISH]
```

I tested different versions:

- github-linguist 7.12.2: OK
- github-linguist 7.13.0: Error

```bash
$ /var/lib/gems/2.7.0/gems/github-linguist-7.12.2/bin/git-linguist stats --commit=`git rev-parse HEAD`
{"OpenEdge ABL":116388}
$ /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats --commit=`git rev-parse HEAD`
free(): invalid next size (normal)
[1]    3231762 abort (core dumped)  /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats
$ ruby -v
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-gnu]
$ uname -a
Linux VM-0-10-ubuntu 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
```

As pointed out in the discussion, 7.13 was the first release after @smola made changes to the tokenizer in https://github.com/github/linguist/pull/5205 and https://github.com/github/linguist/pull/5186 and https://github.com/github/linguist/pull/5061

lildude commented 2 years ago

For those who encounter this issue before it is fixed, the workaround is to add an override to the .gitattributes file that prevents the analysis from falling through to the classifier.

Any of the following would do the trick:

# Explicitly state the language and keep it counting towards the language stats. Replace <language> with actual language
path/to/your/file linguist-language=<language>
# Mark the file as vendored, exclude it from the language stats and have it appear expanded in diffs
path/to/your/file linguist-vendored
# Mark the file as generated, exclude it from the language stats and have it appear expanded in diffs
path/to/your/file linguist-generated

If you have multiple affected files, use wildcards as necessary.

lildude commented 2 years ago

I've isolated this down to https://github.com/github/linguist/pull/5060 being the PR that introduced this bug.

Running the analysis with valgrind eventually gives me:

➜ valgrind bundle exec bin/github-linguist tmp/languages/ --breakdown
[...]
==27914== Invalid write of size 4
==27914==    at 0xBDA849F: linguist_yylex (lex.linguist_yy.c:1278)
==27914==    by 0xBDA9E66: rb_tokenizer_extract_tokens (linguist.c:28)
==27914==    by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914==    by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914==    by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914==    by 0x4ACA6C6: vm_sendish (vm_insnhelper.c:4023)
==27914==    by 0x4ACA6C6: vm_exec_core (insns.def:801)
==27914==    by 0x4AD081A: rb_vm_exec (vm.c:1920)
==27914==    by 0x4ADBF38: invoke_block (vm.c:1044)
==27914==    by 0x4ADBF38: invoke_iseq_block_from_c (vm.c:1116)
==27914==    by 0x4ADBF38: invoke_block_from_c_bh (vm.c:1134)
==27914==    by 0x4ADBF38: vm_yield (vm.c:1179)
==27914==    by 0x4ADBF38: rb_yield_0 (vm_eval.c:1227)
==27914==    by 0x4ADBF38: rb_yield_1 (vm_eval.c:1233)
==27914==    by 0x4ADBF38: rb_yield (vm_eval.c:1243)
==27914==    by 0x4880B6B: rb_ary_each (array.c:2135)
==27914==    by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914==    by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914==    by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914==    by 0x4ACBBD4: vm_sendish (vm_insnhelper.c:4023)
==27914==    by 0x4ACBBD4: vm_exec_core (insns.def:782)
==27914==    by 0x4AD081A: rb_vm_exec (vm.c:1920)
==27914==  Address 0x12f52df8 is 0 bytes after a block of size 65,544 alloc'd
==27914==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==27914==    by 0xBDA8816: linguist_yylex (lex.linguist_yy.c:1215)
==27914==    by 0xBDA9E66: rb_tokenizer_extract_tokens (linguist.c:28)
==27914==    by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914==    by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914==    by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914==    by 0x4ACA6C6: vm_sendish (vm_insnhelper.c:4023)
==27914==    by 0x4ACA6C6: vm_exec_core (insns.def:801)
==27914==    by 0x4AD081A: rb_vm_exec (vm.c:1920)
==27914==    by 0x4ADBF38: invoke_block (vm.c:1044)
==27914==    by 0x4ADBF38: invoke_iseq_block_from_c (vm.c:1116)
==27914==    by 0x4ADBF38: invoke_block_from_c_bh (vm.c:1134)
==27914==    by 0x4ADBF38: vm_yield (vm.c:1179)
==27914==    by 0x4ADBF38: rb_yield_0 (vm_eval.c:1227)
==27914==    by 0x4ADBF38: rb_yield_1 (vm_eval.c:1233)
==27914==    by 0x4ADBF38: rb_yield (vm_eval.c:1243)
==27914==    by 0x4880B6B: rb_ary_each (array.c:2135)
==27914==    by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914==    by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914==    by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914==    by 0x4ACBBD4: vm_sendish (vm_insnhelper.c:4023)
==27914==    by 0x4ACBBD4: vm_exec_core (insns.def:782)
==27914== 

valgrind: m_mallocfree.c:305 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 65616, hi = 304942678087.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

[...]

This leads me here, specifically the line with YY_STATE_BUF_SIZE:

        /* Create the reject buffer large enough to save one state per allowed character. */
        if ( ! yyg->yy_state_buf )
            yyg->yy_state_buf = (yy_state_type *)yyalloc(YY_STATE_BUF_SIZE  , yyscanner);
            if ( ! yyg->yy_state_buf )
                YY_FATAL_ERROR( "out of dynamic memory in yylex()" );

This is defined here:

/* Size of default input buffer. */
#ifndef YY_BUF_SIZE
#ifdef __ia64__
/* On IA-64, the buffer size is 16k, not 8k.
 * Moreover, YY_BUF_SIZE is 2*YY_READ_BUF_SIZE in the general case.
 * Ditto for the __ia64__ case accordingly.
 */
#define YY_BUF_SIZE 32768
#else
#define YY_BUF_SIZE 16384
#endif /* __ia64__ */
#endif

/* The state buf must be large enough to hold one state per character in the main buffer.
 */
#define YY_STATE_BUF_SIZE   ((YY_BUF_SIZE + 2) * sizeof(yy_state_type))

My naïve knowledge of C makes me read this as: on IA-64 we set a buffer size of ~32 KB, and for everything else we set it to ~16 KB.

As this is not an IA-64 (Itanium) architecture, we're using the smaller buffer.

So what happens if we bump the non-ia64 buffer to match?

Bingo!!! It works:

➜ bundle exec rake compile
cp lib/linguist/samples.json tmp/x86_64-linux/stage/lib/linguist/samples.json
install -c tmp/x86_64-linux/linguist/2.7.2/linguist.so lib/linguist/linguist.so
cp tmp/x86_64-linux/linguist/2.7.2/linguist.so tmp/x86_64-linux/stage/lib/linguist/linguist.so
➜ bundle exec bin/github-linguist tmp/languages/ --breakdown
100.00% 116388     OpenEdge ABL

OpenEdge ABL:
t.w

I have no idea if this is the right fix, but it seems to do the trick.

@smola what do you think?

dcreager commented 2 years ago

This is great detective work, thanks @lildude! Your pointers led me to some additional documentation that is relevant.

It seems like bumping the size of YY_BUF_SIZE will work in this particular case, but we still run the risk of some other file eventually producing a token bigger than even the larger 32K value. The SO thread also suggests that we might be able to modify the lexer itself to get around this:

> So your best solution is to avoid matching huge tokens; for example, strings and comments can be tokenized one line at a time instead of being tokenized as one enormous token.

That seems like it's worth trying, though the flex source is a bit arcane to me.

gpflaum commented 2 years ago

@lildude I tried making your change and still get the crash. Perhaps I'm not building it right? Changes:

$ git diff ext/linguist/lex.linguist_yy.c
diff --git a/ext/linguist/lex.linguist_yy.c b/ext/linguist/lex.linguist_yy.c
index fe0025b3..1a8823ae 100644
--- a/ext/linguist/lex.linguist_yy.c
+++ b/ext/linguist/lex.linguist_yy.c
@@ -363,7 +363,7 @@ typedef void* yyscan_t;
  */
 #define YY_BUF_SIZE 32768
 #else
-#define YY_BUF_SIZE 16384
+#define YY_BUF_SIZE 32768
 #endif /* __ia64__ */
 #endif

$ git diff ext/linguist/lex.linguist_yy.h
diff --git a/ext/linguist/lex.linguist_yy.h b/ext/linguist/lex.linguist_yy.h
index 9c9fc547..f1499ba5 100644
--- a/ext/linguist/lex.linguist_yy.h
+++ b/ext/linguist/lex.linguist_yy.h
@@ -341,7 +341,7 @@ typedef void* yyscan_t;
  */
 #define YY_BUF_SIZE 32768
 #else
-#define YY_BUF_SIZE 16384
+#define YY_BUF_SIZE 32768
 #endif /* __ia64__ */
 #endif

Building:

$ bundle exec rake compile
cd tmp/x86_64-darwin20/linguist/3.0.0
/usr/bin/make
compiling ../../../../ext/linguist/lex.linguist_yy.c
compiling ../../../../ext/linguist/linguist.c
linking shared-object linguist/linguist.bundle
cd -
cp ext/linguist/lex.linguist_yy.c tmp/x86_64-darwin20/stage/ext/linguist/lex.linguist_yy.c
cp ext/linguist/lex.linguist_yy.h tmp/x86_64-darwin20/stage/ext/linguist/lex.linguist_yy.h
cp lib/linguist/samples.json tmp/x86_64-darwin20/stage/lib/linguist/samples.json
install -c tmp/x86_64-darwin20/linguist/3.0.0/linguist.bundle lib/linguist/linguist.bundle
cp tmp/x86_64-darwin20/linguist/3.0.0/linguist.bundle tmp/x86_64-darwin20/stage/lib/linguist/linguist.bundle

Testing:

$ bundle exec bin/github-linguist ../linguist-crash
/Users/gpflaum/ws/linguist/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x00007fcb61caf000
[...]
gpflaum commented 2 years ago

@lildude @smola You might have already received this information from my GitHub Enterprise Support ticket, but if I unpack the objects in my repo, linguist doesn't crash. When I repack the objects, the crash returns.

Unpack:

$ mv .git/objects/pack/pack-1d48e82611d821049a22f8abb3af7a8de58c12e8.pack /tmp
$ git unpack-objects < /tmp/pack-1d48e82611d821049a22f8abb3af7a8de58c12e8.pack 
Unpacking objects: 100% (250517/250517), 379.26 MiB | 250.00 KiB/s, done.

Now running github-linguist doesn't crash:

$ github-linguist .
90.54%  9496857    Objective-C
7.45%   781186     C
0.92%   96041      C++
0.57%   59335      HTML
0.24%   24697      Objective-C++
0.16%   16374      Shell
0.10%   10395      REXX
0.03%   2823       Perl
0.02%   1692       Ruby

Repack:

$ git repack -a
Enumerating objects: 250517, done.
Counting objects: 100% (250517/250517), done.
Delta compression using up to 8 threads
Compressing objects: 100% (240931/240931), done.
Writing objects: 100% (250517/250517), done.
Total 250517 (delta 203838), reused 0 (delta 0), pack-reused 0

Now github-linguist crashes again.

lildude commented 2 years ago

Thanks @gpflaum. Yes, you're doing the right thing. I suspect your file may have a longer string in it than the one I've been testing with; mine is 116388 chars long. Increasing the figure further may "solve" the issue, but this isn't really a fix.

I also saw your update in the support ticket (@smola wouldn't have as he isn't GitHub staff) and I was surprised to see unpacking made a difference... I'm not entirely sure why that is.

I've dug into this a bit more today. Another possible solution, though I'd need to investigate the implications and properly validate it before declaring it a fix, is to limit the amount of data we pass into this buffer:

--- ext/linguist/linguist.c
+++ ext/linguist/linguist.c
@@ -16,8 +16,8 @@ static VALUE rb_tokenizer_extract_tokens(VALUE self, VALUE rb_data) {
        Check_Type(rb_data, T_STRING);

        len = RSTRING_LEN(rb_data);
-       if (len > 100000)
-               len = 100000;
+       if (len > 32000)
+               len = 32000;

        linguist_yylex_init_extra(&extra, &scanner);
        buf = linguist_yy_scan_bytes(RSTRING_PTR(rb_data), (int) len, scanner);

I picked 32000 at random and reverted my previous change of YY_BUF_SIZE:

➜ bundle exec rake compile
cd tmp/x86_64-linux/linguist/2.7.2
/usr/bin/make
compiling ../../../../ext/linguist/linguist.c
linking shared-object linguist/linguist.so
cd -
cp ext/linguist/linguist.c tmp/x86_64-linux/stage/ext/linguist/linguist.c
install -c tmp/x86_64-linux/linguist/2.7.2/linguist.so lib/linguist/linguist.so
cp tmp/x86_64-linux/linguist/2.7.2/linguist.so tmp/x86_64-linux/stage/lib/linguist/linguist.so
➜ bundle exec bin/github-linguist tmp/languages/ --breakdown
100.00% 116388     OpenEdge ABL

OpenEdge ABL:
t.w

➜ 

This doesn't feel quite right to me as this has been 100k for over 5 years without issues.

I'll keep investigating.

gpflaum commented 2 years ago

@lildude I reverted the YY_BUF_SIZE change in my repo, made your len = 32000 change, and still get the crash. I added some debug code there and saw that it crashed on the first string >32000.

[...]
small string len: 5857
small string len: 3771
small string len: 1543
small string len: 4500
big string len: 35813
/Users/gpflaum/ws/linguist/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x00007f848b332000
[...]
        len = RSTRING_LEN(rb_data);
-       if (len > 100000)
-               len = 100000;
+       if (len > 32000) {
+               printf("big string len: %ld\n", len);
+               len = 32000;
+       } else {
+               printf("small string len: %ld\n", len);
+       }
lildude commented 2 years ago

🤔 I might have been lucky with my first random guess. As the buffer is only 16k, do you still see the crash if you use 16000 instead of 32000?

gpflaum commented 2 years ago

No crash if I limit len to 16000.

small string len: 5857
small string len: 3771
small string len: 1543
small string len: 4500
big string len: 35813
small string len: 4918
big string len: 36085
small string len: 12933
small string len: 9457
big string len: 51200
big string len: 41033
90.54%  9496857    Objective-C
7.45%   781186     C
0.92%   96041      C++
0.57%   59335      HTML
0.24%   24697      Objective-C++
0.16%   16374      Shell
0.10%   10395      REXX
0.03%   2823       Perl
0.02%   1692       Ruby
gpflaum commented 2 years ago

A couple of interesting things:

lildude commented 2 years ago

💥 I've worked it out.

After learning waaaay more Lex than I was hoping to, I found the issue is caused by these lines:

<INITIAL,punct>[-!#$%&*+,.:;=>?@\\^_`|~]+/{not_punct}   { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[-!#$%&*+,.:;=>?@\\^_`|~]+/("<!--"|"/*") { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[</]+/{not_punct}          { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[</]+/"<!--"               { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[</]+/"/*"                 { FEED(); BEGIN(INITIAL); return 1; }

They all use what is apparently called "trailing context", which is equivalent to using REJECT and results in the non-resizable buffer that is causing these problems. Using REJECT is discouraged both for performance reasons and because of this fixed-size buffer.

Removing those lines, and punct from the %x line, gives us this crucial part of the diff in lex.linguist_yy.c, which removes the REJECT definition and the fixed-size buffer we are crashing on:

-#define YY_TRAILING_MASK 0x2000
-#define YY_TRAILING_HEAD_MASK 0x4000
-#define REJECT \
-{ \
-*yy_cp = yyg->yy_hold_char; /* undo effects of setting up yytext */ \
-yy_cp = yyg->yy_full_match; /* restore poss. backed-over text */ \
-yyg->yy_lp = yyg->yy_full_lp; /* restore orig. accepting pos. */ \
-yyg->yy_state_ptr = yyg->yy_full_state; /* restore orig. state */ \
-yy_current_state = *yyg->yy_state_ptr; /* restore curr. state */ \
-++yyg->yy_lp; \
-goto find_rule; \
-}
-
-#define yymore() (yyg->yy_more_flag = 1)
-#define YY_MORE_ADJ yyg->yy_more_len
+/* The intent behind this definition is that it'll catch
+ * any uses of REJECT which flex missed.
+ */
+#define REJECT reject_used_but_not_detected
+#define yymore() yymore_used_but_not_detected
+#define YY_MORE_ADJ 0
 #define YY_RESTORE_YY_MORE_OFFSET
 #line 1 "tokenizer.l"
 #line 2 "tokenizer.l"

[... truncated for brevity ...]

@@ -1210,12 +1070,6 @@ YY_DECL
                YY_USER_INIT;
 #endif

-        /* Create the reject buffer large enough to save one state per allowed character. */
-        if ( ! yyg->yy_state_buf )
-            yyg->yy_state_buf = (yy_state_type *)yyalloc(YY_STATE_BUF_SIZE  , yyscanner);
-            if ( ! yyg->yy_state_buf )
-                YY_FATAL_ERROR( "out of dynamic memory in yylex()" );
-
                if ( ! yyg->yy_start )
                        yyg->yy_start = 1;      /* first start state */

@@ -1238,16 +1092,10 @@ YY_DECL
 #line 58 "tokenizer.l"

... and we're back to a working tokeniser.

I've also discovered we've fixed this once before in https://github.com/github/linguist/pull/3900 🤭

The challenge now is to reimplement the removed functionality without using trailing contexts.

I'm going to give it until the end of today to come up with a solution. If I can't find a working solution, I'm going to revert the punctuation parts introduced in https://github.com/github/linguist/pull/5060 - a little less accuracy in token matching is better than the crashes.

lildude commented 2 years ago

> I'm going to give it until the end of today to come up with a solution. If I can't find a working solution, I'm going to revert the punctuation parts introduced in #5060 - a little less accuracy in token matching is better than the crashes.

Sadly, I couldn't improve my newly found Lex skills enough to find a fix for these rules and keep the functionality in the time allotted so I've created https://github.com/github/linguist/pull/5956 to revert just those changes and add a crude test to catch this in the future.

I'm off next week so will merge it and make a new Linguist release in the week of 11 July 2022. I'm not sure how easy it will be to backport this release to older releases of GHES but I will try to go back to at least 3.4. I'll update this issue with the results of this attempt.

If someone with greater Lex skills can come up with alternate rules for the five identified in that time, please feel free to do so and I'll review on my return.

smola commented 2 years ago

Thanks @lildude! I was not able to spot the issue last time I tried debugging this.

I created https://github.com/github/linguist/pull/5969 with an alternative for punctuation tokenization, removing the use of trailing context.

lildude commented 2 years ago

I've backported just the gem update as far back as GHES 3.4 so this should be fixed in the next patch release.

lildude commented 2 years ago

I spoke too soon. It seems a lot of water has run under the bridge since GHES 3.4 was released, so a clean backport isn't possible. I can only get as far back as 3.5.