Until this is fixed, the workaround for anyone who encounters the issue is to add an override to the .gitattributes
file that prevents the analysis from falling through to the classifier.
Any of the following will do the trick:
# Explicitly state the language and keep the file counting towards the language stats. Replace <language> with the actual language
path/to/your/file linguist-language=<language>
# Mark the file as vendored and exclude it from the language stats
path/to/your/file linguist-vendored
# Mark the file as generated, exclude it from the language stats, and have it collapsed by default in diffs
path/to/your/file linguist-generated
If you have multiple affected files, use wildcards as necessary.
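For instance (hypothetical paths and languages, adapt to your repository):
# Classify every *.tpl file under templates/ as Smarty
templates/**/*.tpl linguist-language=Smarty
# Exclude an entire vendored directory from the stats
third_party/** linguist-vendored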
I've isolated this down to https://github.com/github/linguist/pull/5060 being the PR that introduced this bug.
Running the analysis with valgrind eventually gives me:
➜ valgrind bundle exec bin/github-linguist tmp/languages/ --breakdown
[...]
==27914== Invalid write of size 4
==27914== at 0xBDA849F: linguist_yylex (lex.linguist_yy.c:1278)
==27914== by 0xBDA9E66: rb_tokenizer_extract_tokens (linguist.c:28)
==27914== by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914== by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914== by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914== by 0x4ACA6C6: vm_sendish (vm_insnhelper.c:4023)
==27914== by 0x4ACA6C6: vm_exec_core (insns.def:801)
==27914== by 0x4AD081A: rb_vm_exec (vm.c:1920)
==27914== by 0x4ADBF38: invoke_block (vm.c:1044)
==27914== by 0x4ADBF38: invoke_iseq_block_from_c (vm.c:1116)
==27914== by 0x4ADBF38: invoke_block_from_c_bh (vm.c:1134)
==27914== by 0x4ADBF38: vm_yield (vm.c:1179)
==27914== by 0x4ADBF38: rb_yield_0 (vm_eval.c:1227)
==27914== by 0x4ADBF38: rb_yield_1 (vm_eval.c:1233)
==27914== by 0x4ADBF38: rb_yield (vm_eval.c:1243)
==27914== by 0x4880B6B: rb_ary_each (array.c:2135)
==27914== by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914== by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914== by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914== by 0x4ACBBD4: vm_sendish (vm_insnhelper.c:4023)
==27914== by 0x4ACBBD4: vm_exec_core (insns.def:782)
==27914== by 0x4AD081A: rb_vm_exec (vm.c:1920)
==27914== Address 0x12f52df8 is 0 bytes after a block of size 65,544 alloc'd
==27914== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==27914== by 0xBDA8816: linguist_yylex (lex.linguist_yy.c:1215)
==27914== by 0xBDA9E66: rb_tokenizer_extract_tokens (linguist.c:28)
==27914== by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914== by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914== by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914== by 0x4ACA6C6: vm_sendish (vm_insnhelper.c:4023)
==27914== by 0x4ACA6C6: vm_exec_core (insns.def:801)
==27914== by 0x4AD081A: rb_vm_exec (vm.c:1920)
==27914== by 0x4ADBF38: invoke_block (vm.c:1044)
==27914== by 0x4ADBF38: invoke_iseq_block_from_c (vm.c:1116)
==27914== by 0x4ADBF38: invoke_block_from_c_bh (vm.c:1134)
==27914== by 0x4ADBF38: vm_yield (vm.c:1179)
==27914== by 0x4ADBF38: rb_yield_0 (vm_eval.c:1227)
==27914== by 0x4ADBF38: rb_yield_1 (vm_eval.c:1233)
==27914== by 0x4ADBF38: rb_yield (vm_eval.c:1243)
==27914== by 0x4880B6B: rb_ary_each (array.c:2135)
==27914== by 0x4ABE6C5: vm_call_cfunc_with_frame (vm_insnhelper.c:2514)
==27914== by 0x4ABE6C5: vm_call_cfunc (vm_insnhelper.c:2539)
==27914== by 0x4AD9549: vm_call_method (vm_insnhelper.c:3053)
==27914== by 0x4ACBBD4: vm_sendish (vm_insnhelper.c:4023)
==27914== by 0x4ACBBD4: vm_exec_core (insns.def:782)
==27914==
valgrind: m_mallocfree.c:305 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 65616, hi = 304942678087.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.
[...]
This leads me here, specifically the line with YY_STATE_BUF_SIZE:
/* Create the reject buffer large enough to save one state per allowed character. */
if ( ! yyg->yy_state_buf )
yyg->yy_state_buf = (yy_state_type *)yyalloc(YY_STATE_BUF_SIZE , yyscanner);
if ( ! yyg->yy_state_buf )
YY_FATAL_ERROR( "out of dynamic memory in yylex()" );
This is defined here:
/* Size of default input buffer. */
#ifndef YY_BUF_SIZE
#ifdef __ia64__
/* On IA-64, the buffer size is 16k, not 8k.
* Moreover, YY_BUF_SIZE is 2*YY_READ_BUF_SIZE in the general case.
* Ditto for the __ia64__ case accordingly.
*/
#define YY_BUF_SIZE 32768
#else
#define YY_BUF_SIZE 16384
#endif /* __ia64__ */
#endif
/* The state buf must be large enough to hold one state per character in the main buffer.
*/
#define YY_STATE_BUF_SIZE ((YY_BUF_SIZE + 2) * sizeof(yy_state_type))
My naïve knowledge of C makes me read this as: on IA-64 we set a buffer size of ~32K, and for everything else ~16K.
As this is not an IA-64 (Itanium) architecture, we're using the smaller buffer.
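As a quick cross-check, here's a standalone sketch (mine, assuming flex's usual typedef int yy_state_type, i.e. 4 bytes) showing the non-IA-64 value lands exactly on the 65,544-byte block in the valgrind report above:
#include <stdio.h>

typedef int yy_state_type;  /* flex generates: typedef int yy_state_type; */

#define YY_BUF_SIZE 16384   /* the non-IA-64 default */
#define YY_STATE_BUF_SIZE ((YY_BUF_SIZE + 2) * sizeof(yy_state_type))

int main(void)
{
    /* (16384 + 2) * 4 = 65544, the "block of size 65,544" that the
     * invalid write lands 0 bytes after, so this is the buffer being overrun. */
    printf("YY_STATE_BUF_SIZE = %zu\n", (size_t) YY_STATE_BUF_SIZE);
    return 0;
}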
So what happens if we bump the non-IA-64 buffer to match?
Bingo!!! It works:
➜ bundle exec rake compile
cp lib/linguist/samples.json tmp/x86_64-linux/stage/lib/linguist/samples.json
install -c tmp/x86_64-linux/linguist/2.7.2/linguist.so lib/linguist/linguist.so
cp tmp/x86_64-linux/linguist/2.7.2/linguist.so tmp/x86_64-linux/stage/lib/linguist/linguist.so
➜ bundle exec bin/github-linguist tmp/languages/ --breakdown
100.00% 116388 OpenEdge ABL
OpenEdge ABL:
t.w
I have no idea if this is the right fix, but it seems to do the trick.
@smola what do you think?
This is great detective work, thanks @lildude! Your pointers led me to some additional relevant documentation.
It seems like bumping the size of YY_BUF_SIZE will work in this particular case, but we still run the risk of some other file eventually producing a token bigger than even the larger 32K value. The SO thread also suggests that we might be able to modify the lexer itself to get around this:
So your best solution is to avoid matching huge tokens; for example, strings and comments can be tokenized one line at a time instead of being tokenized as one enormous token.
That seems like it's worth trying, though the flex source is a bit arcane to me.
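For illustration, here's a minimal, self-contained flex sketch of that line-at-a-time idea (my own, not taken from linguist's tokenizer.l), using the classic pattern for C-style comments:
%option main
%x C_COMMENT
%%
"/*"                 { BEGIN(C_COMMENT); }
<C_COMMENT>"*/"      { BEGIN(INITIAL); }
<C_COMMENT>[^*\n]+   { /* comment body: at most one line per match */ }
<C_COMMENT>"*"       { /* a '*' that doesn't close the comment */ }
<C_COMMENT>\n        { /* the newline bounds every token's size */ }
%%
Because no rule in C_COMMENT can match across a newline, no single token grows with the length of the comment.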
@lildude I tried making your change and I still get the crash. Perhaps I'm not building it right? Changes:
$ git diff ext/linguist/lex.linguist_yy.c
diff --git a/ext/linguist/lex.linguist_yy.c b/ext/linguist/lex.linguist_yy.c
index fe0025b3..1a8823ae 100644
--- a/ext/linguist/lex.linguist_yy.c
+++ b/ext/linguist/lex.linguist_yy.c
@@ -363,7 +363,7 @@ typedef void* yyscan_t;
*/
#define YY_BUF_SIZE 32768
#else
-#define YY_BUF_SIZE 16384
+#define YY_BUF_SIZE 32768
#endif /* __ia64__ */
#endif
$ git diff ext/linguist/lex.linguist_yy.h
diff --git a/ext/linguist/lex.linguist_yy.h b/ext/linguist/lex.linguist_yy.h
index 9c9fc547..f1499ba5 100644
--- a/ext/linguist/lex.linguist_yy.h
+++ b/ext/linguist/lex.linguist_yy.h
@@ -341,7 +341,7 @@ typedef void* yyscan_t;
*/
#define YY_BUF_SIZE 32768
#else
-#define YY_BUF_SIZE 16384
+#define YY_BUF_SIZE 32768
#endif /* __ia64__ */
#endif
Building:
$ bundle exec rake compile
cd tmp/x86_64-darwin20/linguist/3.0.0
/usr/bin/make
compiling ../../../../ext/linguist/lex.linguist_yy.c
compiling ../../../../ext/linguist/linguist.c
linking shared-object linguist/linguist.bundle
cd -
cp ext/linguist/lex.linguist_yy.c tmp/x86_64-darwin20/stage/ext/linguist/lex.linguist_yy.c
cp ext/linguist/lex.linguist_yy.h tmp/x86_64-darwin20/stage/ext/linguist/lex.linguist_yy.h
cp lib/linguist/samples.json tmp/x86_64-darwin20/stage/lib/linguist/samples.json
install -c tmp/x86_64-darwin20/linguist/3.0.0/linguist.bundle lib/linguist/linguist.bundle
cp tmp/x86_64-darwin20/linguist/3.0.0/linguist.bundle tmp/x86_64-darwin20/stage/lib/linguist/linguist.bundle
Testing:
$ bundle exec bin/github-linguist ../linguist-crash
/Users/gpflaum/ws/linguist/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x00007fcb61caf000
[...]
@lildude @smola You might have already received this information from my GitHub Enterprise Support ticket, but if I unpack the objects in my repo, linguist doesn't crash. When I repack the objects, the crash returns.
Unpack:
$ mv .git/objects/pack/pack-1d48e82611d821049a22f8abb3af7a8de58c12e8.pack /tmp
$ git unpack-objects < /tmp/pack-1d48e82611d821049a22f8abb3af7a8de58c12e8.pack
Unpacking objects: 100% (250517/250517), 379.26 MiB | 250.00 KiB/s, done.
Now running github-linguist doesn't crash:
$ github-linguist .
90.54% 9496857 Objective-C
7.45% 781186 C
0.92% 96041 C++
0.57% 59335 HTML
0.24% 24697 Objective-C++
0.16% 16374 Shell
0.10% 10395 REXX
0.03% 2823 Perl
0.02% 1692 Ruby
Repack:
$ git repack -a
Enumerating objects: 250517, done.
Counting objects: 100% (250517/250517), done.
Delta compression using up to 8 threads
Compressing objects: 100% (240931/240931), done.
Writing objects: 100% (250517/250517), done.
Total 250517 (delta 203838), reused 0 (delta 0), pack-reused 0
Now github-linguist crashes again.
Thanks @gpflaum. Yes, you're doing the right thing. I suspect your file has a larger string in it than the one I've been testing with - mine is 116388 chars long. Increasing the figure may "solve" the issue, but this isn't really a fix.
I also saw your update in the support ticket (@smola wouldn't have seen it, as he isn't GitHub staff) and I was surprised to see unpacking made a difference... I'm not entirely sure why that is.
I've dug into this a bit more today. Another possible solution, though I'd need to investigate the implications and properly validate it before declaring it a fix, is to limit the amount of data we pass into this buffer:
--- ext/linguist/linguist.c
+++ ext/linguist/linguist.c
@@ -16,8 +16,8 @@ static VALUE rb_tokenizer_extract_tokens(VALUE self, VALUE rb_data) {
Check_Type(rb_data, T_STRING);
len = RSTRING_LEN(rb_data);
- if (len > 100000)
- len = 100000;
+ if (len > 32000)
+ len = 32000;
linguist_yylex_init_extra(&extra, &scanner);
buf = linguist_yy_scan_bytes(RSTRING_PTR(rb_data), (int) len, scanner);
I picked 32000 at random and reverted my previous change of YY_BUF_SIZE:
➜ bundle exec rake compile
cd tmp/x86_64-linux/linguist/2.7.2
/usr/bin/make
compiling ../../../../ext/linguist/linguist.c
linking shared-object linguist/linguist.so
cd -
cp ext/linguist/linguist.c tmp/x86_64-linux/stage/ext/linguist/linguist.c
install -c tmp/x86_64-linux/linguist/2.7.2/linguist.so lib/linguist/linguist.so
cp tmp/x86_64-linux/linguist/2.7.2/linguist.so tmp/x86_64-linux/stage/lib/linguist/linguist.so
➜ bundle exec bin/github-linguist tmp/languages/ --breakdown
100.00% 116388 OpenEdge ABL
OpenEdge ABL:
t.w
➜
This doesn't feel quite right to me as this has been 100k for over 5 years without issues.
I'll keep investigating.
@lildude I reverted the YY_BUF_SIZE change in my repo and made your len = 32000 change, and it still crashes. I added some debug code there and saw that it crashed on the first string >32000.
[...]
small string len: 5857
small string len: 3771
small string len: 1543
small string len: 4500
big string len: 35813
/Users/gpflaum/ws/linguist/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x00007f848b332000
[...]
len = RSTRING_LEN(rb_data);
- if (len > 100000)
- len = 100000;
+ if (len > 32000) {
+ printf("big string len: %ld\n", len);
+ len = 32000;
+ } else {
+ printf("small string len: %ld\n", len);
+ }
🤔 I might have been lucky with my first random guess. As the buffer is only 16k, do you still see the crash if you use 16000 instead of 32000?
No crash if I limit len to 16000.
small string len: 5857
small string len: 3771
small string len: 1543
small string len: 4500
big string len: 35813
small string len: 4918
big string len: 36085
small string len: 12933
small string len: 9457
big string len: 51200
big string len: 41033
90.54% 9496857 Objective-C
7.45% 781186 C
0.92% 96041 C++
0.57% 59335 HTML
0.24% 24697 Objective-C++
0.16% 16374 Shell
0.10% 10395 REXX
0.03% 2823 Perl
0.02% 1692 Ruby
A couple of interesting things:
[...]
💥 I've worked it out.
After learning waaaay more Lex than I was hoping to, the issue is caused by these lines:
<INITIAL,punct>[-!#$%&*+,.:;=>?@\\^_`|~]+/{not_punct} { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[-!#$%&*+,.:;=>?@\\^_`|~]+/("<!--"|"/*") { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[</]+/{not_punct} { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[</]+/"<!--" { FEED(); BEGIN(INITIAL); return 1; }
<INITIAL,punct>[</]+/"/*" { FEED(); BEGIN(INITIAL); return 1; }
They all use what is apparently called "trailing context", which is equivalent to using REJECT and results in the non-resizable buffer that is causing these problems. Using REJECT is discouraged both for performance reasons and because of this fixed-size buffer.
Removing those lines, and punct from the %x line, gives us this crucial part of the diff in lex.linguist_yy.c, which removes the REJECT definition and the fixed-size buffer we are crashing on:
-#define YY_TRAILING_MASK 0x2000
-#define YY_TRAILING_HEAD_MASK 0x4000
-#define REJECT \
-{ \
-*yy_cp = yyg->yy_hold_char; /* undo effects of setting up yytext */ \
-yy_cp = yyg->yy_full_match; /* restore poss. backed-over text */ \
-yyg->yy_lp = yyg->yy_full_lp; /* restore orig. accepting pos. */ \
-yyg->yy_state_ptr = yyg->yy_full_state; /* restore orig. state */ \
-yy_current_state = *yyg->yy_state_ptr; /* restore curr. state */ \
-++yyg->yy_lp; \
-goto find_rule; \
-}
-
-#define yymore() (yyg->yy_more_flag = 1)
-#define YY_MORE_ADJ yyg->yy_more_len
+/* The intent behind this definition is that it'll catch
+ * any uses of REJECT which flex missed.
+ */
+#define REJECT reject_used_but_not_detected
+#define yymore() yymore_used_but_not_detected
+#define YY_MORE_ADJ 0
#define YY_RESTORE_YY_MORE_OFFSET
#line 1 "tokenizer.l"
#line 2 "tokenizer.l"
[... truncated for brevity ...]
@@ -1210,12 +1070,6 @@ YY_DECL
YY_USER_INIT;
#endif
- /* Create the reject buffer large enough to save one state per allowed character. */
- if ( ! yyg->yy_state_buf )
- yyg->yy_state_buf = (yy_state_type *)yyalloc(YY_STATE_BUF_SIZE , yyscanner);
- if ( ! yyg->yy_state_buf )
- YY_FATAL_ERROR( "out of dynamic memory in yylex()" );
-
if ( ! yyg->yy_start )
yyg->yy_start = 1; /* first start state */
@@ -1238,16 +1092,10 @@ YY_DECL
#line 58 "tokenizer.l"
... and we're back to a working tokeniser.
I've also discovered we've fixed this once before in https://github.com/github/linguist/pull/3900 🤭
The challenge now is to reimplement the functionality that was removed without using trailing contexts.
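One possible shape for such a rewrite (a hypothetical sketch of mine, not semantically identical to the original rules, particularly at end-of-input): match the punctuation run together with a single lookahead character, then hand the lookahead back with yyless() so the next rule sees it:
<INITIAL,punct>[-!#$%&*+,.:;=>?@\\^_`|~]+[^-!#$%&*+,.:;=>?@\\^_`|~\n] {
    yyless(yyleng - 1); /* give the lookahead character back to the input */
    FEED(); BEGIN(INITIAL); return 1;
}
Because the lookahead is consumed as part of the match and then pushed back, flex doesn't need trailing context here, and so doesn't generate the REJECT machinery or its fixed-size state buffer.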
I'm going to give it until the end of today to come up with a solution. If I can't find a working solution, I'm going to revert the punctuation parts introduced in https://github.com/github/linguist/pull/5060 - a little less accuracy in token matching is better than the crashes.
Sadly, I couldn't improve my newly found Lex skills enough to find a fix for these rules that keeps the functionality in the time allotted, so I've created https://github.com/github/linguist/pull/5956 to revert just those changes and add a crude test to catch this in the future.
I'm off next week, so I will merge it and make a new Linguist release in the week of 11 July 2022. I'm not sure how easy it will be to backport this release to older releases of GHES, but I will try to go back to at least 3.4. I'll update this issue with the results of this attempt.
If someone with greater Lex skills can come up with alternate rules for the five identified in that time, please feel free to do so and I'll review on my return.
Thanks @lildude! I was not able to spot the issue last time I tried debugging this.
I created https://github.com/github/linguist/pull/5969 with an alternative for punctuation tokenization, removing the use of trailing context.
I've backported just the gem update as far back as GHES 3.4 so this should be fixed in the next patch release.
I spoke too soon. It seems a lot of water has run under the bridge since GHES 3.4 was released, so a clean backport isn't possible. I can only get as far back as 3.5.
Moving over from discussions as this is definitely a bug.
Originally discussed in https://github.com/github/linguist/discussions/5876
As pointed out in the discussion, 7.13 was the first release after @smola made changes to the tokenizer in https://github.com/github/linguist/pull/5205, https://github.com/github/linguist/pull/5186, and https://github.com/github/linguist/pull/5061