natalie-lang / natalie

a work-in-progress Ruby compiler, written in Ruby and C++
https://natalie-lang.org
MIT License
942 stars 63 forks source link

Occasional Errono::EBADF when running tests #1452

Closed seven1m closed 1 year ago

seven1m commented 1 year ago

On my local machine about 1 out of 5 times I get this error:

        11: from spec/core/file/new_spec.rb:4:in `<main>'
        10: from /home/tim/pp/natalie/test/support/spec.rb:1469:in `block in block'
        9: from /home/tim/pp/natalie/test/support/spec.rb:1416:in `run_specs'
        8: from /home/tim/pp/natalie/test/support/spec.rb:1416:in `each'
        7: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in run_specs'
        6: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `reverse_each'
        5: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in block in run_specs'
        4: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `reverse_each'
        3: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in block in block in run_specs'
        2: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `call'
        1: from spec/core/file/new_spec.rb:13:in `block in block in block'
spec/core/file/new_spec.rb:13:in `close': Bad file descriptor (Errno::EBADF)

It's also occurring sometimes on CI: https://github.com/natalie-lang/natalie/actions/runs/6763302781/job/18381713844

herwinw commented 1 year ago

Of course this one has never happened to me. In the CI link, I can see it breaks on this test:

it "raises an Errorno::EEXIST if the file exists when create a new file with File::CREAT|File::EXCL" do
  -> { @fh = File.new(@file, File::CREAT|File::EXCL) }.should raise_error(Errno::EEXIST)
end

It almost looks like it raises an error, but still assigns the faulty File object to @fh, then the :after loop of the spec tries to close this faulty file descriptor and this results in a EBADF.

Maybe someone who's seen it crash on their system could try this loop just to see if it always breaks on this spec:

while bin/natalie spec/core/file/new_spec.rb; do true; done
herwinw commented 1 year ago

(I just pushed a fix to ruby-spec to change Errorno:: to Errno::, just noticed that typo)

seven1m commented 1 year ago

I've now started getting it almost every time, for whatever reason:

→ bin/natalie spec/core/file/new_spec.rb
...........Traceback (most recent call last):
        11: from spec/core/file/new_spec.rb:4:in `<main>'
        10: from /home/tim/pp/natalie/test/support/spec.rb:1469:in `block in block'
        9: from /home/tim/pp/natalie/test/support/spec.rb:1416:in `run_specs'
        8: from /home/tim/pp/natalie/test/support/spec.rb:1416:in `each'
        7: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in run_specs'
        6: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `reverse_each'
        5: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in block in run_specs'
        4: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `reverse_each'
        3: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in block in block in run_specs'
        2: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `call'
        1: from spec/core/file/new_spec.rb:13:in `block in block in block'
spec/core/file/new_spec.rb:13:in `close': Bad file descriptor (Errno::EBADF)

If I rescue the failure and print the @fs...

after :each do
  @fh.close if @fh
  rm_r @file
rescue
  p(fh: @fh)
  p(closed?: @fh.closed?)
  p(fileno: @fh.fileno)
  p(eof?: @fh.eof?)
end

...this is the result:

→ bin/natalie spec/core/file/new_spec.rb

...........
{:fh=>#<File:/home/tim/pp/natalie/rubyspec_temp/859680/11-test.txt>}
{:closed?=>false}
{:fileno=>3}
Traceback (most recent call last):
        11: from spec/core/file/new_spec.rb:4:in `<main>'
        10: from /home/tim/pp/natalie/test/support/spec.rb:1469:in `block in block'
        9: from /home/tim/pp/natalie/test/support/spec.rb:1416:in `run_specs'
        8: from /home/tim/pp/natalie/test/support/spec.rb:1416:in `each'
        7: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in run_specs'
        6: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `reverse_each'
        5: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in block in run_specs'
        4: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `reverse_each'
        3: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `block in block in block in run_specs'
        2: from /home/tim/pp/natalie/test/support/spec.rb:1452:in `call'
        1: from spec/core/file/new_spec.rb:19:in `block in block in block'
spec/core/file/new_spec.rb:19:in `eof?': Bad file descriptor (Errno::EBADF)
herwinw commented 1 year ago
spec/core/file/new_spec.rb:19:in `eof?': Bad file descriptor (Errno::EBADF)

That's kind of weird, line 19 looks like this for me:

 17   it "returns a new File with mode string" do
 18     @fh = File.new(@file, 'w')
 19     @fh.should be_kind_of(File)
 20     File.should.exist?(@file)
 21   end

So how does that relate to eof??

Update: nevermind, that's the debug info you added. I'm going to get another cup of coffee

herwinw commented 1 year ago

Could you add this line as the first line of the rescue block?

puts(`lsof -p #{$$}`)

That prints all the current file descriptors, which I guess should not include 3 in the error case.

seven1m commented 1 year ago

Sure:

Elsof: WARNING: can't stat() overlay file system /var/lib/docker/overlay2/4d8875f8fd8bed574a6f1cd416f1b474148131ac86e4e683169e9135b0025675/merged
      Output information may be incomplete.
lsof: WARNING: can't stat() nsfs file system /run/docker/netns/054b5fd3b0cb
      Output information may be incomplete.
COMMAND      PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
natalie20 919732  tim  cwd    DIR  259,2     4096 6955954 /home/tim/pp/natalie
natalie20 919732  tim  rtd    DIR  259,2     4096       2 /
natalie20 919732  tim  txt    REG  259,2  1652088 2622209 /tmp/natalie20231107-919697-4shdam
natalie20 919732  tim  mem    REG  259,2  5716304  264308 /usr/lib/locale/locale-archive
natalie20 919732  tim  mem    REG  259,2  2105184  263605 /usr/lib/x86_64-linux-gnu/libc.so.6
natalie20 919732  tim  mem    REG  259,2   957008  264212 /usr/lib/x86_64-linux-gnu/libm.so.6
natalie20 919732  tim  mem    REG  259,2  2522480  265943 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
natalie20 919732  tim  mem    REG  259,2   141872  266255 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
natalie20 919732  tim  mem    REG  259,2   198664  262914 /usr/lib/x86_64-linux-gnu/libcrypt.so.1.1.0
natalie20 919732  tim  mem    REG  259,2  1514968 6954833 /home/tim/pp/natalie/build/onigmo/lib/libonigmo.so.6.0.5
natalie20 919732  tim  mem    REG  259,2 20404136 6976121 /home/tim/pp/natalie/build/libnatalie_base.so
natalie20 919732  tim  mem    REG  259,2   232568  263234 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
natalie20 919732  tim    0u   CHR  136,3      0t0       6 /dev/pts/3
natalie20 919732  tim    1u   CHR  136,3      0t0       6 /dev/pts/3
natalie20 919732  tim    2u   CHR  136,3      0t0       6 /dev/pts/3
natalie20 919732  tim    3r  FIFO   0,14      0t0 3145468 pipe
{:fh=>#<File:/home/tim/pp/natalie/rubyspec_temp/919732/17-test.txt>}
{:fileno=>3}
seven1m commented 1 year ago

If I start printing stuff, the heisenbug goes away!

I added a dbg("opened {} as fileno {}", filename, fileno); on line 61 of FileObject::initialize and all the specs started passing:

opened /home/tim/pp/natalie/rubyspec_temp/931780/1-test.txt as fileno 3  
opened /home/tim/pp/natalie/rubyspec_temp/931780/1-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/2-test.txt as fileno 3 
opened /home/tim/pp/natalie/rubyspec_temp/931780/2-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/3-test.txt as fileno 3 
opened /home/tim/pp/natalie/rubyspec_temp/931780/3-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/4-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/4-test.txt as fileno 3  
opened /home/tim/pp/natalie/rubyspec_temp/931780/4-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/5-test.txt as fileno 3 
opened /home/tim/pp/natalie/rubyspec_temp/931780/5-test.txt as fileno 3  
opened /home/tim/pp/natalie/rubyspec_temp/931780/5-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/6-test.txt as fileno 3  
opened /home/tim/pp/natalie/rubyspec_temp/931780/6-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/7-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/7-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/8-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/8-test.txt as fileno 3  
.opened /home/tim/pp/natalie/rubyspec_temp/931780/9-test.txt as fileno 3                                                                                                                                           
opened /home/tim/pp/natalie/rubyspec_temp/931780/9-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/10-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/10-test.txt as fileno -1
.opened /home/tim/pp/natalie/rubyspec_temp/931780/11-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/11-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/12-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/12-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/13-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/13-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/14-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/14-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/15-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/15-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/16-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/16-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/17-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/17-test.txt as fileno 3
{:before=>3}
{:before=>3}
.opened /home/tim/pp/natalie/rubyspec_temp/931780/18-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/18-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/19-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/19-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/20-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/20-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/21-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/21-test.txt as fileno -1
opened /home/tim/pp/natalie/rubyspec_temp/931780/21-test.txt as fileno -1
.opened /home/tim/pp/natalie/rubyspec_temp/931780/22-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/22-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/23-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/24-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/25-test.txt as fileno 3
.opened /home/tim/pp/natalie/rubyspec_temp/931780/26-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/26-test.txt as fileno 3
*.opened /home/tim/pp/natalie/rubyspec_temp/931780/27-test.txt as fileno 3
opened /home/tim/pp/natalie/rubyspec_temp/931780/ as fileno 3
.

27 spec(s) passed.
1 spec(s) skipped.
seven1m commented 1 year ago

OK, and one last bit of data. Just to be sure my system doesn't always use fileno 3, I ran this one-liner:

→ bin/natalie -e "f1 = File.open('README.md'); f2 = File.open('LICENSE'); f3 = File.open('CONTRIBUTING.md')"
opened README.md as fileno 3
opened LICENSE as fileno 4
opened CONTRIBUTING.md as fileno 5
seven1m commented 1 year ago

Last last bit of data: I can not reproduce this in the Docker container with rake docker_bash. The spec always passes, even if I run it in a while loop a hundred times. And the fileno inside the Docker container is 3 also. I'm thoroughly confused!

seven1m commented 1 year ago

Through some high-tech printf debugging, I've narrowed the cause down to this:

~IoObject() closing 3 () object_id = 93939119179040
~IoObject() closed 3 with result 0                                                                       
IoObject::close() closing 3 object_id=93939119172896
IoObject::close() closed 3 with result -1

Basically, the ~IoObject() destructor on some other object with fileno 3 is closing it before the IoObject::close() can do it.

Now, why there's still an object with m_fileno = 3 around to be GC'd is my next question...

seven1m commented 1 year ago

I figured it out! PR incoming...

seven1m commented 1 year ago

The culprit was this:

https://github.com/natalie-lang/natalie/blob/master/spec/core/file/new_spec.rb#L75-L76

Those lines effectively create a copy of the fileno, and in combination with our destructor on IoObject (called by the GC), the fileno was getting closed even though autoclose was set to false.

herwinw commented 1 year ago

Note to self: #1270 sounds pretty similar to this issue. I was reminded of this one because I ran into issues with TCPSocket.open(&block) We might be able to remove the call to fsync(2) from this code.