codylane commented 4 years ago

Problem Statement:

GH-ISSUE-218

Technical Review Synopsis:

GH-ISSUE-218

chucknelson commented 4 years ago

@codylane - Picking up conversation here - additional info:

Same error happens with Vagrant 2.2.8 and 2.2.9.
Using Windows 7 (😭 )
Since it does seem related to a core Ruby feature (Tempfile), I wonder if it's related to the embedded Ruby version Vagrant uses during commands - for 2.2.8 they seem to have moved from Ruby 2.4.9 to Ruby 2.6.6.

codylane commented 4 years ago

Bummer. Ok, let's keep digging into this as it may be a ruby thing.

I know how the tempfile support is supposed to work in C but I don't really understand the ruby version all that well. Let me dig a little in the ruby code base and see what I can grok/understand.

I'm not much of a windows guy but I'll try and find a windows 7 box and see if I can reproduce this issue. I'm starting to think this may affect every windows user.

Thanks for sharing the link to the ruby docs as well. This line is of interest to me

Good practices

Explicit close

When a Tempfile object is garbage collected, or when the Ruby interpreter exits, its associated temporary file is automatically deleted. This means that's it's unnecessary to explicitly delete a Tempfile after use, though it's good practice to do so: not explicitly deleting unused Tempfiles can potentially leave behind large amounts of tempfiles on the filesystem until they're garbage collected. The existence of these temp files can make it harder to determine a new Tempfile filename.

Therefore, one should always call unlink or close in an ensure block, like this:

file = Tempfile.new('foo')
begin
   # ...do something with file...
ensure
   file.close
   file.unlink   # deletes the temp file
end

Unlink after creation

On POSIX systems, it's possible to unlink a file right after creating it, and before closing it. This removes the filesystem entry without closing the file handle, so it ensures that only the processes that already had the file handle open can access the file's contents. It's strongly recommended that you do this if you do not want any other processes to be able to read from or write to the Tempfile, and you do not need to know the Tempfile's filename either.

For example, a practical use case for unlink-after-creation would be this: you need a large byte buffer that's too large to comfortably fit in RAM, e.g. when you're writing a web server and you want to buffer the client's file upload data.

Please refer to unlink for more information and a code example.

Unlink-before-close

On POSIX systems it's possible to unlink a file before closing it. This practice is explained in detail in the Tempfile overview (section “Unlink after creation”); please refer there for more information.

However, unlink-before-close may not be supported on non-POSIX operating systems. Microsoft Windows is the most notable case: unlinking a non-closed file will result in an error, which this method will silently ignore. If you want to practice unlink-before-close whenever possible, then you should write code like this:

file = Tempfile.new('foo')
file.unlink   # On Windows this silently fails.
begin
   # ... do something with file ...
ensure
   file.close!   # Closes the file handle. If the file wasn't unlinked
                 # because #unlink failed, then this method will attempt
                 # to do so again.
end

Vagrant-proxyconf's implementation

require 'tempfile'

def tempfile(content)
  Tempfile.new("vagrant").tap do |temp|
    temp.binmode
    temp.write(content)
    temp.close
  end
end

I suspect the close is what's causing the problems for windows users. I have a feeling this may not be an atomic operation on windows. I'd love to setup a simulation that runs a script 1000 times and reports the number of times of pass and fail of each run. Let me see if I can whip something up.

codylane commented 4 years ago

Simulation Code

I've not had much luck setting up Windows and I've lost a few hours to trying. So I'm wondering if you might be able to run this on you workstation for me and paste the results?

Simulation Code

#!/usr/bin/env ruby

require 'date'
require 'json'
require 'tempfile'

$DEFAULT_RUNS = 1000

def tempfile(content)

  Tempfile.new("vagrant").tap do |temp|
    temp.binmode
    temp.write(content)
    temp.close
  end

end

def now
    DateTime.now.to_time.to_f
end

def simulate(runs=nil)
  runs = runs || ARGV[0] || $DEFAULT_RUNS
  runs = runs.to_i

  metrics = {
    :pass         => 0,
    :fail         => 0,
    :temp_files   => [],
    :iterations   => runs,
    :started_at   => now,
    :ended_at     => now,
    :run_duration => 0.0,
  }

  (1..runs).each do |index|

    begin
      tmp = tempfile("foo")

      metrics[:pass] += 1
      metrics[:temp_files] << {:filename => tmp.path, :exists? => File.exists?(tmp.path)}

    rescue
      metrics[:fail] += 1
    end
  end

  metrics[:ended_at] = now,
  metrics[:run_duration] = metrics[:ended_at] - metrics[:started_at]

  metrics
end

puts JSON.pretty_generate(simulate())

Simulation Usage

You'll need to find the path to the vagrant ruby executable to run the code above and then use something like this to execute the test.

Since I'm a linux guy here's how I would do it on linux.

/opt/vagrant/embedded/bin/ruby test_tempfile.rb 10

This would run the simulation tests 10 times and report the temp file names and whether or not the temp file actually exists on disk when the function returns. From the docs I understand that the file will be deleted when the code goes out of scope, i.e. when either the garbage collected or when the program is terminated.

The output from the tests is a nice little JSON blob so that you could test for the file was actually deleted when the script terminates. I'll leave that for a task for you if you'd like to explore that. ;) I'll be happy to help if you'd like.

Here's what it looks like running from my MacOSX

{
  "pass": 10,
  "fail": 0,
  "temp_files": [
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-1xk0tm7",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-1e0nbdx",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-knef9a",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-1fp5iy8",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-pvm5ad",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-1nga0om",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-1sfm2ah",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-1dfh1y2",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-12670yj",
      "exists?": true
    },
    {
      "filename": "/var/folders/36/_92163xs6tq2dmhr5y6613rm0000gn/T/vagrant20200605-54071-yrbwqc",
      "exists?": true
    }
  ],
  "iterations": 10,
  "started_at": 1591381460.020482,
  "ended_at": [
    1591381460.024221,
    0.0017809867858886719
  ],
  "run_duration": 0.0017809867858886719
}

chucknelson commented 4 years ago

@codylane - Tried out this script with both versions of embedded ruby and can't get it to fail, regardless of how many iterations I specify :frowning:

I'm able to reproduce the error pretty easily with vagrant up or vagrant provision, so at least it's consistently happening! I'll poke around more as I find time - if you have anything else you want me to test, let me know.

For reference, here are the ruby versions Vagrant is using:

# Vagrant 2.2.7
C:\HashiCorp\Vagrant\embedded\mingw64\bin\ruby --version
ruby 2.4.9p362 (2019-10-02 revision 67824) [x64-mingw32]

# Vagrant 2.2.8
C:\HashiCorp\Vagrant\embedded\mingw64\bin\ruby --version
ruby 2.6.6p146 (2020-03-31 revision 67876) [x64-mingw32]

I added a count of exists?() returning true or false, and it's always matching with the script, for example:

{
  ...
  "iterations": 50000,
  "started_at": 1591623400.752482,
  "ended_at": 1591623619.456982,
  "run_duration": 218.70449995994568,
  "pass": 50000,
  "exists": 50000,
  "fail": 0
}

codylane commented 4 years ago

Bummer, I was really hoping we see some failures. Thank you so much for testing this and providing the results. :)

Ok, I've got another thing we can test for and I wanted to first try using the same code that vagrant-proxyconf uses which was the test result you pasted above. That seems to be working nicely if you the running process is in the same scope.

For the next test, we will actually simulate what vagrant-proxyconf does I'll create two separate *.rb files one that contains the tempfile code and one that consumes the code in a separate file which will also be our test runner. Hopefully we will see some failures from these new tests. Let me see what I can whip up.

codylane commented 4 years ago

Actually, let's scratch the testing and go straight into full debugging. I have a feeling this maybe one that is hard to track down through testing.

That said, are you familiar with debugging in ruby? If not, maybe this can help? You would have to install the pry gem into the vagrant ruby for this to work. gem.exe install pry and gem install rb-readline would do the trick as long as you can find the path to gem.exe command.

Next we will add a few break points in the current version of the code and I assume you are using the latest version 2.0.7? If so, then the next hurdle is finding out where the heck vagrant-proxyconf is installed on disk and then you'll add the following require 'pry'; binding.pry before the line line. If all goes well, the next time you run vagrant provision assuming the VM is already running or if the VM fires up you will drop right into the debugger. What you will be looking for is the current state of the variable @docker_systemd_config which you should just be able to copy and paste in the debugger and hit enter and it should tell you it's a String and a File that exists on disk. You can test this by entering the following in the debugger. If you wouldn't mind posting the results of this session perhaps we can find the root cause and get in a fix.

@docker_systemd_config.class.to_s
File.exists?(@docker_systemd_config)

chucknelson commented 4 years ago

@codylane - I did the more low-tech version and just put a few logging statements in there :stuck_out_tongue:

...
@machine.communicate.tap do |comm|
    ###
    logger.warn("Attempting docker systemd config upload => #{@docker_systemd_config.inspect}")
    logger.warn("Does it exist? => #{File.exist?(@docker_systemd_config)}")
    logger.warn("Opening #{@docker_systemd_config}")
    my_file = File.open(@docker_systemd_config, "rb")
    logger.warn("File size => #{my_file.size}")
    logger.warn("Closing #{@docker_systemd_config}")
    my_file.close
    ###

    comm.sudo("mkdir -p /etc/systemd/system/docker.service.d")
    comm.upload(@docker_systemd_config, "/tmp/vagrant-proxyconf-docker-systemd-config")
    ...

Initial findings are...confusing! I'm still playing around, but here is what it's showing for me for a failure:

==> default: Configuring proxy for Docker...
 WARN proxyconf: Attempting docker systemd config upload => "C:/Users/D108059/AppData/Local/Temp/vagrant20200610-16676-7ntkda"
 WARN proxyconf: Does it exist? => true
 WARN proxyconf: Opening C:/Users/D108059/AppData/Local/Temp/vagrant20200610-16676-7ntkda
 WARN proxyconf: File size => 293
 WARN proxyconf: Closing C:/Users/D108059/AppData/Local/Temp/vagrant20200610-16676-7ntkda
ERROR warden: Error occurred: No such file or directory @ rb_sysopen - C:/Users/D108059/AppData/Local/Temp/vagrant20200610-16676-7ntkda

I'm thinking since we only get the string path of the tempfile, maybe it's being garbage collected suddenly before the upload? Going to keep a reference to it and see if it makes a difference...

chucknelson commented 4 years ago

@codylane - Based on some further testing, I think this all comes down to ruby garbage collecting the Tempfile because we don't hold a reference to it.

Adding a reference to the Tempfile instance seems to eliminate the failures after many runs trying to reproduce (which would usually happen at least every other run).

# Test keeping a tempfile ref
@docker_systemd_config_tempfile_ref = tempfile(systemd_config)
@docker_systemd_config = @docker_systemd_config_tempfile_ref.path

Maybe we were just lucky with ruby 2.4.9, and GC changes/improvements up to 2.6.6 cause quicker cleanup and deletion of the files.

codylane commented 4 years ago

Excellent work finding the culprit, that should be a pretty easy fix. ;) Thank you for taking the time to report your findings as well.

I also have been suspecting GC or scoping issue and our simulation code for the tests was all running in the same scope and thus just working. I probably should have mentioned this but in the refactor I have a test fixture that I created that has a different tempfile implementation that should work on windows and it uses block scope instead of running process scope. So all my tests are executed in block scope and when that block goes out of scope the file is automatically deleted.

Here's the tests fixture code:

def create_tempdir(basename="", tmpdir=nil, mode: 0, **options)
  tmpfile = Tempfile.new(basename=basename, tmpdir=tmpdir, mode: mode, **options)

  begin
    yield File.dirname(tmpfile.path)

  ensure
    tmpfile.close
    tmpfile.unlink

  end
end

and this is how I use it in my tests

create_tempfile(:content => config_json) do |temp_filename|
  expect(File.exists?(temp_filename)).to eq true
  expect(File.read(temp_filename)).to eq config_json
  expect(File.exists?(@results[:fetched_docker_client_config_json])).to eq true
  ...
end

I suspect if we use the replace tempfile creation in base.rb with the code above and then use the block notation it will also work? The only downside is that we would have to find all the tempfile references and change those to use the block notation instead. IMHO, I've had really good luck with this implementation in other languages like python and I feel it helps ensure code readability a little bit because the context manager can clean up after itself even when a exception occurs which isn't something we get with the current implementation of tempfile.

I'm curious if you have cycles to try this for me? I'm still struggling getting a windows box up and running that actually works. As soon as I open up a bash terminal on windows using 'git bash' and I type a command like vagrant status it takes like 15 seconds to come back. It's crazy, it's an Intel I3 with 8G ram and a fusion SSHD drive so it shouldn't be that slow. Ugh, windows! ;)

chucknelson commented 4 years ago

@codylane - Oh yeah, vagrant commands taking a while in Windows is pretty normal from what I've experienced. Not sure if it's because it's doing Powershell behind the scenes or what :-/

As for the solution to this...I propose something simpler and maybe more consistent with what is already being done. Can we just change @docker_systemd_config to be a the actual Tempfile reference instead of the path as a string? The only other thing to change is referencing .path in the upload method instead of just the variable.

This would be consistent with how tempfile() is being used in this same file, like here on line 95 for the client config file: https://github.com/tmatilai/vagrant-proxyconf/blob/7bcdb8471ddf15b2859cdfb33b5571c74eae3e55/lib/vagrant-proxyconf/action/configure_docker_proxy.rb#L95

I'd even argue that the variable should probably just be named @docker_client_config.

Also, even if something bad happens, the files should still be cleaned up when the process exits I think.

Does that seem OK, or do you have grander refactoring plans you'd rather do?

codylane commented 4 years ago

No problem, I like it and I agree you solution is much simpler, it's also the original implementation. I changed it, so all of this is my fault. I'll get it fixed today and release 2.0.8 without the docker proxy refactor which will be pushed to 2.0.9.

chucknelson commented 4 years ago

@codylane - Hey now, I'm just glad you're maintaining this! Thanks for all of your help! :)

codylane commented 4 years ago

You bet my friend. I'm glad to help. I made a promise that I'm going to do my part volunteering where I can help and hopefully continue to maintain code that people use and like to use that I've contributed too. I don't even use this plugin anymore but I'm happy to continue to help maintain it.

As always, we cannot do it without without the assume open source community and people like you. We would love to continue having your help with this project and I'm sure others will appreciate it as well.

I'd love to stay in touch and if you'd like to help join this project I'm more than happy to grant you some perms. We just need the blessing of the owner @tmatilai

chucknelson commented 4 years ago

@codylane - Cool - I'm definitely open to helping you all maintain when needed, whatever you all deem appropriate I'm good with!

codylane commented 4 years ago

@chucknelson - Okay, so I've got the code staged but I've got some failing tests and I'm going to step away from the computer for the evening. I was really hoping to have this out the door today but I really don't want to cause any more problems for folks and I want to confirm that I'm not just getting to anxious to fix things. I'll be on this first thing tomorrow morning MDT.

tmatilai commented 4 years ago

You bet my friend. I'm glad to help. I made a promise that I'm going to do my part volunteering where I can help and hopefully continue to maintain code that people use and like to use that I've contributed too. I don't even use this plugin anymore but I'm happy to continue to help maintain it.

Thank you so much for your work! You are amazing and very kind!

As always, we cannot do it without without the assume open source community and people like you. We would love to continue having your help with this project and I'm sure others will appreciate it as well.

👍

I'd love to stay in touch and if you'd like to help join this project I'm more than happy to grant you some perms. We just need the blessing of the owner @tmatilai

I send invitation for @chucknelson to join the group of collaborators. Welcome onboard!

I myself have found very little time and energy for OSS work lately, and me neither am using Vagrant for this plugin, so I don't even follow the issues too closely. Especially the Windows and Docker bits have been contributed by other people.

So if I don't seem to respond to even pings here, you can always send me email directly. 😊

codylane commented 4 years ago

Released 2.0.8. closing request. Thanks again for your help.

tmatilai / vagrant-proxyconf

tempfile implementation on windows doesn't always seem to work #220

Problem Statement:

Technical Review Synopsis:

Good practices

Unlink after creation

Unlink-before-close

Vagrant-proxyconf's implementation

Simulation Code

Simulation Code

Simulation Usage