samg / diffy

Easy Diffing in Ruby
http://rubygems.org/gems/diffy
MIT License
1.27k stars 104 forks source link

Lots of tempfiles being left around #39

Closed dkowis closed 10 years ago

dkowis commented 10 years ago

We ended up filling up the /tmp partition on a production server, seems as it creates a whole bunch of temp files that aren't cleaned up.

I don't have specific details, but I bet I could reproduce it by doing lots and lots of diffs in a short time period (which is something I'm doing.)

jonkelleyatrackspace commented 10 years ago

From ruby 1.9.3 documentation: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/tempfile/rdoc/Tempfile.html From ruby docs, they suggest calling unlink in an ensure block.

e,g,

file = Tempfile.new('my_temp')
begin
   ...do something with file...
ensure
   file.close
   file.unlink   # deletes the temp file so your /tmp directory doesn't fill up
end

I could be wrong, but from diffy source, it seems like diffy has an explicit design to not unlink / remove old resources. I do not see an .unlink method called anywhere within the source tree. I suggest not using this in production until this is fixed. :( If you have to, better setup a cronjob.

Diffy source (an unlink method needs to be added):

def tempfile(string)
  t = Tempfile.new('diffy')
  # ensure tempfiles aren't unlinked when GC runs by maintaining a
  # reference to them.
  @tempfiles ||=[]
  @tempfiles.push(t)
  t.print(string)
  t.flush
  t.close
  t.path
end

Unlink comes from the POSIX API. unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open the file is deleted and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have the file open the file will remain in existence until the last file descriptor referring to it is closed.

If the name referred to a socket, fifo or device the name for it is removed but processes which have the object open may continue to use it.

http://www.tutorialspoint.com/unix_system_calls/unlink.htm

jonkelleyatrackspace commented 10 years ago

There's a class out there that makes it better. Called Better::Tempfile. http://better.rubyforge.org/classes/Better/Tempfile.html

Seems ruby tempfile stdlib has issues.

Ruby 1.8’s version can generate “weird” path names that can confuse certain command line tools such as Curl. Better::Tempfile is based on Ruby 1.9’s version and generates saner filenames. Ruby 1.8’s version has a bug which makes unlink-before-close (as described below) unusable: it raises an an exception when close is called if the tempfile was unlinked before. Ruby 1.9.1’s version closes the file when unlink is called. This makes unlink-before-close unusable. Ruby’s bundled version deletes the temporary file in its finalizer, even when unlink was called before. As a result it may potentially delete other Ruby processes’ temp files when it’s not supposed to.

samg commented 10 years ago

I initially wrote Diffy in the 1.8 days when tempfiles were automatically unlinked by a finalizer, and it seemed somewhat dangerous to attempt to unlink them explicitly. Seems like this behavior has changed (presumably for the better in more recent Rubies) and that Diffy should be updated to explicitly remove the tempfiles it creates after it no longer needs them.

I'll look into making this change, though I'd also be happy to accept a pull request.

dkowis commented 10 years ago

Interestingly enough @samg, we're running in 1.8 still, and it's still happening. It may be because we're using passenger, and this stuff is part of a display rendering, and it's LOTS of diffs.

It probably fits somewhat outside the original use case you had for this thing :)

samg commented 10 years ago

yeah, there's definitely some weirdness around how ruby handles Tempfiles.

In theory the files should be cleaned up when GC runs on the tempfile objects (i.e. after the Diffy::Diff objects are de-referenced and GC'd). It's possible that due to the nature of your app, these diffs are staying in memory and thus the unlink is never occurring (or only occurring after it's caused a system problem). It's also possible that there's some other reason that the "magic" unlinking is failing (e.g. processes being shutdown or killed before the finalizers can run).

I think a better approach would be for Diffy to clean up the files explicitly (as you suggest) after it's generated the diff content, being careful to catch any errors that result from files that have already been removed.

I wrote Diffy a while back for a small personal project and never used it at any significant scale. It's gained quite a bit of popularity since then, so I'd like to make sure it's a reasonably responsible citizen in others' production environments. ;-)

samg commented 10 years ago

I've just pushed version 3.0.3 which I believe will prevent the issue you experienced from occurring (it's no longer necessary to wait for GC for tempfiles to be deleted, rather they'll be unlinked as soon as the diff has been generated).

Let me know if you experience any other issues, or if this doesn't resolve the problem, and thanks for reporting this.

dkowis commented 10 years ago

@jonkelleyatrackspace Woot! Lets get this badboy into prod.