livingsocial / rake-pipeline

An extension to Rake for dealing with a directory of inputs, a number of filters, and a directory of outputs
MIT License

Rake tasks are invoked when they shouldn't be. #95

Closed ahawkins closed 12 years ago

ahawkins commented 12 years ago

I've been chasing down an issue related to "caching" in my application's pipeline. I think I've finally found the problem. I'm assuming this is true: I build the pipeline, I don't change anything, and building the pipeline again should "skip" everything since no input files have changed. That is: only rebuild the pipeline if input files have changed.

This behavior is accomplished through Rake::FileTask. Each filter in the pipeline is a Rake task that sets up a task with prerequisites (the input files). Rake::FileTask decides if a task should be invoked by looking at itself and its prerequisites. If any one of those needs to be invoked, it will be done. Rake::FileTask simply checks File.mtime to see if a file has changed: it compares the prerequisite's timestamp to the file itself. If the prerequisite is newer than the file itself, the task needs to be invoked again. This is the call to super in Rake::Pipeline::DynamicFileTask#needed?. The class also does its own logic, but that's unimportant here because the method returns early if super is true.
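For reference, this is roughly the check Rake performs (paraphrased and slightly simplified from file_task.rb, linked below; not the verbatim source):

# Paraphrased from Rake::FileTask: a file task is needed if its
# output is missing or any prerequisite has a newer mtime than the
# output itself.
def needed?
  !File.exist?(name) || out_of_date?(timestamp)
end

def out_of_date?(stamp)
  prerequisites.any? { |pr| application[pr, @scope].timestamp > stamp }
end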

Internally, all files in the build process are copied into a tmp directory. This makes sense. However, there is a problem: the tmp directory counter is per process, and the variable is incremented automatically every time the method is called. Here's the link to the method: https://github.com/livingsocial/rake-pipeline/blob/master/lib/rake-pipeline.rb#L389. This means that isolated builds of the same pipeline (same inputs and Assetfile) will generate the same sequence of temp directories, so the intermediate build files will be the same and tasks will be skipped accordingly. The problem occurs when multiple builds happen in the same process. This is the exact use case for the preview server--however, I'm not sure if that's the intended use case.
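To make the problem concrete, the method behaves roughly like this (a paraphrase; the exact names in rake-pipeline.rb may differ):

# Paraphrase, not the verbatim source. Because @@tmp_id is a class
# variable, it is shared by every pipeline in the process and never
# resets between builds.
@@tmp_id = 0

def generate_tmpname
  "rake-pipeline-tmp-#{@@tmp_id += 1}"
end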

Building the pipeline multiple times in the same process continually increments the class variable. The second time the pipeline is invoked, files are placed into new temporary directories. The dependencies are also wired up again, but since a new file is created every time, its File.mtime will always be newer than before, causing rake tasks to be invoked when they shouldn't be. I think this is a major problem.

Here is my debugging session from deep inside rake to find out what bit of code is actually making that decision. It's from inside this method: https://github.com/jimweirich/rake/blob/master/lib/rake/file_task.rb#L32

# output file from the first run of the pipeline
(rdb:1) name
"/Users/adam/radium/frontend/tmp/builds/javascript/vendor.js"

# and its timestamp (it's further in the past because it was written during the first build)
(rdb:1) timestamp
2012-10-11 22:01:50 +0200

# its prereq, dumped into a fresh new tmp directory
(rdb:1) application[prereq, @scope].name
"/Users/adam/radium/frontend/tmp/rake-pipeline-e482cc0c93b30eef7b1d05fc2b90385121075db2/rake-pipeline-tmp-26/vendor.js"

# and its timestamp; this now fails the comparison and everything is built again
(rdb:1) application[prereq, @scope].timestamp
2012-10-11 22:01:52 +0200
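For anyone who wants to reproduce this, a minimal script-shaped sketch (assuming the usual Project API from the README; invoke_clean stands in for however the preview server triggers rebuilds):

require "rake-pipeline"

# Build the same project twice in one process. The class-level counter
# keeps climbing, so the second build lands in brand-new tmp
# directories, every intermediate file gets a fresh mtime, and all
# tasks fire again even though no input changed.
project = Rake::Pipeline::Project.new("Assetfile")
project.invoke_clean # intermediate files land in rake-pipeline-tmp-1, -2, ...
project.invoke_clean # ...and now in fresh, higher-numbered tmp dirs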

Here is the assetfile used in my project: https://github.com/radiumsoftware/iridium/blob/master/lib/iridium/Assetfile

@dudleyf @wycats can you confirm this is a bug?

ahawkins commented 12 years ago

I've done more digging into this issue and have discovered that there are three real issues at play here. They are all triggered by having multiple pipelines in a project (that is, multiple input blocks in the Assetfile).

The major issue is the one stated in the original post in this thread. This is a two-step fix. Initially it may seem trivial to simply reset @@tmp_id with every call to invoke_clean. That doesn't work, because invoke_clean may be called multiple times during the compilation of one project: declaring an input block in the Assetfile creates a new pipeline in the project, and Project#invoke_clean loops over all the pipelines and calls invoke_clean on each. This is a problem because temporary directories are not unique to each pipeline.

This can be fixed by one refactoring: move temporary directory generation to the pipeline instance. Here is the problem method: https://github.com/livingsocial/rake-pipeline/blob/master/lib/rake-pipeline.rb#L389. That creates a secondary problem: temporary directory names must be unique to the pipeline, yet stable across processes. This was previously accomplished by keeping @@tmp_id in a class variable and incrementing it on every call. The solution here is to introduce a "fingerprint" for each pipeline in the project; a hash of all the input files can be used as the fingerprint. @@tmp_id should become @tmp_id and be incremented on every call as before, but reset back to 0 when invoke_clean is called. This ensures that every pipeline will generate the exact same tmp directory progression on every invocation in every process. Here is a patch:

# NOTE: Digest::SHA1 needs `require "digest/sha1"` at the top of the file.
def fingerprint
  @fingerprint ||= Digest::SHA1.hexdigest(eligible_input_files.collect(&:fullpath).join("-"))[0..7]
end

# Generate a new temporary directory name.
#
# @return [String] a unique temporary directory name
def generate_tmpname
  "pipeline-tmp-#{fingerprint}-#{@tmp_id += 1}"
end

Hash generation may be slow, but it will work as an initial patch. This method should be reworked to be faster.
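For the reset half of the fix, a minimal sketch; the body of invoke_clean here is illustrative only, standing in for whatever that method already does:

# Illustrative sketch, not the library's actual implementation: reset
# the per-instance counter at the start of a clean invocation so every
# build of this pipeline walks the same tmp directory progression
# (pipeline-tmp-<fingerprint>-1, -2, ...).
def invoke_clean
  @tmp_id = 0
  invoke
end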

Solving problem #1 means the super call in DynamicFileTask#needed? will return true when it should (as detailed in the initial post). We must now turn our focus to the #needed? method itself: https://github.com/livingsocial/rake-pipeline/blob/master/lib/rake-pipeline/dynamic_file_task.rb#L66. There is a logic error in this method as it stands. First of all, not all tasks will actually be dynamic. For example, a concat is registered as a dynamic task but has no dynamic dependencies. This implementation detail cascades down through the method. I propose a new method, dynamic?, whose purpose is to return true only if the associated filter actually has dynamic dependencies. DynamicFileTask currently only checks for the presence of a dynamic block to decide whether it's actually dynamic. That does not work: every single task registers a dynamic block, as seen here: https://github.com/livingsocial/rake-pipeline/blob/master/lib/rake-pipeline/filter.rb#L197. This triggers logic branches inside the DynamicFileTask class when they should not be triggered; concat or copy filters have no dynamic dependencies. Here's the new dynamic? method:

def dynamic?
  has_dynamic_block? && !invoke_dynamic_block.empty?
end

The #needed? method should be refactored as such:

def needed?
  # Tasks without dynamic deps are just regular file tasks.
  # Act accordingly.
  return super unless dynamic?

  # If we have no manifest, this file task is needed
  return true unless last_manifest_entry

  # If any of this task's dynamic dependencies have changed,
  # this file task is needed
  last_manifest_entry.deps.each do |dep, time|
    if File.mtime(dep) > time
      return true
    end
  end

  # Otherwise, it's not needed
  false
end

This solves the logic errors. At this point, #needed? is working almost correctly. The third issue is related to manifest generation.

There is an issue with manifests and multiple pipelines. The problem is here: https://github.com/livingsocial/rake-pipeline/blob/master/lib/rake-pipeline/project.rb#L130. The manifest is actually written once, at the end of all pipelines. This means that calls to last_manifest_entry inside DynamicFileTask#needed? will return false in cases with multiple pipelines. The manifest needs to become pipeline-specific and be written at the end of the invoke method, and the manifest file should be written with the same fingerprint.

I will demonstrate this problem through the Iridium Assetfile found here: https://github.com/radiumsoftware/iridium/blob/master/lib/iridium/Assetfile. The last step in the entire pipeline is to copy everything, and there are earlier pipelines with dynamic dependencies (sass). Invoking the pipeline once generates one manifest.json whose contents correspond (incorrectly, IMO) to all the inputs of the final pipeline--all the optimized assets, in Iridium's case. Invoking the pipeline again creates the problem: when the sass filter is hit, it correctly attempts to read the manifest, but the manifest is wrong because it was generated from a different pipeline. I have not looked into exactly how to do this refactoring yet.
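As a rough sketch of the direction, though (all names here--manifest_path, tmp_root, write_manifest--are hypothetical stand-ins, not the library's actual API):

# Hypothetical sketch: each pipeline persists its own manifest, keyed
# by the same fingerprint as its tmp directories, at the end of its
# own invoke instead of once per project.
def manifest_path
  File.join(tmp_root, "manifest-#{fingerprint}.json")
end

def invoke
  # ... build this pipeline's file tasks as today ...
  write_manifest(manifest_path) # per pipeline, not per project
end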

All in all, these issues can be resolved by moving generation of temporary directories to the instance level, fixing #needed? with #dynamic? as described above, and making manifests a pipeline concern instead of a project concern. I think implementing these three steps will make the pipeline work the way many people expect.

@wycats there is your epic update.

wagenet commented 12 years ago

@twinturbo should this be closed since #99 was merged?

ahawkins commented 12 years ago

@wagenet ya you can close this

wagenet commented 12 years ago

@twinturbo I can't close it. Don't have admin :( Can't you close tickets you made?

ahawkins commented 12 years ago

@wagenet I totally forgot that I made this ticket. I got used to just asking you to close it. Anyways #117 closes this.