toy / image_optim

Optimize images using multiple utilities
https://github.com/toy/image_optim
MIT License

added cache_dir and cache_workers options to cache results #115

Closed: gpakosz closed this 8 years ago

gpakosz commented 8 years ago

:boom:

implemented #83

toy commented 8 years ago

Hi Gregory, sorry for the wait and thank you for the contribution! I've looked through the commit, but haven't yet had time to check it properly; I'll try to find time for that soon.

gpakosz commented 8 years ago

Hi, I rebased the branch onto the master branch.

toy commented 8 years ago

Hi Gregory, I am sorry for such a long wait.

A few things to discuss: I think it is better to create one cache file rather than 2 or 3; otherwise the write is not atomic, there are too many files, and running with different options would overwrite the cache. The name can be combined or concatenated from the options and binaries digests. I also think it is better to use sub-directories, as in .git/objects, instead of putting everything in one directory.

gpakosz commented 8 years ago

I followed every remark you made except using FSPath, because it didn't feel like an improvement to use it just to join paths, and I need FileUtils.copy_file anyway.

Files are now stored in #{@cache_dir}/#{digest[0..1]}/#{digest[2..-1]}, just like Git. Also, as suggested, there is now a single cached file per image, whose digest is computed by hashing the original file content, then updating the hash with the options and worker bins.
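The layout described above can be sketched as follows (the helper name and cache directory are illustrative, not image_optim's actual API): the first two hex characters of the digest become a sub-directory, exactly like .git/objects, so no single directory accumulates every entry.

```ruby
require 'digest'

# Hypothetical helper: fan the digest out into a two-level directory,
# .git/objects style.
def cached_path(cache_dir, digest)
  File.join(cache_dir, digest[0..1], digest[2..-1])
end

digest = Digest::SHA1.hexdigest('original image bytes')
cached_path('/var/cache/image_optim', digest)
```

For example, cached_path('/c', 'aabbcc') yields '/c/aa/bbcc'.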

I made a separate commit. Once it's reviewed, I suggest I squash the two commits before force-pushing the branch one last time.

gpakosz commented 8 years ago

Well, CI has errors unrelated to my changes; can you have a look?

    Installing rack 1.6.4
    An error occurred while installing mime-types-data (3.2016.0221), and Bundler
gpakosz commented 8 years ago

Hi Ivan, when do you think we can roll another round of review for this PR?

toy commented 8 years ago

Hi Gregory, thanks for poking me about this. I've already looked at it and checked how it works, but didn't manage to find time to write everything up. Points to discuss:

  1. Currently an already-optimised file will not be cached, as write returns if optimized is not present. A simple solution would be to just write the original file to the cache, but that looks like wasted disk space. Two other solutions, if using the same path, are to create an empty file (then atomic creation of the cache entry is a must) or to create a symlink to self. What do you think?
  2. Is the binary's SHA1 really better, or is the bin version enough?
  3. There is no need for Marshal, as the strings can just be joined together.
  4. Why FSPath:

    cached = @cache_dir / digest(original)
    return unless cached.exist?

instead of

    cached = File.join(@cache_dir, digest(original))
    return unless File.exist?(cached)

Also, FileUtils.copy_file is not atomic. A method similar to ImagePath#replace, which would create a temporary file in the cache directory, copy the optimized file over it, and rename it to the cache path, would make it atomic.
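The atomic write described above might look like this minimal sketch (the method name and error handling are illustrative, not the merged implementation): create a temp file in the cache directory, copy the optimized file over it, and rename it into place, since rename(2) is atomic within a filesystem.

```ruby
require 'fileutils'
require 'tempfile'

# Hypothetical atomic cache write: readers either see the old state or the
# complete new entry, never a half-written file.
def write_cache(optimized_path, cached_path)
  FileUtils.mkdir_p(File.dirname(cached_path))
  # Temp file lives in the cache directory so the final rename stays on the
  # same filesystem.
  tmp = Tempfile.create('cache', File.dirname(cached_path))
  begin
    tmp.close
    FileUtils.copy_file(optimized_path, tmp.path)
    File.rename(tmp.path, cached_path) # atomic on the same filesystem
  rescue
    File.unlink(tmp.path) if File.exist?(tmp.path)
    raise
  end
end
```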

gpakosz commented 8 years ago

Hi Ivan,

Here are my answers:

  1. I think I would go for a symlink.
  2. Using the binary's SHA1 seems more robust in case a binary fails to update its version information properly. Are you concerned about the performance of computing SHA1s?
  3. Marshal seems like a robust way of dumping state to a string. Again, are you concerned about performance? If so, note that Marshal will be faster than manually flattening hashes/arrays and joining strings.
  4. I'll look into FSPath.
toy commented 8 years ago
  1. Agreed.
  2. I was thinking more about two bins of the same version compiled differently. So to me it seems like deciding between being a bit too precise and not quite precise enough.
  3. I'm concerned about the Marshal format version, but Marshal also serves little purpose when the values are already just strings. It is also important to ensure that the final strings to digest can't differ for identical options/bins due to ordering.
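The ordering point above can be sketched like this (the helper name is illustrative): sort the option pairs before joining, so identical options always digest to the same string regardless of how the hash happens to be ordered.

```ruby
require 'digest'

# Hypothetical order-stable digest of an options hash: map each pair to a
# "key=value" string, sort, then join with a separator before hashing.
def options_digest(options)
  Digest::SHA1.hexdigest(options.map { |k, v| "#{k}=#{v}" }.sort.join("\n"))
end
```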
gpakosz commented 8 years ago

I implemented the changes discussed so far.

Note that the symlink only serves to avoid going through the workers again; optimize_image will still return nil. This is because I don't think it's a good idea for the behavior of optimize_image to depend on whether caching is enabled.

toy commented 8 years ago

Great! I'll do my best not to delay review. And sure, the cache should not change behaviour.

toy commented 8 years ago

Please also add descriptions of the two new options to the Options section of README.markdown and to OptionParser, so they are exposed on the command line.

gpakosz commented 8 years ago

I implemented what's been discussed, except FSPath.temp_file_path, because it seems wasteful to open file handles just for the sake of having a unique name.

gpakosz commented 8 years ago

I forgot to document the options properly; I'll do that later, along with adding a few tests to make sure cache_dir is a directory if it already exists.

gpakosz commented 8 years ago

Updated.

I also reworked the tests a bit. Since the cache file digest is computed by updating the file's SHA1 with the options and, optionally, the worker digests, there's no real distinction between "cached file doesn't exist" and "options and/or worker digests don't match".

toy commented 8 years ago

About the temporary file: it is an extra file handle (even worse on old Ruby versions, which created a locking directory), but the handle can also be used with FSPath.temp_file to write to the file instead of using FileUtils.mv, which will open a handle anyway. I also realised that optimising identical images in parallel will do wasteful work, so an even better solution would be a static tmp file name (like the cache path with a .tmp extension): exclusively lock that file, check that it still has zero size, run the block, write the data, and rename to the cache path.

gpakosz commented 8 years ago

I don't really understand what you dislike about the current tmp_name private method followed by FileUtils.mv, besides aesthetics.

Performance-wise, FileUtils.mv defers to File.rename if both src and dest are on the same filesystem. Otherwise it copies the file, and if it isn't already, that copy can be kept low-level, in the style of Java NIO.

Having a static @cache_dir / 'cached.tmp' temp file and locking with a mutex is really a waste imho: there's no reason to impose such lock contention.

When caching is enabled, it seems natural to move the optimized file to @cache_dir, and in fact, now that I think about it, the best choice is to use the basename of the optimized file, since it's already a unique temporary file name:

    tmp = @cache_dir / File.basename(optimized)
    FileUtils.mv(optimized, tmp, :force => true)
    File.rename(tmp, cached)
toy commented 8 years ago

When checking the complete process, I found a cache-related problem: without the cache, optimize_image returns a wrapped temp file path which optimize_image! uses to replace the original file, after which the temp file is removed as it is no longer needed. With the cache, the returned file is the path in the cache, so the cached file itself gets removed. Just try the image_optim binary on a folder with the cache enabled and check the cache directory. I suggest creating a class for cache paths that overrides replace to do the same, but without removing the file from the cache.

About locking, I may have been unclear: the cache directory will be shared by multiple parallel optimisations, and it can also be shared by multiple processes doing optimisation, so if an identical image is optimised in parallel, the tools will waste cycles overwriting the cache file with the same result. That is why I proposed locking (not a mutex, but flock), which also makes generating a unique name unnecessary. flock is not slow (tens of thousands of open-and-flock calls per second), especially compared to the time it takes to optimise an image. So use something like tmp = @cache_dir / "#{digest}.tmp", open it with File::CREAT | File::WRONLY, flock(File::LOCK_EX), repeat the file? and symlink? checks (as the cache file may have already been written), and only then yield and eventually move the result and rename, or create a symlink.
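The flock scheme described above could be sketched roughly like this (method name, symlink handling, and cleanup are illustrative, not the merged implementation): lock a per-digest .tmp file, re-check the cache under the lock, and only run the optimisation if nobody got there first.

```ruby
require 'fileutils'

# Hypothetical flock-guarded cache fetch: the block is the expensive
# optimisation and runs at most once per digest across processes.
def fetch(cache_dir, digest)
  cached = File.join(cache_dir, digest)
  tmp = "#{cached}.tmp"
  FileUtils.mkdir_p(cache_dir)
  File.open(tmp, File::CREAT | File::WRONLY) do |f|
    f.flock(File::LOCK_EX)            # blocks while another process writes
    unless File.exist?(cached)        # re-check: it may exist by now
      f.write(yield)                  # run the optimisation, write result
      f.flush
      File.rename(tmp, cached)        # atomic publish
    end
  end
  File.unlink(tmp) if File.exist?(tmp) # leftover when another process won
  cached
end
```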

FileUtils.mv is indeed good performance-wise; I was already thinking that the way ImagePath#replace copies and removes the file is wasteful when the source and destination are on the same volume.

gpakosz commented 8 years ago

Are there expectations that #optimize_image returns a wrapped temp file? If so, shouldn't it return one no matter what? That brings us back to generating a temp file name without opening a file handle.

toy commented 8 years ago

The only place in image_optim expecting a temp file is #optimize_image!, as it removes the path returned by #optimize_image, but there can be a problem in external usage of #optimize_image. However, always copying to a temp file is unneeded work for methods like #optimize_image_data, which just reads data from the file (GC takes care of removing the temp file). Besides returning a special class with a custom #replace (which would ensure only proper internal use), I've thought about returning a hardlink or a symlink to the cache path, or some more complicated solutions. What do you think about always returning a temp file, but adding an option like allow_cache_path to #optimize_image and Cache#fetch to return a direct cache path? Or is that exposing too much internal implementation?

gpakosz commented 8 years ago

The cache directory will be shared by multiple parallel optimisations, it can also be shared by multiple processes doing optimisation, so if an identical image is optimised in parallel, tools will be wasting cycles as they will overwrite the cache file with same result.

How common is this situation? While it can theoretically happen, I have a hard time seeing it as anything but a degraded case. So it feels like writing more complex code to optimize for a case that should not happen in real life.

I was about to go with subclassing ImagePath::Optimized, but I find ImageOptim::ImagePath::Optimized being a proxy for ImageOptim::ImagePath a bit convoluted.

gpakosz commented 8 years ago

About always returning a temp file: I think I prefer keeping CachePath with a non-destructive #replace method. Client code that scavenges the cache by moving files away will have to adapt.
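The non-destructive #replace mentioned above could look like this minimal sketch (the class body is illustrative; image_optim's actual CachePath wraps its own path type rather than Pathname): replace copies the cached file over the destination instead of moving it, so the cache entry survives optimize_image!.

```ruby
require 'fileutils'
require 'pathname'

# Hypothetical cache path class: like a regular path, but replace leaves
# the cached file in place instead of consuming it.
class CachePath < Pathname
  # Copy self over destination; the cache entry is preserved.
  def replace(dst)
    FileUtils.copy_file(to_s, dst.to_s)
    dst
  end
end
```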

gpakosz commented 8 years ago

Alright, I implemented what we discussed and rebased everything onto the master branch.

I also played with the image_optim binary:

toy commented 8 years ago

I agree, optimising identical images in parallel probably happens rarely, and it can be handled much later if there is ever a need. But then I urge you to use a temp file path for tmp in #fetch. Even if optimized.basename should be the name of a temp file, creating a temp file is not an expensive operation compared to the optimisation itself.

gpakosz commented 8 years ago

But then I urge you to use a temp file path for tmp in #fetch. Even if optimized.basename should be the name of a temp file, creating a temp file is not an expensive operation compared to the optimisation itself.

I don't understand how taking optimized.basename could actually clash with anything; care to elaborate?

toy commented 8 years ago

The important thing about a temp file is that the underlying calls ensure it can't clash. A different implementation of Tmpname, or a different usage of caching, could cause weird bugs.

gpakosz commented 8 years ago

OK, this batch of commits should give you a better impression. Last time I ran my tests before rebasing, and the rebase broke everything :/

gpakosz commented 8 years ago

As mentioned in the comment above, I'm now checking whether the cached file is empty to decide whether it corresponds to an already-optimized file. This also solves the problem of moving the cache directory.

I hesitated between that and hashing the content of /dev/null, which produces da39a3ee5e6b4b0d3255bfef95601890afd80709 and could serve as a marker file name for such entries, but getting rid of symlinks also makes it work on Windows.

toy commented 8 years ago

Do you mean that you wanted to create a symlink to da39a3ee5e6b4b0d3255bfef95601890afd80709? That is the SHA1 of an empty string, as reading /dev/null yields no data. Currently Windows is not supported by image_optim, so that should not be the deciding factor, but I'm also not sure about using symlinks, and maybe it is not hard to add Windows support, so better not to make it harder.
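The digest in question can be checked directly: hashing no bytes at all (which is what reading /dev/null gives) equals hashing the empty string.

```ruby
require 'digest'

# SHA-1 of the empty string, the digest discussed above.
empty_sha1 = Digest::SHA1.hexdigest('')
# "da39a3ee5e6b4b0d3255bfef95601890afd80709"
```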

Overall there should be no way for an empty file to appear when creating an entry for a successful optimisation, so using empty files for unoptimisable images should not be a problem. I had ideas like adding an extension to those files or putting them inside a sub-folder, but that would bring more code and not much gain.

gpakosz commented 8 years ago

Yes, I hesitated over creating a symlink to /path/to/cache_dir/da/39a3ee5e6b4b0d3255bfef95601890afd80709, but in fact that's not needed. As you noticed, the only situation where image_optim would produce zero-sized files in the cache is when caching already-optimized images.
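The zero-size marker convention settled on above can be sketched like this (names are illustrative): an empty cache entry means the image was already optimized, so the workers are skipped but optimize_image still returns nil.

```ruby
# Hypothetical cache lookup outcome: distinguish a miss, an empty marker
# ("this image was already optimal"), and a real cached result.
def cached_result(cached_path)
  return :miss unless File.exist?(cached_path)          # not cached yet
  return :already_optimized if File.zero?(cached_path)  # empty marker file
  :optimized                                            # cached optimized data
end
```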

toy commented 8 years ago

Please ignore AppVeyor failing, I will try to test how far is image_optim from working on windows.

gpakosz commented 8 years ago

Yep will do. I noticed it's not configured yet.

toy commented 8 years ago

A few last points about tests:

gpakosz commented 8 years ago

I'll look into it. I don't know RSpec well, though, so I'm not sure what shared example groups are or how to use them.

toy commented 8 years ago

Shared examples, in short:

shared_examples "does something" do
  it … do
  end
end

describe … do
  let …
  before …
  it_behaves_like "does something"
end
gpakosz commented 8 years ago

Yep I finally found my way through it this afternoon :)

toy commented 8 years ago

Hi Gregory, can you please rebase the branch on master, squash or split it into one or more atomic/logical commits, and check if it works on AppVeyor?

toy commented 8 years ago

I'll extract capability checkers for easy reuse, so failing Windows checks can be ignored.

toy commented 8 years ago

Have a look at CapabilityCheckHelpers, any_file_modes_allowed? and inodes_supported? in path_spec.rb

gpakosz commented 8 years ago

I updated cache_path_spec.rb from path_spec.rb.

gpakosz commented 8 years ago

What do you prefer: keeping the 10 commits from the cache branch, or squashing everything into a single commit?

For the record, tests and RuboCop are green in each of the 10 commits. I'll let you decide how you'd prefer the history to look.

toy commented 8 years ago

Those are development commits, so better one commit, unless you can and would like to split them into several logical, separate ones. Also, please add a link to this PR to the changelog entry.

gpakosz commented 8 years ago

Is it important to link the PR? I linked #83.

By the way, why are you using [#XX](https://github.com/toy/image_optim/issues/XX) in CHANGELOG.markdown? Using #XX is enough; GitHub automatically creates links.

toy commented 8 years ago

Not that important anymore, as you've just linked them by mentioning ;)

I've been meaning to check that for some time, and thank you for pushing me to do it :). Unfortunately it works only in comments/descriptions and some other places, but not in markdown files: https://github.com/toy/image_optim/blob/a4c2767f7b088f9ca091fd33ba04f9e772f805a7/CHANGELOG.markdown https://github.com/toy/image_optim/blob/62c85009e3eea006ad47bd5cf0d032d3215ece5f/CHANGELOG.markdown https://github.com/toy/image_optim/blob/657ed3f1d634608a4b02ef54eeaa50f92b029eb0/CHANGELOG.markdown

gpakosz commented 8 years ago

Oh indeed you're right

toy commented 8 years ago

Thank you for finishing this long PR; I'll release a new version soon.

gpakosz commented 8 years ago

You're welcome. Thanks for the review and the suggestions. That was nice.