Closed gpakosz closed 8 years ago
Hi Gregory, sorry for the wait and thank you for the contribution! I've looked through the commit, but did not yet have time to check it properly; I'll try to find time for that soon.
Hi, I rebased the branch onto the master branch.
Hi Gregory, I am sorry for such a long wait.
A few things to discuss:

- I think it is better to create one cache file and not 2 or 3, otherwise it is not atomic, there are too many files, and when running with different options the cache will be overwritten.
- The name can be combined or concatenated from the options and binaries digests.
- Also I think it is better to use sub-directories, as in `.git/objects`, instead of putting everything in one directory.
I followed every single remark you made except using `FSPath`, because it didn't feel like an improvement to use it just to join paths, and I need `FileUtils.copy_file` anyway.
By now, files are stored in `#{@cache_dir}/#{digest[0..1]}/#{digest[2..-1]}`, just like Git. Also, as suggested, there is now a single cached file per image, whose digest is computed by hashing the original file content, then updating the hash with the options and worker binaries.
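The layout described above can be sketched like this (a minimal illustration with invented helper names, not the PR's actual code):

```ruby
require 'digest/sha1'

# Compute a cache key for an image: hash the original file content,
# then mix in the serialized options and the digests of the worker
# binaries, so a change in either invalidates the cache entry.
def cache_digest(original_path, options_string, worker_digests)
  digest = Digest::SHA1.file(original_path)
  digest.update(options_string)
  worker_digests.each { |d| digest.update(d) }
  digest.hexdigest
end

# Store entries git-style: first two hex chars as a sub-directory,
# the rest as the file name, avoiding one huge flat directory.
def cache_path(cache_dir, digest)
  File.join(cache_dir, digest[0..1], digest[2..-1])
end
```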
I made a separate commit. Once reviewed I suggest I squash the two commits before force pushing the branch one last time.
Well, CI has errors unrelated to my changes, can you have a look?

```
Installing rack 1.6.4
An error occurred while installing mime-types-data (3.2016.0221), and Bundler
```
Hi Ivan, when do you think we can roll another round of review for this PR?
Hi Gregory, thanks for poking me about this. I've already looked at it and checked how it works, but did not manage to find time to write everything up. Points to discuss:

- `write` will return if `optimized` is not present. A simple solution would be to just write the original file to cache, but that looks like wasting disk space. Two other solutions, if using the same path, are: to create an empty file (then atomic creation of the cache entry is a must) or to create a symlink to self. What do you think?
- Why `Marshal`, as strings can be just joined together?
- Use

  ```ruby
  cached = @cache_dir / digest(original)
  return unless cached.exist?
  ```

  instead of

  ```ruby
  cached = File.join(@cache_dir, digest(original))
  return unless File.exist?(cached)
  ```

- Also `FileUtils.copy_file` is not atomic. A method similar to `ImagePath#replace`, which would create a temporary file in the cache directory, copy the optimized file over it and rename it to the cache path, would make it atomic.
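The atomic write described above can be sketched as follows (`write_atomically` and its arguments are made-up names for illustration, not the method that ended up in the PR):

```ruby
require 'fileutils'
require 'tempfile'

# Atomically install `src` at `dst`: copy into a temporary file in the
# destination directory, then rename. rename(2) is atomic on POSIX when
# both paths are on the same filesystem, so readers never observe a
# partially written cache entry.
def write_atomically(src, dst)
  FileUtils.mkpath(File.dirname(dst))
  tmp = Tempfile.create('cache', File.dirname(dst))
  begin
    tmp.close
    FileUtils.copy_file(src, tmp.path)
    File.rename(tmp.path, dst)
  rescue
    # On failure, clean up the temporary file and re-raise.
    File.unlink(tmp.path) if File.exist?(tmp.path)
    raise
  end
end
```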
Hi Ivan,
Here are my answers:
- `Marshal` seems like a robust way of dumping a state to a string. Again, are you concerned about the performance? In that case `Marshal` will be faster than manually flattening hashes/arrays and joining strings.
- `FSPath`: I implemented the changes discussed so far.
Note that the symlink only serves the purpose of avoiding going through the workers again; `optimize_image` will still return `nil`. This is because I don't think it's a good idea to have the behavior of `optimize_image` depend on whether caching is enabled or not.
Great! I'll do my best not to delay review. And sure, the cache should not change behaviour.
Please also add descriptions of the two new options to the Options section of README.markdown and to the `OptionParser`, so they are exposed on the command line.
I implemented what's been discussed except `FSPath.temp_file_path`, because it seems wasteful to open file handles just for the sake of having a unique name. I forgot to document the options properly; I'll do that later, along with adding a few tests to make sure `cache_dir` is a directory if it already exists.
Updated.
I also reworked the tests a bit. Since the cache file digest is now computed by updating the file's SHA1 with the options and, optionally, the worker digests, there's no real distinction between "cached file doesn't exist" and "options and/or worker digests don't match".
About the temporary file: it is an extra file handle (even worse on old Ruby versions, which created a locking directory), but the handle can also be used with `FSPath.temp_file` to write to, instead of using `FileUtils.mv`, which will open a handle anyway.

But I also realised that optimising identical images in parallel will do wasteful duplicate work, so an even better solution would be to have a static tmp file name (like `cached` with a `.tmp` extension) and exclusively lock that file, check that it still has zero size, run the block, write the data, and rename it to `cached`.
I don't really understand what you dislike about the current `tmp_name` private method followed by `FileUtils.mv`, besides the aesthetics.

Performance-wise, `FileUtils.mv` defers to `File.rename` if both `src` and `dest` are on the same filesystem. Otherwise it copies the file, but if that ever becomes a concern, the copy can be kept low-level, à la Java NIO.
Having a static `@cache_dir / 'cached.tmp'` temp file and locking with a mutex is really a waste imho: there's no reason to impose such lock contention.

When caching is enabled, it seems natural to move the optimized file to `@cache_dir`, and in fact, now that I think about it, the obvious choice is to use the basename of the optimized file, since it's already a unique temporary file name:

```ruby
tmp = @cache_dir / File.basename(optimized)
FileUtils.mv(optimized, tmp, :force => true)
File.rename(tmp, cached)
```
When checking the complete process, I've found a problem related to the cache: without the cache, the `optimize_image` method returns a wrapped temp file path, which in `optimize_image!` replaces the original file; the temp file is then removed, as it is not needed anymore. With the cache, the returned file is the path in the cache, so the cached file itself will be removed. Just try to use the `image_optim` binary on a folder with the cache enabled and check the cache directory. I can suggest creating a class for cache paths which overrides the `replace` method to do the same, but without removing the file from the cache.
About locking, I may have been unclear: the cache directory will be shared by multiple parallel optimisations, and it can also be shared by multiple processes doing optimisation, so if an identical image is optimised in parallel, tools will waste cycles as they overwrite the cache file with the same result.

That is why I proposed locking (not a mutex, but `flock`), which also makes creating a unique name unnecessary. `flock` is not slow (tens of thousands of open-and-flock calls per second), especially compared to the time it takes to optimise an image.

So use something like `tmp = @cache_dir / "#{digest}.tmp"`, open it with `File::CREAT | File::WRONLY` and `flock(File::LOCK_EX)`, repeat the `file?` and `symlink?` checks, as the cache file may have already been written, and only then yield, and eventually move the result and rename it, or create a symlink.
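A hedged sketch of the `flock`-based scheme proposed above (`fetch_with_lock` is an invented name, and cleanup of the `.tmp` sidecar is simplified compared to what production code would need):

```ruby
require 'fileutils'

# Serialise writers of one cache entry via flock on a "<digest>.tmp"
# sidecar file: the first process to take the lock runs the expensive
# optimisation block; any other process blocks on flock, then finds
# the entry already present and skips the work.
def fetch_with_lock(cached)
  FileUtils.mkpath(File.dirname(cached))
  tmp = "#{cached}.tmp"
  File.open(tmp, File::CREAT | File::WRONLY) do |f|
    f.flock(File::LOCK_EX)
    # Re-check under the lock: another process may have finished first.
    unless File.file?(cached) || File.symlink?(cached)
      yield tmp                 # the block writes optimised data to tmp
      File.rename(tmp, cached)  # atomic publish while holding the lock
    end
  end
  File.unlink(tmp) if File.exist?(tmp) # drop leftover empty sidecar
  cached
end
```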
`FileUtils.mv` is indeed good performance-wise; I was already thinking that the way `ImagePath#replace` copies and removes the file is wasteful when source and destination are on the same volume.

Are there expectations about `#optimize_image` returning a wrapped temp file? If so, shouldn't it return one no matter what? Which brings us back to generating a temp file name without opening a file handle.
The only place in image_optim expecting a temp file is `#optimize_image!`, as it removes the path returned by `#optimize_image`, but there can be a problem in external usage of `#optimize_image`. Though always copying to a temp file is unneeded work for methods like `#optimize_image_data`, as it just reads data from the file (GC takes care of removing the temp file).

Besides returning a special class with a custom `#replace` (which would ensure only proper internal use), I've thought about returning a hardlink or a symlink to the cache path, or some more complicated solutions. What do you think about always returning a temp file, but adding an option like `allow_cache_path` to `#optimize_image` and `Cache#fetch` to return a direct cache path, or is that exposing too much internal implementation?
> The cache directory will be shared by multiple parallel optimisations, it can also be shared by multiple processes doing optimisation, so if an identical image is optimised in parallel, tools will be wasting cycles as they will overwrite the cache file with same result.
How common is this situation? While it can theoretically happen, I have difficulty seeing it as anything but a degenerate case. So it feels like writing more complex code to optimize for a case that should not happen in real life.
I was about to go with subclassing `ImagePath::Optimized`, but I find `ImageOptim::ImagePath::Optimized` being a proxy for `ImageOptim::ImagePath` a bit convoluted.

About always returning a temp file, I think I prefer keeping the `CachePath` with a non-destructive `#replace` method. Client code that scavenges the cache by moving files away will have to adapt.
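A minimal sketch of that idea (the classes and the `replace_file` method name here are simplified stand-ins, not image_optim's actual `ImagePath`/`CachePath` API):

```ruby
require 'fileutils'

# Stand-in for a plain optimised path: replacing the destination
# *moves* the temporary optimised file, consuming it.
class Path < String
  def replace_file(dst)
    FileUtils.mv(self, dst)
  end
end

# A path living inside the cache directory: replacing the destination
# *copies* instead of moving, so the cache entry survives being used.
class CachePath < Path
  def replace_file(dst)
    FileUtils.copy_file(self, dst)
  end
end
```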
Alright, I implemented what we discussed and rebased everything onto the master branch. I also played with the `image_optim` binary.
I agree, optimising identical images in parallel probably happens rarely, and handling it can be done much later if the need arises. But then I urge you to use a temp file path for `tmp` in `#fetch`. Even if `optimized.basename` should be the name of a temp file, creating a temp file is not an expensive operation compared to the optimisation itself.
> But then I urge you to use a temp file path for tmp in #fetch. Even if optimized.basename should be a name of a temp file, creating a temp file is not an expensive operation compared to optimisation itself.
I don't understand how taking `optimized.basename` could clash with anything, in fact; care to elaborate?

The important thing about a temp file is that the underlying calls ensure it can't clash. A different implementation of Tmpname, or a different usage of caching, could cause weird bugs.
Ok, this batch of commits should give you a better impression. Last time I did my tests before rebasing and broke everything :/

As mentioned in the comment above, I'm now checking whether the cached file size is zero to decide whether it corresponds to an already optimized file. This solves the problem of moving the cache directory. I hesitated with hashing the content of `/dev/null`, which produces `da39a3ee5e6b4b0d3255bfef95601890afd80709` and could serve as a marker in a cached file name, but getting rid of symlinks makes it work on Windows.
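The empty-file marker check described above can be sketched like this (a hypothetical `cache_state` helper, not the PR's actual method):

```ruby
# Distinguish the three cache states: no entry (run the workers),
# empty entry (workers already failed to shrink this image, so skip
# them and report "nothing to optimise"), and a non-empty entry
# (the optimised image itself).
def cache_state(cached)
  return :miss unless File.exist?(cached)
  File.size(cached).zero? ? :known_unoptimisable : :hit
end
```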
Do you mean that you wanted to create a symlink to `da39a3ee5e6b4b0d3255bfef95601890afd80709`? That is the SHA1 of an empty string, as reading `/dev/null` yields nothing.

Currently Windows is not supported by image_optim, so this should not be the deciding reason; but I'm also not sure about using symlinks, and maybe adding Windows support is not hard, so better not to make it harder.

Overall, there should be no way for an empty file to appear when creating an entry for a successful optimisation, so using empty files for unoptimisable images should not be a problem. I had ideas like adding an extension to those files or putting them inside a sub-folder, but that would bring more code and not much gain.
Yes, I hesitated with creating a symlink to `/path/to/cache_dir/da/39a3ee5e6b4b0d3255bfef95601890afd80709`, but in fact that's not needed. As you noticed, the only situation where image_optim would produce zero-sized files in the cache is when caching already optimized images.

Please ignore AppVeyor failing; I will try to test how far image_optim is from working on Windows.
Yep will do. I noticed it's not configured yet.
A few last points about tests:

- `CachePath#replace` (you can adapt the one for `Path#replace`)
- `CachePath` or `nil` can be returned, and never just `Path`
I'll look into it. I don't know RSpec well though, so I'm not sure what shared example groups are or how to use them.
Shared examples in short:

```ruby
shared_examples "does something" do
  it … do
  end
end

describe … do
  let …
  before …
  it_behaves_like "does something"
end
```
Yep I finally found my way through it this afternoon :)
Hi Gregory, can you please rebase the branch on master, squash/split it into one or more atomic/logical commits, and check if it works on AppVeyor?
I'll extract the capability checkers for easy reuse, so the failing Windows checks can be ignored. Have a look at `CapabilityCheckHelpers`, `any_file_modes_allowed?` and `inodes_supported?` in path_spec.rb.
I updated cache_path_spec.rb from path_spec.rb.
What do you prefer: keeping the 10 commits from the `cache` branch, or squashing everything into a single commit? For the record, tests and RuboCop are green on every one of the 10 commits. I'll let you decide how you prefer the history to look.
Those are development commits, so better one commit, unless you can and would like to split them into several logical, separate ones. Also, please add the link to this PR in the changelog entry.
Is it important to link the PR? I linked #83.

By the way, why are you using `[#XX](https://github.com/toy/image_optim/issues/XX)` in CHANGELOG.markdown? Using `#XX` is enough; GitHub automatically creates links.
Not as important anymore, as you've just linked them by mentioning ;)

I've wanted to check that for some time, and thank you for pushing me to do it :). But unfortunately it works only in comments/descriptions and some other places, not in markdown files: https://github.com/toy/image_optim/blob/a4c2767f7b088f9ca091fd33ba04f9e772f805a7/CHANGELOG.markdown https://github.com/toy/image_optim/blob/62c85009e3eea006ad47bd5cf0d032d3215ece5f/CHANGELOG.markdown https://github.com/toy/image_optim/blob/657ed3f1d634608a4b02ef54eeaa50f92b029eb0/CHANGELOG.markdown
Oh indeed you're right
Thank you for finishing this long PR, I'll release a new version soon
You're welcome. Thanks for the review and the suggestions. That was nice.
:boom:
implemented #83