pyvips results in pillow-perf benchmarks

homm commented 7 years ago

Hi, @jcupitt!

I'd like to ask you to review pyvips results in pillow-perf benchmarks.

Originally in this benchmark, there was one test for one operation. As VIPS works differently, I've added "Full operations cycle" competition with several operations in each test (from two to five). To see index and numbers, hover the graphs. Please note, all tests are run in a single thread, the explanation of this is given at the start of the page.

Results browser

At first, I'd like you to check the benchmark code. Is it correct and is it equivalent to Pillow code?

Benchmarks repository and readme Pyvips code Pillow code

If you uncomment the last line, you could compare results visually. They are almost identical except small shift for VIPS, I already seen the issue about this.

There are results for three systems: Macbook with Core i5, EC2 c3.large and EC2 c4.large. On Macbook I have libvips 8.5.7 from sources with the following configuration:

$ CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 ./configure --without-python

``` checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for a thread-safe mkdir -p... /bin/mkdir -p checking for gawk... no checking for mawk... mawk checking whether make sets $(MAKE)... yes checking whether make supports nested variables... yes checking for pkg-config... /usr/bin/pkg-config checking pkg-config is at least version 0.9.0... yes checking for gobject-introspection... no checking build system type... x86_64-pc-linux-gnu checking host system type... x86_64-pc-linux-gnu checking for needs -lstdc++... yes checking for native Win32... no checking for binary open needed... no checking for Mac OS X... no checking for style of include used by make... GNU checking for gcc... gcc checking whether the C compiler works... yes checking for C compiler default output file name... a.out checking for suffix of executables... checking whether we are cross compiling... no checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ISO C89... none needed checking whether gcc understands -c and -o together... yes checking dependency style of gcc... gcc3 checking for special C compiler options needed for large files... no checking for _FILE_OFFSET_BITS value needed for large files... no checking for gawk... (cached) mawk checking for gcc... (cached) gcc checking whether we are using the GNU C compiler... (cached) yes checking whether gcc accepts -g... (cached) yes checking for gcc option to accept ISO C89... (cached) none needed checking whether gcc understands -c and -o together... (cached) yes checking dependency style of gcc... (cached) gcc3 checking for gcc option to accept ISO C99... none needed checking for gcc option to accept ISO Standard C... (cached) none needed checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... gcc3 checking for an ANSI C-conforming const... yes checking for C/C++ restrict keyword... __restrict checking for ranlib... ranlib checking whether ln -s works... yes checking if malloc debugging is wanted... no checking how to run the C preprocessor... gcc -E checking for grep that handles long lines and -e... /bin/grep checking for egrep... /bin/grep -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking locale.h usability... yes checking locale.h presence... yes checking for locale.h... yes checking for LC_MESSAGES... yes checking libintl.h usability... yes checking libintl.h presence... yes checking for libintl.h... yes checking for ngettext in libc... yes checking for dgettext in libc... yes checking for bind_textdomain_codeset... yes checking for msgfmt... /usr/bin/msgfmt checking for dcgettext... yes checking if msgfmt accepts -c... yes checking for gmsgfmt... /usr/bin/msgfmt checking for xgettext... /usr/bin/xgettext checking for catalogs to be installed... en_GB de checking for dirent.h that defines DIR... yes checking for library containing opendir... none required checking for ANSI C header files... (cached) yes checking errno.h usability... yes checking errno.h presence... yes checking for errno.h... yes checking math.h usability... yes checking math.h presence... yes checking for math.h... yes checking fcntl.h usability... yes checking fcntl.h presence... yes checking for fcntl.h... yes checking limits.h usability... yes checking limits.h presence... yes checking for limits.h... yes checking for stdlib.h... (cached) yes checking for string.h... (cached) yes checking sys/file.h usability... yes checking sys/file.h presence... yes checking for sys/file.h... yes checking sys/ioctl.h usability... yes checking sys/ioctl.h presence... yes checking for sys/ioctl.h... yes checking sys/param.h usability... yes checking sys/param.h presence... yes checking for sys/param.h... yes checking sys/time.h usability... yes checking sys/time.h presence... yes checking for sys/time.h... yes checking sys/mman.h usability... yes checking sys/mman.h presence... yes checking for sys/mman.h... yes checking for sys/types.h... (cached) yes checking for sys/stat.h... (cached) yes checking for unistd.h... (cached) yes checking io.h usability... no checking io.h presence... no checking for io.h... no checking direct.h usability... no checking direct.h presence... no checking for direct.h... no checking windows.h usability... no checking windows.h presence... no checking for windows.h... no checking for dllwrap... no checking for dlltool... dlltool checking for objdump... objdump checking for ranlib... (cached) ranlib checking for strip... strip checking for ar... ar checking for as... as checking for ld... ld checking how to print strings... printf checking for a sed that does not truncate output... /bin/sed checking for fgrep... /bin/grep -F checking for ld used by gcc... ld checking if the linker (ld) is GNU ld... yes checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B checking the name lister (/usr/bin/nm -B) interface... BSD nm checking the maximum length of command line arguments... 1572864 checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop checking for ld option to reload object files... -r checking for objdump... (cached) objdump checking how to recognize dependent libraries... pass_all checking for dlltool... (cached) dlltool checking how to associate runtime and link libraries... printf %s\n checking for archiver @FILE support... @ checking for strip... (cached) strip checking for ranlib... (cached) ranlib checking command to parse /usr/bin/nm -B output from gcc object... ok checking for sysroot... no checking for a working dd... /bin/dd checking how to truncate binary pipes... /bin/dd bs=4096 count=1 checking for mt... mt checking if mt is a manifest tool... no checking for dlfcn.h... yes checking for objdir... .libs checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC -DPIC checking if gcc PIC flag -fPIC -DPIC works... yes checking if gcc static flag -static works... yes checking if gcc supports -c -o file.o... yes checking if gcc supports -c -o file.o... (cached) yes checking whether the gcc linker (ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking how to run the C++ preprocessor... g++ -E checking for ld used by g++... ld -m elf_x86_64 checking if the linker (ld -m elf_x86_64) is GNU ld... yes checking whether the g++ linker (ld -m elf_x86_64) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC -DPIC checking if g++ PIC flag -fPIC -DPIC works... yes checking if g++ static flag -static works... yes checking if g++ supports -c -o file.o... yes checking if g++ supports -c -o file.o... (cached) yes checking whether the g++ linker (ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... (cached) GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking for an ANSI C-conforming const... (cached) yes checking for mode_t... yes checking for off_t... yes checking for size_t... yes checking for working memcmp... yes checking for stdlib.h... (cached) yes checking for unistd.h... (cached) yes checking for sys/param.h... (cached) yes checking for getpagesize... yes checking for working mmap... yes checking for vprintf... yes checking for _doprnt... no checking for getcwd... yes checking for gettimeofday... yes checking for getwd... yes checking for memset... yes checking for munmap... yes checking for putenv... yes checking for realpath... yes checking for strcasecmp... yes checking for strchr... yes checking for strcspn... yes checking for strdup... yes checking for strerror... yes checking for strrchr... yes checking for strspn... yes checking for vsnprintf... yes checking for realpath... (cached) yes checking for mkstemp... yes checking for mktemp... yes checking for random... yes checking for rand... yes checking for sysconf... yes checking for atexit... yes checking for cbrt in -lm... yes checking for hypot in -lm... yes checking for atan2 in -lm... yes checking for REQUIRED... yes checking for CONTEXT_GET_HELP... yes checking for MONOTONIC... yes checking for THREADS... yes checking for TYPE_INIT... no checking for TYPE_INIT... yes checking for gtk-doc... no configure: WARNING: You will not be able to create source packages with 'make dist' because gtk-doc >= 1.14 is not found. checking for gtkdoc-check... no checking for gtkdoc-check... no checking for gtkdoc-rebase... no checking for gtkdoc-mkpdf... no checking whether to build gtk-doc documentation... no checking for GTKDOC_DEPS... yes checking for XML_ParserCreate in -lexpat... yes checking expat.h usability... yes checking expat.h presence... yes checking for expat.h... yes checking for GSF... yes checking for GSF_ZIP64... yes checking for FFTW... yes checking for MAGICK_WAND... yes checking for MAGICK... no checking for MAGICK... yes checking for SetImageOption... yes checking for MagickCoreGenesis... yes checking for ResetImagePropertyIterator... yes checking for ResetImageAttributeIterator... yes checking for GetVirtualPixels... yes checking for struct _ImageInfo.number_scenes... yes checking for ORC... yes checking for orc_program_get_error... yes checking for LCMS... yes checking for OPENEXR... yes checking for POPPLER... yes checking for RSVG... yes checking for X... libraries , headers checking for gethostbyname... yes checking for connect... yes checking for remove... yes checking for shmat... yes checking for IceConnectionNumber in -lICE... yes checking for ZLIB... yes checking for OPENSLIDE... yes checking for MATIO... yes checking for CFITSIO... yes checking for LIBWEBP... yes checking for LIBWEBPMUX... no configure: WARNING: libwebpmux not found; disabling webp metadata support checking for PANGOFT2... yes configure: WARNING: pygobject-3.0 not found; disabling vips8 python support checking for TIFF... yes checking for giflib... libraries -lgif, headers in default path checking for PNG... yes checking for JPEG... libraries -ljpeg, headers in default path checking for jpeg_c_bool_param_supported... no checking for EXIF... yes checking exif-data.h usability... yes checking exif-data.h presence... yes checking for exif-data.h... yes checking that generated files are newer than configure... done configure: creating ./config.status config.status: creating vips.pc config.status: creating vipsCC.pc config.status: creating vips-cpp.pc config.status: creating Makefile config.status: creating libvips/include/vips/version.h config.status: creating libvips/include/Makefile config.status: creating libvips/include/vips/Makefile config.status: creating libvips/Makefile config.status: creating libvips/arithmetic/Makefile config.status: creating libvips/colour/Makefile config.status: creating libvips/conversion/Makefile config.status: creating libvips/convolution/Makefile config.status: creating libvips/deprecated/Makefile config.status: creating libvips/foreign/Makefile config.status: creating libvips/freqfilt/Makefile config.status: creating libvips/histogram/Makefile config.status: creating libvips/draw/Makefile config.status: creating libvips/iofuncs/Makefile config.status: creating libvips/morphology/Makefile config.status: creating libvips/mosaicing/Makefile config.status: creating libvips/create/Makefile config.status: creating libvips/resample/Makefile config.status: creating libvips/video/Makefile config.status: creating libvipsCC/include/Makefile config.status: creating libvipsCC/include/vips/Makefile config.status: creating libvipsCC/Makefile config.status: creating cplusplus/include/Makefile config.status: creating cplusplus/include/vips/Makefile config.status: creating cplusplus/Makefile config.status: creating tools/Makefile config.status: creating tools/batch_crop config.status: creating tools/batch_image_convert config.status: creating tools/batch_rubber_sheet config.status: creating tools/light_correct config.status: creating tools/shrink_width config.status: creating python/Makefile config.status: creating python/packages/Makefile config.status: creating python/packages/gi/Makefile config.status: creating python/packages/gi/overrides/Makefile config.status: creating test/Makefile config.status: creating test/variables.sh config.status: creating swig/Makefile config.status: creating swig/vipsCC/Makefile config.status: creating man/Makefile config.status: creating doc/Makefile config.status: creating doc/libvips-docs.xml config.status: creating po/Makefile.in config.status: creating config.h config.status: config.h is unchanged config.status: executing depfiles commands config.status: executing default-1 commands config.status: executing libtool commands * build options native win32: no native OS X: no open files in binary mode: no enable debug: no build deprecated components: yes build docs with gtkdoc: no gobject introspection: no build vips7 Python binding: no install vips8 Python overrides: no (requires pygobject-3.13.0 or later) build radiance support: yes build analyze support: yes build PPM support: yes * optional dependencies use fftw3 for FFT: yes Magick package: MagickCore file import with libMagick: yes (magick6) accelerate loops with orc: yes (requires orc-0.4.11 or later) ICC profile support with lcms: yes (lcms2) file import with OpenEXR: yes file import with OpenSlide: yes (requires openslide-3.3.0 or later) file import with matio: yes PDF import with poppler-glib: yes (requires poppler-glib 0.16.0 or later) SVG import with librsvg-2.0: yes (requires librsvg-2.0 2.34.0 or later) zlib: yes file import with cfitsio: yes file import/export with libwebp: yes (requires libwebp-0.1.3 or later) support webp metadata: no (requires libwebpmux-0.5 or later) text rendering with pangoft2: yes file import/export with libpng: yes (pkg-config libpng >= 1.2.9) (requires libpng-1.2.9 or later) file import/export with libtiff: yes (pkg-config libtiff-4) file import/export with giflib: yes (found by search) file import/export with libjpeg: yes image pyramid export: yes (requires libgsf-1 1.14.26 or later) use libexif to load/save JPEG metadata: yes ```

And on ec2 instances, I have VIPS 8.2 from Ubuntu package.

The strange thing is VIPS from sources works signifficantly slower than from the repo. This is VIPS 8.5 from sources compared with VIPS 8.2 from a package on the same computer:

Vips full cycle 2560×1600 RGB image
    Load+save           0.06170 s    66.39 Mpx/s
    +transpose          0.06942 s    59.00 Mpx/s
    +resize             0.27630 s    14.82 Mpx/s
    +blur               0.40508 s    10.11 Mpx/s

Vips full cycle 2560×1600 RGB image
    Load+save           0.06210 s    65.96 Mpx/s
    +transpose          0.07080 s    57.85 Mpx/s
    +resize             0.14978 s    27.35 Mpx/s
    +blur               0.27115 s    15.11 Mpx/s

I also tried to build VIPS 8.2 from sources, it has the same speed as 8.5 from sources. Do you have explanation for this?

jcupitt commented 7 years ago

Hello @homm! Nice! I'll try and run these tests on my laptop.

I noticed one small thing on the results browser: the hover box is clipped. I see:

so I can only see the first four items. I tried in Firefox and Chrome. All the charts seem to have this issue.

I'm not sure why the repo should be faster, perhaps different compiler settings? I'll investigate.

homm commented 7 years ago

I noticed one small thing on the results browser

Yes, this happens not in result browser, but in inline graphs in following sections. I'll try to fix.

jcupitt commented 7 years ago

I've had a look though the vips code, it's mostly fine, nice job.

I noticed of a couple of things:

In vips_load.py there are the lines:

im = Image.new_from_file(root('resources', self.filename))
im.write_to_memory()

This will do an extra copy operation. The steps internally are:

a. new_from_file will set up a delayed random-access load operation b. write_to_memory will ask for pixels, and this will trigger the JPEG load c. since this is random access to a small (relatively) image, the file will be loaded to a memory buffer ready for pixels to be served d. write_to_memory will now allocate a second memory buffer and write to that, copying all the pixels over

Instead, how about having:

im = Image.new_from_file(root('resources', self.filename))
im.getpoint(0, 0)

The getpoint will read pixel (0, 0). This will be enough to trigger the delayed load and pull the whole image into memory.

I'd do the same for SaveCase, though it won't make any difference to runtime, of course.

class SaveCase(BaseSaveCase):
    def create_test_data(self):
        im = Image.new_from_file(root('resources', self.filename))
        im.getpoint(0, 0)
        return [im]

On vips_full_cycle, I would use kernel, not interpolate for resize. The interpolate option is a left-over from the previous resize system.

                im = im.resize(0.4, kernel="cubic")

http://jcupitt.github.io/libvips/API/current/libvips-resample.html#vips-resize

Without a kernel option, it'll do the default lanczos3.

The min_ampl for gaussblur is very small. 0.01 means it'll generate the mask all the way out to the final 1%, which is just tiny. You can see the mask size like this:

$ export VIPS_INFO=1
$ vips gaussblur k2.jpg x.jpg 4 --min-ampl 0.01
VIPS-INFO: gaussblur mask width 25
VIPS-INFO: intize_to_fixed_point: too many underflows
VIPS-INFO: convi: using C path
VIPS-INFO: intize_to_fixed_point: too many underflows
VIPS-INFO: convi: using C path

So it's making a 25-point mask, then seeing so many underflows it's falling back to the C path. If you pick a larger number you get something closer to pillow, plus vips can use its vector mode.

I tried blurring a single 255 pixel to see the shape of the mask with min-ampl 0.3:

vips-point pil-point

And it looks a bit similar. It looks like pillow is a little asymmetric. Perhaps 0.3 is too high?

If I make all these changes then benchmark, I see:

8.4.5 from repo

Vips full cycle 2560×1600 RGB image
    Load+save           0.06268 s    65.35 Mpx/s
    +transpose          0.06520 s    62.82 Mpx/s
    +resize             0.12671 s    32.32 Mpx/s
    +blur               0.13739 s    29.81 Mpx/s

8.5 from source

Vips full cycle 2560×1600 RGB image
    Load+save           0.06329 s    64.71 Mpx/s
    +transpose          0.06449 s    63.52 Mpx/s
    +resize             0.13087 s    31.30 Mpx/s
    +blur               0.14103 s    29.04 Mpx/s

8.6 from source

Vips full cycle 2560×1600 RGB image
    Load+save           0.06619 s    61.88 Mpx/s
    +transpose          0.06721 s    60.94 Mpx/s
    +resize             0.06799 s    60.25 Mpx/s
    +blur               0.07813 s    52.43 Mpx/s

8.6 has some caching improvements which help resize in this case

current pillow-simd on this laptop

Full cycle 2560×1600 RGB image
    Load+save           0.06175 s    66.33 Mpx/s
    +transpose          0.07541 s    54.31 Mpx/s
    +resize             0.06005 s    68.21 Mpx/s
    +blur               0.09469 s    43.26 Mpx/s

homm commented 7 years ago

0.01 means it'll generate the mask all the way out to the final 1%, which is just tiny.

Yes, and this is the desired behavior: it should be as close as possible to ideal gaussian blur. I can see significant differences between the ideal and 0.1, so 0.3 is unacceptable. The maximum min_ampl value when I don't see any differences is '0.07'. From left to right: 0.001, 0.07, 0.1:

_out 3 vips

To be honest, I believe there is an error somewhere because the difference between '0.07' and '0.08' is too sharp.

I made corrections in this branch: https://github.com/python-pillow/pillow-perf/pull/66

homm commented 7 years ago

Sorry, reverted resize to interpolate because there is KeyError: 'kernel' on VIPS 8.2. Does VIPS 8.5 has backward compatibility with old interpolate argument?

jcupitt commented 7 years ago

It's backward-compatible, but not forwards, so old code works on new libvips, but new code can fail on old libvips (if that makes sense).

You'd need to add a version test to use a new feature, for example:

from .vips import Image, Interpolate, at_least_libvips

...

                if at_least_libvips(8, 3):
                    # resize with kernel added in 8.3
                    im = im.resize(0.4, kernel='cubic')
                else:
                    im = im.resize(0.4, interpolate=Interpolate.new('bicubic'))

jcupitt commented 7 years ago

OK, I'm happy with 0.07.

It's a shame the rot90 is in there: if you remove that, you can use libvips in streaming mode. If I comment out the rot and change the new_from_file to:

im = Image.new_from_file(root('resources', self.filename), access='sequential')

I see:

Vips full cycle 2560×1600 RGB image
    Load+save           0.04064 s   100.78 Mpx/s
    +transpose          0.03657 s   112.01 Mpx/s
    +resize             0.05128 s    79.87 Mpx/s
    +blur               0.06578 s    62.27 Mpx/s

homm commented 7 years ago

It's a shame the rot90 is in there

This is one of the common cases. Many JPEG files have an orientation tag in EXIF and this test is designed to emulate this situation. A real shame is there is still no autorotate in Pillow, by the way (

I see: Vips full cycle 2560×1600 RGB image Load+save 0.04064 s 100.78 Mpx/s

I believe this is a bug, VIPS with access='sequential' uses more than one core even if VIPS_CONCURRENCY=1 is set:

$ time ./run.py vips_full_cycle

Vips full cycle 2560×1600 RGB image
    Load+save           0.03131 s   130.80 Mpx/s

real    0m0.524s
user    0m0.740s
sys 0m0.016s

$ time taskset 0x1 ./run.py vips_full_cycle

Vips full cycle 2560×1600 RGB image
    Load+save           0.04763 s    86.00 Mpx/s

real    0m0.760s
user    0m0.696s
sys 0m0.032s

It's backward-compatible

But looks like interpolate=Interpolate.new('bicubic') doesn't affect filter in VIPS 8.5. It turns out this argument isn't backward-compatible.

You'd need to add a version test to use a new feature, for example

Fixed this and updated results: https://github.com/python-pillow/pillow-perf/pull/67

jcupitt commented 7 years ago

This is one of the common cases.

You'd normally have the rotate after the shrink though, wouldn't you? Otherwise you're rotating more data than you need to.

vipsthumbnail shrinks to a memory buffer using vips streams, then rotates to the output in a second pipeline.

I believe this is a bug, VIPS with access='sequential' uses more than one core even if VIPS_CONCURRENCY=1 is set:

libvips has double-buffered output plus a write-behind thread, so in streaming mode it's able to overlap JPEG decode and JPEG encode. Perhaps it is cheating.

homm commented 7 years ago

More info about slow source builds.

This one from source:

$ vips --version
vips-8.5.7-Tue Oct 10 23:00:33 MSK 2017

$ time perf record ./run.py vips_full_cycle -n 21

Vips full cycle 2560×1600 RGB image
    Load+save           0.06468 s    63.33 Mpx/s
    +transpose          0.07087 s    57.80 Mpx/s
    +resize             0.22533 s    18.18 Mpx/s
    +blur               0.31908 s    12.84 Mpx/s
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.347 MB perf.data (59275 samples) ]

real    0m14.656s
user    0m14.168s
sys 0m0.808s

This one from very similar computer but installed from the package:

$ vips --version
vips-8.2.2-Sat Jan 30 17:12:08 UTC 2016

$ time perf record ./run.py vips_full_cycle -n 21

Vips full cycle 2560×1600 RGB image
    Load+save           0.04792 s    85.47 Mpx/s
    +transpose          0.04944 s    82.85 Mpx/s
    +resize             0.12102 s    33.85 Mpx/s
    +blur               0.21658 s    18.91 Mpx/s
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.818 MB perf.data (44469 samples) ]

real    0m9.532s
user    0m9.796s
sys 0m0.252s

Sorry, there is no debug symbols in ubuntu packages, but it is clear, that the layout is completely different. As I understand, functions with _gen postfix are not SIMD-accelerated. How can I see a reason why SIMD-acceleration is failed?

I've tried to run with ORC_DEBUG=99 and log for the package versions is much much longer. But I haven't found anything useful in the log.

jcupitt commented 7 years ago

Huh that's interesting.

You can try VIPS_INFO with recent libvipses:

$ export VIPS_INFO=1
$ vipsthumbnail /data/john/pics/k2.jpg
VIPS-INFO: thumbnailing /data/john/pics/k2.jpg
VIPS-INFO: selected loader is VipsForeignLoadJpegFile
VIPS-INFO: input size is 1450 x 2048
VIPS-INFO: loading jpeg with factor 8 pre-shrink
VIPS-INFO: converting to processing space srgb
VIPS-INFO: residual reducev by 0.5
VIPS-INFO: reducev: 13 point mask
VIPS-INFO: reducev: using vector path
VIPS-INFO: reducev sequential line cache
VIPS-INFO: residual reduceh by 0.5
VIPS-INFO: reduceh: 13 point mask
VIPS-INFO: thumbnailing /data/john/pics/k2.jpg as /data/john/pics/tn_k2.jpg

You'd need to do something to make run.py show stdout.

homm commented 7 years ago

You'd need to do something to make run.py show stdout.

What exactly? I didn't intercept stdout in the benchmark

homm commented 7 years ago

You'd normally have the rotate after the shrink though, wouldn't you?

Most libraries and users rotate on image load or immediately after this. This is the most common scenario. To be fairer, I changed explicit rotate operation to autorotate flag: https://github.com/python-pillow/pillow-perf/pull/68. This helps a lot for the slow, not SIMD-optimized case, but doesn't make any difference for fast. By the way, autorotate + access='sequential' drops performance a lot.

Otherwise you're rotating more data than you need to.

I believe this is the case where VIPS could optimize things more than other libraries. As VIPS doesn't perform operations immediately, it could rearrange operations.

jcupitt commented 7 years ago

I didn't intercept stdout in the benchmark

Ooop, my fault, they were being blocked at the glib/pyvips boundary. Hang on, I'll fix it up.

It looks like the blur is falling off the vector path too, I'll see if I can fix that as well.

jcupitt commented 7 years ago

The libvips rotate on load isn't used much: normally you'd shrink first, then use a separate autorot operator near the end.

libvips is demand-driven, so it scans over the output image pulling pixels through the system. If there's a 90 degree rotate in the pipeline, the nice top-to-bottom scan becomes a horrible vertical scan and image streaming will break.

To fix this, you run the pipeline in two stages: first, you stream from the input through premultiply + shrink + unpremultiply + colour management to make a small memory image, then when that has completed, you run a second pipeline from the memory image through rotate to image output.

This way you can thumbnail huge images in only a little memory. Imagine trying to thumbnail a 300,000 x 300,000 pixel image if you did the rotate at the start.

homm commented 7 years ago

I understand this. Rotating before resizing leads to extra data manipulations for all libraries in this test suite, not only for VIPS. But I believe current results are more informative for libraries users because this is exactly how people usually work. one, two, three

Also, there is important note:

Method order is important when both rotating and extracting regions, for example rotate(x).extract(y) will produce a different result to extract(y).rotate(x).

So it much more simple to think in transposed picture coordinates, than translate coordinates provided by a user to "storage coordinates".

homm commented 7 years ago

I've installed pyvips from the master, but this still doesn't work:

$ G_MESSAGES_DEBUG=VIPS VIPS_INFO=1 ./run.py vips_full_cycle

jcupitt commented 7 years ago

I've been working on too many language bindings this summer :( The Ruby one works with env vars, the Python one redirects glib log output to logger, so you need:

import logging
logging.basicConfig(level=logging.INFO)

in your code. With this program:

import os
import logging

os.environ['VIPS_CONCURRENCY'] = '1'
logging.basicConfig(level=logging.INFO)

from pyvips import Image

im = Image.new_from_file('pineapple.jpeg')
im = im.rot90()
im = im.resize(0.4, kernel='cubic')
im = im.gaussblur(4, min_ampl=0.07)
im.write_to_buffer('.jpg', Q=85)

I see:

john@kiwi:~/try$ python vips.py
INFO:pyvips.vinterpolate:VIPS: residual reducev by 0.4
INFO:pyvips.vinterpolate:VIPS: reducev: 4 point mask
INFO:pyvips.vinterpolate:VIPS: reducev: using vector path
INFO:pyvips.vinterpolate:VIPS: reducev sequential line cache
INFO:pyvips.vinterpolate:VIPS: residual reduceh by 0.4
INFO:pyvips.vinterpolate:VIPS: reduceh: 4 point mask
INFO:pyvips.vinterpolate:VIPS: gaussblur mask width 19
INFO:pyvips.vinterpolate:VIPS: intize_to_fixed_point: too many underflows
INFO:pyvips.vinterpolate:VIPS: convi: using C path
INFO:pyvips.vinterpolate:VIPS: intize_to_fixed_point: too many underflows
INFO:pyvips.vinterpolate:VIPS: convi: using C path

So it's hitting the vector path for vertical reduce, but falling off vector for the gaussblur. I'll try to fix that.

jcupitt commented 7 years ago

On operation order, sharp does the rotate after the shrink. It has a fixed (horribly complicated!) pipeline and the methods set parameters to that. You can see the pipeline here:

https://github.com/lovell/sharp/blob/master/src/pipeline.cc#L456

I've not been involved with bimg, but I agree it seems to rotate and flip at the start, which is wrong. I'll open an issue, thanks for pointing this out.

imgproxy seems to have copied the logic from bimg, I'll open an issue there too.

homm commented 7 years ago

This is my output:

INFO:pyvips.vinterpolate:VIPS: shrinkv by 2
INFO:pyvips.vinterpolate:VIPS: shrinkv sequential line cache
INFO:pyvips.vinterpolate:VIPS: shrinkh by 2
INFO:pyvips.vinterpolate:VIPS: residual reducev by 0.8
INFO:pyvips.vinterpolate:VIPS: 4 point mask
INFO:pyvips.vinterpolate:VIPS: using vector path
INFO:pyvips.vinterpolate:VIPS: reducev sequential line cache
INFO:pyvips.vinterpolate:VIPS: residual reduceh by 0.8
INFO:pyvips.vinterpolate:VIPS: 4 point mask
INFO:pyvips.vinterpolate:VIPS: gaussblur mask width 19
INFO:pyvips.vinterpolate:VIPS: intize_to_fixed_point: too many underflows
INFO:pyvips.vinterpolate:VIPS: using C path
INFO:pyvips.vinterpolate:VIPS: intize_to_fixed_point: too many underflows
INFO:pyvips.vinterpolate:VIPS: using C path

jcupitt commented 7 years ago

OK, I think I see the problem, let me hack on it for a few days.

Thank you for looking into all this, it's good to uncover these horrors.

homm commented 7 years ago

Do I understand correctly that this is a bug which should be fixed in libvips sources and there is no way to fix it from pyvips code or using other settings? So I can merge last part and this is the final results for current VIPS version?

jcupitt commented 7 years ago

Yes, I think it can't be fixed with current libvips. Your test has unfortunately hit two weak points: cubic downsize is slower than it should be (as you pointed out, it's not properly adaptive), and the current vector path for convolution can't do large radius gaussblur.

I think it's not much work to resolve this, but it'll be 8.6, so please go ahead with the results you have.

Here are the issues for moving the rotate to the end of the resize pipeline, for reference:

https://github.com/h2non/bimg/issues/201

https://github.com/DarthSim/imgproxy/issues/24

homm commented 7 years ago

Clear.

I'm still wondering do you have any explanation why the build from Ubuntu packages works in a different way and much faster?

I've subscribed to the rotation issues, will follow.

I have one more question. Is there any way in pyvips to read image properties before load it? For example, when you are calling Image.open in Pillow, it just returns ImageFile object with basic properties (like info, mode, width, height) without decoding the whole image.

I think very similar could be achieved with access='sequential', but separate method could be significant faster:

In [9]: %timeit pyvips.Image.new_from_file('../imgs/pineapple2k.jpeg', access='sequential')
100 loops, best of 3: 2.89 ms per loop

In [12]: %timeit Image.open('../imgs/pineapple2k.jpeg')
1000 loops, best of 3: 314 µs per loop

jcupitt commented 7 years ago

Sorry, I don't know why the Ubuntu packages are faster for you. They seem to run the same speed here.

Yes, libvips just loads image properties on new_from_file, no pixels are decoded until they are needed. I guess the overhead you are seeing is FFI and the operation call. I ought to profile the Python code and see if there are any simple speedups I'm missing --- it's quite a bit slower than the old C-based Python binding.

I tried the same test in C:

/* Compile with:
 *
 *      gcc open.c `pkg-config vips --cflags --libs`
 *
 */

#include <vips/vips.h>

int
main( int argc, char **argv )
{       
        VipsImage *im;
        int i;

        if( VIPS_INIT( argv[0] ) )
                vips_error_exit( NULL );

        vips_cache_set_max( 0 );

        for( i = 0; i < 10000; i++ ) {
                im = vips_image_new_from_file( argv[1], NULL );
                g_object_unref( im );
        }

        return( 0 );
}

And I see:

$ time ./a.out ~/pics/k2.jpg$ time ./a.out ~/pics/k2.jpg
real    0m3.152s
user    0m2.724s
sys 0m0.416s

So also about 300us per open.

homm commented 7 years ago

@jcupitt Funny thing. I've tried to rearrange operation like you suggest:

    def runner(self):
        im = Image.new_from_file(root('resources', self.filename))

        if self.level > 1:
            im = self.resize(im, 0.4, kernel='cubic')

        if self.level > 2:
            im = im.gaussblur(4, min_ampl=0.07)

        if self.level > 0:
            im = im.rot90()

        im.write_to_buffer('.'+self.filetype, Q=85)

And get significant slowdown on the build from sources:

With autorotate:

Vips full cycle 2560×1600 RGB image
    Load+save           0.06201 s    66.06 Mpx/s
    +transpose          0.08122 s    50.43 Mpx/s
    +resize             0.17138 s    23.90 Mpx/s
    +blur               0.26338 s    15.55 Mpx/s

With rotation on the end

Vips full cycle 2560×1600 RGB image
    Load+save           0.06253 s    65.50 Mpx/s
    +transpose          0.06960 s    58.85 Mpx/s
    +resize             0.36972 s    11.08 Mpx/s
    +blur               0.52834 s     7.75 Mpx/s

And new perf output:

jcupitt commented 7 years ago

I made a couple of test programs to show the effect of swapping the pipeline around.

Here's the way it's arranged now, though I switched the kernel to the default (lanczos3), since it actually works properly, and dropped the sigma to 2 to stay on the vector path:

import os
import sys

os.environ['VIPS_CONCURRENCY'] = '1'

#import logging
#logging.basicConfig(level=logging.INFO)

import pyvips

im = pyvips.Image.new_from_file(sys.argv[1])
im = im.rot90()
im = im.resize(0.4)
im = im.gaussblur(2, min_ampl=0.07)
im.write_to_buffer('.jpg', Q=85)

I see:

$ time python vips.py pineapple.jpeg 
real    0m0.278s
user    0m0.260s
sys 0m0.024s

If I rearrange it like this (as vips thumbnail and sharp do it):

import os
import sys

os.environ['VIPS_CONCURRENCY'] = '1'

#import logging
#logging.basicConfig(level=logging.INFO)

import pyvips

im = pyvips.Image.new_from_file(sys.argv[1], access='sequential')
im = im.resize(0.4)
im = im.gaussblur(2, min_ampl=0.07)
im = im.copy_memory()
im = im.rot90()
im.write_to_buffer('.jpg', Q=85)

I see:

$ time python vips2.py pineapple.jpeg 
real    0m0.257s
user    0m0.220s
sys 0m0.036s

So a bit quicker, though it's not dramatic. Memory use will be lower since it won't need to decode the whole image at the start.

With a 10k x 10k JPEG it's much more impressive:

$ time python vips.py ~/pics/wtc.jpg
real    0m9.063s
user    0m4.452s
sys 0m4.740s
$ time python vips2.py ~/pics/wtc.jpg
real    0m2.048s
user    0m2.056s
sys 0m0.068s

More than four times faster, and memory use is much lower.

jcupitt commented 7 years ago

Sorry, I've still no insight into why the packages are faster :(

I'll try to investigate later, it would be useful to know.

jcupitt commented 7 years ago

I tried a Pillow one:

import sys
from PIL import Image, ImageFilter
from io import BytesIO

im = Image.open(sys.argv[1])
im = im.transpose(Image.ROTATE_270)
size = (int(im.size[0] * 0.4 + 0.5),
        int(im.size[1] * 0.4 + 0.5))
im = im.resize(size, Image.LANCZOS)
im = im.filter(ImageFilter.GaussianBlur(2))
im.save(BytesIO(), 'JPEG', quality=85)

Is that two-lobe or three-lobe lanczos, by the way?

Anyway, for pineapple, Pillow wins, but with larger images, vips edges ahead:

$ time python pillow.py pineapple.jpeg 
real    0m0.168s
user    0m0.144s
sys 0m0.020s
$ time python pillow.py ~/pics/wtc.jpg
real    0m2.275s
user    0m2.044s
sys 0m0.228s

I'm not sure why vips.py is so slow on large images (4.7s sys time!), I should check that too.

jcupitt commented 7 years ago

Oh heh, of course if I put the rot90 at the end, Pillow speeds up too:

$ time python pillow.py ~/pics/wtc.jpg
real    0m2.042s
user    0m1.860s
sys 0m0.180s

So pretty much a draw.

homm commented 7 years ago

Is that two-lobe or three-lobe lanczos, by the way?

In Pillow Lanczos is 3-lobe.

jcupitt commented 7 years ago

To see how much was libvips and how much was pyvips, I tried in C:

/* Compile with:
 *
 *      gcc vips2.c `pkg-config vips --cflags --libs`
 *
 */

#include <vips/vips.h>

int
main( int argc, char **argv )
{
        VipsImage *im;
        VipsImage *x;
        void *buf;
        size_t size;

        if( VIPS_INIT( argv[0] ) )
                vips_error_exit( NULL );

        if( !(im = vips_image_new_from_file( argv[1],
                "access", VIPS_ACCESS_SEQUENTIAL, 
                NULL )) )
                vips_error_exit( NULL );

        if( vips_resize( im, &x, 0.4, NULL ) )
                vips_error_exit( NULL );
        g_object_unref( im );
        im = x;

        if( vips_gaussblur( im, &x, 2.0, 
                "min_ampl", 0.07,
                NULL ) )
                vips_error_exit( NULL );
        g_object_unref( im );
        im = x;

        im = vips_image_copy_memory( im ); 

        if( vips_rot90( im, &x, NULL ) )
                vips_error_exit( NULL );
        g_object_unref( im );
        im = x;

        if( vips_image_write_to_buffer( im, ".jpg", &buf, &size,
                "Q", 85,
                NULL ) )
                vips_error_exit( NULL );

        g_object_unref( im );

        return( 0 );
}

I see:

$ time ./a.out pineapple.jpeg 
real    0m0.118s
user    0m0.100s
sys 0m0.020s
$ time ./a.out ~/pics/wtc.jpg
real    0m1.856s
user    0m1.892s
sys 0m0.040s

So it looks like pyvips slowness is a factor here too.

jcupitt commented 7 years ago

Hello again, I fixed up git master libvips a bit more. Cubic and linear resize are now properly adaptive, and I switched the convolution vector path from fixed point to half float. If I run git master pillow-perf I see:

$ ./run.py full_cycle
Full cycle 2560×1600 RGB image
    Load+save           0.06154 s    66.56 Mpx/s
    +transpose          0.07465 s    54.87 Mpx/s
    +resize             0.06081 s    67.35 Mpx/s
    +blur               0.07642 s    53.60 Mpx/s
$ ./run.py vips_full_cycle
Vips full cycle 2560×1600 RGB image
    Load+save           0.06587 s    62.19 Mpx/s
    +transpose          0.08125 s    50.41 Mpx/s
    +resize             0.08176 s    50.10 Mpx/s
    +blur               0.11350 s    36.09 Mpx/s

This should be in 8.6 in a few weeks.

It's not obvious to me why Pillow transpose + resize seems to be faster than just transpose, do you know why that is?

I think the next improvement would be to try to speed up pyvips itself, but it looks like that would involve switching cffi to API mode and generating and building a C extension, which is more work than I have time for right now.

I found out why vips.py was very slow on huge images -- it switches to going via disk if images are over 100mb uncompressed (of course, duh). If you set VIPS_DISC_THRESHOLD to 500mb the speed difference mostly vanishes for the 1-thread case.

I'm still unclear why the official packages are faster

homm commented 7 years ago

It's not obvious to me why Pillow transpose + resize seems to be faster than just transpose, do you know why that is?

Sure. The last operation in all cases is saving. So the larger image to save, the slower the saving operation. After resizing, image is reduced in 6.25 times. If resizing is faster than saving, the whole case with resizing will be faster.

homm commented 7 years ago

it switches to going via disk if images are over 100mb uncompressed

It's not clear to me. Does VIPS switch to disk bufer even with access='sequential'?

jcupitt commented 7 years ago

Oh, of course, the save is smaller.

No, it will never disk buffer in sequential mode, it'll fail with an error instead.

homm commented 7 years ago

No, it will never disk buffer in sequential mode

Yeah, right. I see now

homm commented 7 years ago

I'm getting following compilation error on current master:

composite.cpp: In function ‘void vips_composite_blend3(VipsComposite*, VipsBlendMode, v4f&, T*)’:
composite.cpp:676:23: error: conversion of scalar ‘int’ to vector ‘v4f {aka __vector(4) float}’ involves truncation
    f = g > 0 ? g : -1 * g;
                       ^
composite.cpp:814: confused by earlier errors, bailing out

homm commented 7 years ago

After change -1 * g to -g I get following:

composite.cpp: In function ‘void vips_combine_pixels3(VipsComposite*, VipsPel*, VipsPel**) [with T = unsigned char; long int min_T = 0l; long int max_T = 255l; VipsComposite = _VipsComposite; VipsPel = unsigned char]’:
composite.cpp:814:3: internal compiler error: in gimplify_expr, at gimplify.c:8858
   B = VIPS_CLIP( low, B, high );
   ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609

jcupitt commented 7 years ago

Huh, strange, it builds cleanly on 17.04 with gcc 6.3. I'll try 5.4. Thanks for letting me know.

jcupitt commented 7 years ago

It builds cleanly in gcc-5.4.1 with -Wall. I'll try clang.

jcupitt commented 7 years ago

composite.cpp is using gcc vector mode (the __attribute__(vector_size()) stuff) and it seems to be a bit broken on gcc 5.4.0. I've updated configure to disable vector extensions on 4.x and 5.x.

Thanks!

homm commented 7 years ago

Yeah, It runs much faster on the laptop:

libvips 8.5.0:
Vips full cycle 2560×1600 RGB image
    Load+save           0.06354 s    64.46 Mpx/s
    +transpose          0.08367 s    48.96 Mpx/s
    +resize             0.17549 s    23.34 Mpx/s
    +blur               0.26749 s    15.31 Mpx/s

libvips 8.6.0:
Vips full cycle 2560×1600 RGB image
    Load+save           0.06294 s    65.07 Mpx/s
    +transpose          0.08452 s    48.46 Mpx/s
    +resize             0.09042 s    45.30 Mpx/s
    +blur               0.11938 s    34.31 Mpx/s

But there is still some problem on the same OS, the same compiler and very similar CPU (Ubuntu 16.04 64-bit, GCC 5.4, Intel Haswell generation) on the AWS EC2 c4.large server:

libvips 8.2.?:
Vips full cycle 2560×1600 RGB image
    Load+save           0.05109 s    80.18 Mpx/s
    +transpose          0.06705 s    61.09 Mpx/s
    +resize             0.13450 s    30.45 Mpx/s
    +blur               0.23270 s    17.60 Mpx/s

libvips 8.6.0:
Vips full cycle 2560×1600 RGB image
    Load+save           0.05029 s    81.45 Mpx/s
    +transpose          0.06930 s    59.10 Mpx/s
    +resize             0.13910 s    29.45 Mpx/s
    +blur               0.18010 s    22.74 Mpx/s

Blur becomes faster, resize doesn't.

Perf report on the laptop:

c4.large:

Reports are quite different.

jcupitt commented 7 years ago

That's very odd. Could you try running with INFO logging enabled?

I'll try on the big i7 at work tomorrow.

homm commented 7 years ago

It looks the same.

Laptop:

$ ./testsuite/run.py vips_full_cycle -n1
(8, 6, 0, 50)

Vips full cycle 2560×1600 RGB image
    Load+save           0.08717 s    46.99 Mpx/s
    +transpose          0.09384 s    43.65 Mpx/s
INFO:pyvips.vinterpolate:VIPS: residual reducev by 0.4
INFO:pyvips.vinterpolate:VIPS: reducev: 11 point mask
INFO:pyvips.vinterpolate:VIPS: reducev: using vector path
INFO:pyvips.vinterpolate:VIPS: reducev sequential line cache
INFO:pyvips.vinterpolate:VIPS: residual reduceh by 0.4
INFO:pyvips.vinterpolate:VIPS: reduceh: 11 point mask
    +resize             0.11130 s    36.80 Mpx/s
INFO:pyvips.vinterpolate:VIPS: residual reducev by 0.4
INFO:pyvips.vinterpolate:VIPS: reducev: 11 point mask
INFO:pyvips.vinterpolate:VIPS: reducev: using vector path
INFO:pyvips.vinterpolate:VIPS: reducev sequential line cache
INFO:pyvips.vinterpolate:VIPS: residual reduceh by 0.4
INFO:pyvips.vinterpolate:VIPS: reduceh: 11 point mask
INFO:pyvips.vinterpolate:VIPS: gaussblur mask width 19
INFO:pyvips.vinterpolate:VIPS: convi: using vector path
INFO:pyvips.vinterpolate:VIPS: convi: using vector path
    +blur               0.12818 s    31.95 Mpx/s

c4.large:

$ ./testsuite/run.py vips_full_cycle -n1
(8, 6, 0, 50)

Vips full cycle 2560×1600 RGB image
    Load+save           0.06522 s    62.81 Mpx/s
    +transpose          0.06863 s    59.68 Mpx/s
INFO:pyvips.vinterpolate:VIPS: residual reducev by 0.4
INFO:pyvips.vinterpolate:VIPS: reducev: 11 point mask
INFO:pyvips.vinterpolate:VIPS: reducev: using vector path
INFO:pyvips.vinterpolate:VIPS: reducev sequential line cache
INFO:pyvips.vinterpolate:VIPS: residual reduceh by 0.4
INFO:pyvips.vinterpolate:VIPS: reduceh: 11 point mask
    +resize             0.14126 s    29.00 Mpx/s
INFO:pyvips.vinterpolate:VIPS: residual reducev by 0.4
INFO:pyvips.vinterpolate:VIPS: reducev: 11 point mask
INFO:pyvips.vinterpolate:VIPS: reducev: using vector path
INFO:pyvips.vinterpolate:VIPS: reducev sequential line cache
INFO:pyvips.vinterpolate:VIPS: residual reduceh by 0.4
INFO:pyvips.vinterpolate:VIPS: reduceh: 11 point mask
INFO:pyvips.vinterpolate:VIPS: gaussblur mask width 19
INFO:pyvips.vinterpolate:VIPS: convi: using vector path
INFO:pyvips.vinterpolate:VIPS: convi: using vector path
    +blur               0.18203 s    22.50 Mpx/s

libvips / pyvips

pyvips results in pillow-perf benchmarks #21