libvips / pyvips

python binding for libvips using cffi
MIT License

tif image convert to pyramidal tiff format #259

Open InfinityBox opened 3 years ago

InfinityBox commented 3 years ago

Hi, I have a TIFF image whose size is 54757x67953. I tried to use tiffsave() and write_to_file() to convert the image to pyramidal TIFF format, but both reported the same error:

LZWDecode: Corrupted LZW table at scanline 52134 tiff2vips: read error

jcupitt commented 3 years ago

Hi @InfinityBox,

I would guess your TIFF has been truncated, and libtiff is failing to decompress the final few strips. Could you share the image file?

InfinityBox commented 3 years ago

It is about 5.5 GB, and I don't know how to share it. But after your reply, I think I know the cause of the problem: the image is truncated. Thank you!

abubelinha commented 2 years ago

As this issue is still open and the subject matches my question perfectly, I hope you don't mind if I reuse it.

I have always done this task with the CLI version of libvips, using the following syntax:

vips tiffsave "C:\i1.tif" "C:\i1.tif.p.tif" --compression jpeg --Q 50 --tile --tile-width 256 --tile-height 256 --pyramid
vips tiffsave "C:\i2.jpg" "C:\i2.jpg.p.tif" --compression jpeg --Q 50 --tile --tile-width 256 --tile-height 256 --pyramid

I want to translate that into pyvips syntax. Looking at other issues (#179, #289), I think the relevant functions are Image.new_from_file and Image.tiffsave. So I tried this:

import pyvips
images_list = ["C:\i1.tif", "C:\i2.jpg"]
for i in images_list:
   image = pyvips.Image.new_from_file(i)
   outputfile = i + '.p.tif'
   image.tiffsave(outputfile, compression='JPEG', tile=True, tile_width=256, tile_height=256, pyramid=True)
-------------------------------------------------------------------------------------
pyvips.error.Error: no value JPEG in gtype VipsForeignTiffCompression (54418784)
  pyvips: enum 'VipsForeignTiffCompression' has no member 'JPEG', should be one of:
  none, jpeg, deflate, packbits, ccittfax4, lzw, webp, zstd, jp2k

  1. I followed the documentation, which shows all those values in uppercase. Why is that?

  2. Are default parameter values (e.g. tile_width and tile_height) documented somewhere?

  3. I am pretty new to Python, so I take the opportunity to ask whether my code can be optimized (I have no idea if the image must be closed somehow, for example), especially because I have a very long list of files to convert.

Many thanks in advance @jcupitt @abubelinha

jcupitt commented 2 years ago

Hi @abubelinha,

The main docs are here:

https://www.libvips.org/API/current/VipsForeignSave.html#vips-tiffsave

You need eg.:

image.tiffsave("x.tif", compression="jpeg", Q=50, tile=True, tile_width=256, tile_height=256, pyramid=True)

The big saving would be to enable sequential mode, so I'd use:

for filename in ["C:/i1.tif", "C:/i2.jpg"]:
    image = pyvips.Image.new_from_file(filename, access="sequential")
    image.tiffsave(f"{filename}.p.tif", compression="jpeg", Q=50, tile=True, tile_width=256, tile_height=256, pyramid=True)

Note the forward (not back) slashes. You can use python multiprocessing to run several conversions at the same time, which will speed things up further.
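The multiprocessing suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (pyramid_path, convert, convert_all) and the worker count of 4 are hypothetical, and the save options are simply the ones already used in this thread.

```python
from concurrent.futures import ProcessPoolExecutor


def pyramid_path(filename):
    """Output name used in this thread: the input name plus '.p.tif'."""
    return f"{filename}.p.tif"


def convert(filename):
    """Convert one image to a tiled, JPEG-compressed pyramidal TIFF."""
    # Imported inside the worker so each process loads libvips itself.
    import pyvips

    image = pyvips.Image.new_from_file(filename, access="sequential")
    out = pyramid_path(filename)
    image.tiffsave(out, compression="jpeg", Q=50, tile=True,
                   tile_width=256, tile_height=256, pyramid=True)
    return out


def convert_all(filenames, workers=4):
    """Run several conversions at once; workers=4 is an arbitrary choice."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert, filenames))
```

On Windows, the call to convert_all(...) should sit under an `if __name__ == "__main__":` guard, because worker processes re-import the script. The best worker count probably depends on the machine and on how many threads libvips itself is using.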

jcupitt commented 2 years ago

Regarding uppercase, you can pass things like compression as enums or as strings. So eg.:

image.tiffsave("xx", compression=pyvips.enums.ForeignTiffCompression.JPEG)
image.tiffsave("xx", compression="jpeg")

Are equivalent. I find the strings more convenient, but some people prefer enums. Your IDE should autocomplete the enums as you type, so they aren't much more effort.
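If option names sometimes arrive in uppercase (for example copied from the enum names in the docs), a small helper can normalize them before they are passed as strings. This is a hypothetical sketch, not part of pyvips: the function name tiff_compression is made up, and the list of accepted values is copied from the error message earlier in this thread, so it may differ for other libvips builds.

```python
# Values copied from the pyvips error message earlier in this thread;
# the set available on another libvips build may be different.
TIFF_COMPRESSIONS = (
    "none", "jpeg", "deflate", "packbits", "ccittfax4",
    "lzw", "webp", "zstd", "jp2k",
)


def tiff_compression(name):
    """Return the lowercase string form of a TIFF compression name."""
    value = name.strip().lower()
    if value not in TIFF_COMPRESSIONS:
        raise ValueError(
            f"unknown TIFF compression {name!r}, expected one of: "
            + ", ".join(TIFF_COMPRESSIONS)
        )
    return value
```

The result can then be passed as compression=tiff_compression("JPEG") in a tiffsave call.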

abubelinha commented 2 years ago

Thanks a lot for your really helpful comments @jcupitt !

I am a bit surprised that the memory consumption of my (very simple) script was more or less the same when using sequential mode. The speed gain was also small (about 2.75%).

I probably don't know how to measure it correctly. I used memory_profiler like this:

Sequential:

C:\Python38\python -m memory_profiler dvd_vips.py
C:/Temp/1.tif [139398044 bytes] ---> C:/Temp/1.tif.pyr.tif [2446984 bytes]
C:/Temp/2.tif [180014936 bytes] ---> C:/Temp/2.tif.pyr.tif [4059742 bytes]
C:/Temp/3.tif [189432896 bytes] ---> C:/Temp/3.tif.pyr.tif [3983902 bytes]
C:/Temp/4.tif [199467632 bytes] ---> C:/Temp/4.tif.pyr.tif [4939036 bytes]
RUNNING TIME:  55.98920249938965
Filename: dvd_vips.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    10   25.742 MiB   25.742 MiB           1   @profile
    11                                         def pyvipstest():
    12   25.742 MiB    0.000 MiB           1    import os,time
    13   25.746 MiB    0.004 MiB           1    start_time = time.time()
    14   25.746 MiB    0.000 MiB           1    vipshome = config["vipshome"]
    15   25.766 MiB    0.020 MiB           1    os.environ['PATH'] = vipshome + ';' + os.environ['PATH']
    16   39.445 MiB   13.680 MiB           1    import pyvips
    17   39.445 MiB    0.000 MiB           1    path = "C:/Temp"
    18   39.445 MiB    0.000 MiB           1    images_list = os.listdir(path)
    19   44.910 MiB   -4.805 MiB           7    for i in images_list:
    20   44.910 MiB   -3.203 MiB           6            inputfile = path + '/' + i
    21   44.910 MiB   -3.203 MiB           6            outputfile = path + '/' + i + '.pyr.tif'
    22   44.910 MiB   -3.203 MiB           6            if os.path.isfile(inputfile):
    23   44.992 MiB    0.789 MiB           4                    image = pyvips.Image.new_from_file(inputfile, access="sequential")
    24   44.910 MiB    3.070 MiB           4                    image.tiffsave(outputfile, compression='jpeg', Q=70, tile=True, tile_width=256, tile_height=256, pyramid=True)
    25   44.910 MiB   -3.199 MiB           8                    print("{} [{} bytes] ---> {} [{} bytes]" \
    26   44.910 MiB   -1.602 MiB           4                            .format(inputfile, str(os.stat(inputfile).st_size) , outputfile , str(os.stat(outputfile).st_size)))
    27   43.309 MiB   -1.602 MiB           1    end_time = time.time()
    28   43.309 MiB    0.000 MiB           1    print("RUNNING TIME: ", end_time - start_time)

Non sequential:

C:\Python38\python -m memory_profiler dvd_vips.py
C:/Temp/1.tif [139398044 bytes] ---> C:/Temp/1.tif.pyr.tif [2446984 bytes]
C:/Temp/2.tif [180014936 bytes] ---> C:/Temp/2.tif.pyr.tif [4059742 bytes]
C:/Temp/3.tif [189432896 bytes] ---> C:/Temp/3.tif.pyr.tif [3983902 bytes]
C:/Temp/4.tif [199467632 bytes] ---> C:/Temp/4.tif.pyr.tif [4939036 bytes]
RUNNING TIME:  57.56929278373718
Filename: dvd_vips.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    10   25.711 MiB   25.711 MiB           1   @profile
    11                                         def pyvipstest():
    12   25.711 MiB    0.000 MiB           1    import os,time
    13   25.715 MiB    0.004 MiB           1    start_time = time.time()
    14   25.715 MiB    0.000 MiB           1    vipshome = config["vipshome"]
    15   25.734 MiB    0.020 MiB           1    os.environ['PATH'] = vipshome + ';' + os.environ['PATH']
    16   39.613 MiB   13.879 MiB           1    import pyvips
    17   39.613 MiB    0.000 MiB           1    path = "C:/Temp"
    18   39.613 MiB    0.000 MiB           1    images_list = os.listdir(path)
    19   43.621 MiB   -0.668 MiB           7    for i in images_list:
    20   43.621 MiB   -0.668 MiB           6            inputfile = path + '/' + i
    21   43.621 MiB   -0.668 MiB           6            outputfile = path + '/' + i + '.pyr.tif'
    22   43.621 MiB   -0.668 MiB           6            if os.path.isfile(inputfile):
    23   42.906 MiB    0.301 MiB           4                    image = pyvips.Image.new_from_file(inputfile)
    24   43.621 MiB    3.512 MiB           4                    image.tiffsave(outputfile, compression='jpeg', Q=70, tile=True, tile_width=256, tile_height=256, pyramid=True)
    25   43.621 MiB   -3.926 MiB           8                    print("{} [{} bytes] ---> {} [{} bytes]" \
    26   43.621 MiB   -0.668 MiB           4                            .format(inputfile, str(os.stat(inputfile).st_size) , outputfile , str(os.stat(outputfile).st_size)))
    27   43.621 MiB    0.000 MiB           1    end_time = time.time()
    28   43.621 MiB    0.000 MiB           1    print("RUNNING TIME: ", end_time - start_time)

That was a small set of 4 tiff scanned images (A3 sized, 400 dpi, between 139 and 200 MB size, 6500x10000 pixels).

Surprisingly to me, if I add a set of 16 small JPEG mobile phone photographs (<1 MB, 1488x1488 pixels) to my C:/Temp folder and re-run the same script, then I see a big difference:

So, a couple of questions:

  1. sequential: How is it possible that the script runs faster after adding more work (16 JPEG images plus the previous 4 TIFFs)? (54.36 vs the previous 55.98)
  2. non sequential: So is it much harder work to convert small JPEGs than big TIFFs?

Thanks in advance, and sorry about my newbie questions.

jcupitt commented 2 years ago

It's to do with the way libvips opens files. There's a chapter in the docs, if you've not seen it:

https://www.libvips.org/API/current/How-it-opens-files.md.html

It explains what seq mode does and how it affects speed and memory use.

I usually benchmark like this:

#!/usr/bin/python3

import sys
import pyvips

for filename in sys.argv[2:]:
    image = pyvips.Image.new_from_file(filename, access=sys.argv[1])
    image.tiffsave(f"{filename}.p.tif", 
            compression="jpeg", 
            Q=50, 
            tile=True, 
            tile_width=256, 
            tile_height=256, 
            pyramid=True)

Then with a large JPEG image:

$ vipsheader ~/pics/st-francis.jpg
/home/john/pics/st-francis.jpg: 30000x26319 uchar, 3 bands, srgb, jpegload
$ /usr/bin/time -f %M:%e ./convert-pyr.py random ~/pics/st-francis.jpg 
421388:12.28
$ /usr/bin/time -f %M:%e ./convert-pyr.py sequential ~/pics/st-francis.jpg 
468164:8.69

The two numbers are peak memory use in kb and elapsed time in seconds.

But there are a couple of problems: the memory use is not including the temporary file that libvips has to make for random access mode, and there's very little parallelism here, so the libvips threadpool actually makes things slower.

I would turn off threading, and force it to keep the temporary file in memory:

$ VIPS_DISC_THRESHOLD=-1 VIPS_CONCURRENCY=1 /usr/bin/time -f %M:%e ./convert-pyr.py random ~/pics/st-francis.jpg 
2424840:10.24
$ VIPS_DISC_THRESHOLD=-1 VIPS_CONCURRENCY=1 /usr/bin/time -f %M:%e ./convert-pyr.py sequential ~/pics/st-francis.jpg 
156688:7.72

So now it's 2.4 GB for random, 150 MB for seq, and seq is about 25% faster.
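The same kind of measurement can be approximated from Python, which may help where /usr/bin/time is not convenient. The sketch below is an assumption-laden illustration: time_command is a hypothetical helper, it only works on Unix-like systems (os.wait4 is not available on Windows), it needs Python 3.9 or later for os.waitstatus_to_exitcode, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.

```python
import os
import subprocess
import sys
import time


def time_command(argv, extra_env=None):
    """Run argv and return (peak_rss, elapsed_seconds, exit_code).

    peak_rss is ru_maxrss for that one child process: kilobytes on
    Linux, bytes on macOS.
    """
    env = dict(os.environ, **(extra_env or {}))
    start = time.perf_counter()
    proc = subprocess.Popen(argv, env=env)
    # wait4 reaps the child and returns its own resource usage.
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    # Record the exit code so Popen knows the child has finished.
    proc.returncode = os.waitstatus_to_exitcode(status)
    return usage.ru_maxrss, elapsed, proc.returncode
```

For example, time_command([sys.executable, "convert-pyr.py", "sequential", "image.jpg"], {"VIPS_CONCURRENCY": "1", "VIPS_DISC_THRESHOLD": "-1"}) would mirror the second pair of runs above, with image.jpg standing in for a real file.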

jcupitt commented 2 years ago

(this PC has 32 cores, the threading overhead will often be less)

NeelKanwal commented 2 years ago

Hi, I am trying to convert a WSI from scn format to ndpi format, but I get this error:

raise Error('unable to write to file {0}'.format(vips_filename))
pyvips.error.Error: unable to write to file b'H14147-08A HES_2015-07-29 13_00_25.ndpi'
  VipsForeignSave: "H14147-08A HES_2015-07-29 13_00_25.ndpi" is not a known file format

Here is the code:

current_img_400x = vips.Image.new_from_file(os.path.join(directory, fname), level=0, autocrop=True)
current_img_400x.cast("uchar")[0:3].write_to_file(f"{new_fname}.{format_to_save}", tile=True, compression="jpeg", pyramid=True)

Any suggestions on how to do this? I tried loading the same slide with OpenSlide, but it includes some extra white space, which causes trouble with masks. I cannot find an equivalent of the pyvips autocrop option on the OpenSlide object.

jcupitt commented 2 years ago

Hi, libvips can't write .ndpi files, it can only read them.

You could make a standard pyramidal TIFF, would that work?
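A standard pyramidal TIFF could be written with code very close to what was posted above, changing only the output extension. This is a minimal sketch, assuming the slide loads with the same level=0 and autocrop=True options used in the question; the function names pyramid_tiff_name and slide_to_pyramid_tiff are hypothetical.

```python
from pathlib import Path


def pyramid_tiff_name(slide_name):
    """Replace the slide's extension (e.g. .scn) with .tif."""
    return str(Path(slide_name).with_suffix(".tif"))


def slide_to_pyramid_tiff(slide_path):
    """Save a slide as a tiled, JPEG-compressed pyramidal TIFF."""
    # Imported here so the naming helper works without libvips installed.
    import pyvips

    image = pyvips.Image.new_from_file(slide_path, level=0, autocrop=True)
    # Keep the first three bands (drop alpha), as in the question's code.
    rgb = image.cast("uchar")[0:3]
    out = pyramid_tiff_name(slide_path)
    rgb.tiffsave(out, tile=True, compression="jpeg", pyramid=True)
    return out
```

Whether downstream tools accept the resulting TIFF in place of an .ndpi file would need to be checked for the specific viewer or pipeline.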