[BUG] PIL.UnidentifiedImageError: cannot identify image file

sz3029 commented 8 months ago

Description

I'm unable to generate the PDF report for my tile extraction due to an image identification error. I'm not sure whether this is related to the version of my Pillow package (10.0.1) or to Slideflow. Any suggestions would be great, thank you very much.

To Reproduce

I used the following code:

import os
import slideflow as sf

P = sf.load_project(
    os.path.join(
        "/projects/aieng_pmb/AI_HRD/slideflow_ML/CLAM",
        data_description,
        backbone_model,
        magnification,
    )
)
dataset_all = P.dataset(
    tile_px=256,  # Tile size, in pixels.
    tile_um=64,  # 0.2632 * 256 Tile size, in microns or magnification.
    min_tiles=8,
)
dataset_all.tfrecord_report("/path/to/report")

I also tried to generate the report directly when I extracted the tiles, with

P.extract_tiles(
    tile_px=256,  # size of tile in pixels
    tile_um=magnification,  # size of tile in micrometers
    whitespace_fraction=0.5,  # discard tiles with this fraction of whitespace
    num_threads=4,  # number of threads
    qc="otsu",
    grayspace_fraction=1,
    save_tiles=True,
    report=True,
    # by default tfrecords = true and image_format = '.jpg'
    # save files to tfrecords in .jpg format
)

But both tfrecord_report and extract_tiles produce the following error:

[19:00:14] INFO     Generating PDF (this may take some time)...
/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/fpdf/fpdf.py:1918: UserWarning: Substituting font arial by core font helvetica
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/slideflow/dataset.py", line 3316, in tfrecord_report
    pdf_report = ExtractionReport(reports, title='TFRecord Report')
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/slideflow/slide/report.py", line 433, in __init__
    pdf.image(
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/fpdf/fpdf.py", line 261, in wrapper
    return fn(self, *args, **kwargs)
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/fpdf/fpdf.py", line 3748, in image
    name, img, info = self.preload_image(name, dims)
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/fpdf/fpdf.py", line 3843, in preload_image
    info = ImageInfo(get_img_info(name, img, self.image_filter, dims))
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/fpdf/image_parsing.py", line 121, in get_img_info
    img = Image.open(img_raw_data)
  File "/home/user/.conda/envs/slideflow/lib/python3.9/site-packages/PIL/Image.py", line 3280, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x2afe65a2bc20>

Environment:

Slideflow Version (e.g., 1.0): 2.1.1
Pillow version: 10.0.1
Slide image format: .svs
OS (e.g., Ubuntu): Linux

jamesdolezal commented 8 months ago

Thanks for the report! It's a bit difficult to tell at this point if it's a bug, an environmental issue, or a problem with a corrupt image. I've just pushed an update that should gracefully handle this error and skip culprit images.

Try pulling the latest update on the master branch and see if this helps. If the problem is a corrupt image, you should see only a single error message and the rest of the report should generate successfully. If you see hundreds of error messages and the PDF report is empty, it could be an environmental error due to mismatched package versions, or a bug.

We have environments using Pillow 10.0.1, so I don't suspect that package is the problem.
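
The single-error versus many-errors distinction above can also be checked by hand before generating a report. The following is a minimal sketch, assuming the tile images are available as raw byte strings and are stored as JPEG or PNG. The helper names (looks_like_image, split_valid) are hypothetical, and this is a rough pre-check on file signatures, not the way Slideflow or Pillow validates images.

```python
from typing import Iterable, List, Tuple

# File signatures ("magic bytes") that open every JPEG and PNG file.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def looks_like_image(data: bytes) -> bool:
    """Return True if the bytes start with a JPEG or PNG signature."""
    return data.startswith(JPEG_MAGIC) or data.startswith(PNG_MAGIC)


def split_valid(images: Iterable[bytes]) -> Tuple[List[bytes], List[int]]:
    """Separate plausible images from suspect ones.

    Returns the byte strings that pass the signature check, plus the
    positions of the ones that fail, so they can be inspected or skipped.
    """
    valid: List[bytes] = []
    bad_indices: List[int] = []
    for index, data in enumerate(images):
        if looks_like_image(data):
            valid.append(data)
        else:
            bad_indices.append(index)
    return valid, bad_indices
```

If only one or two positions come back as suspect, a corrupt image is the likelier explanation. If nearly every entry fails, the cause is more likely to be the environment or the way the images are being read. A signature check only inspects the first few bytes, so a truncated file can still pass it.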

sz3029 commented 8 months ago

Hi James, apologies for the delay, and thank you for the quick fix. I pulled the update and tried to install from source with pip install dist/slideflow* cupy-cuda11x, but I got a conflict message:

Processing ./dist/slideflow-2.1.1+1.gc74c2314-py3-none-any.whl
Processing ./dist/slideflow-2.1.1+1.gc74c231-py3-none-any.whl
Processing ./dist/slideflow-2.1.1+9.ga105e2cb.dirty-py3-none-any.whl
ERROR: Cannot install slideflow 2.1.1+1.gc74c231 (from /user/slideflow/dist/slideflow-2.1.1+1.gc74c231-py3-none-any.whl) and slideflow 2.1.1+1.gc74c2314 (from /user/slideflow/dist/slideflow-2.1.1+1.gc74c2314-py3-none-any.whl) because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested slideflow 2.1.1+1.gc74c2314 (from /user/slideflow/dist/slideflow-2.1.1+1.gc74c2314-py3-none-any.whl)
    The user requested slideflow 2.1.1+1.gc74c231 (from /user/slideflow/dist/slideflow-2.1.1+1.gc74c231-py3-none-any.whl)

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

There seem to be two versions. My current version of Slideflow is 2.1.1+9.ga105e2cb.dirty, and I'm using cuCIM and the PyTorch backend. Which of the versions should I choose for the update? Thank you!

jamesdolezal commented 8 months ago

In this case, specifying the --upgrade flag should work:

pip install --upgrade dist/slideflow*
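
The pip output above shows that the dist/slideflow* pattern matched several wheels left in dist/ from earlier builds, so pip was asked to install more than one Slideflow version at once. As an alternative to a wildcard, a single wheel can be selected explicitly. This is a minimal sketch, assuming the most recently modified wheel is the one that was just built; the function name newest_wheel and the default pattern are hypothetical.

```python
import glob
import os
from typing import Optional


def newest_wheel(dist_dir: str, pattern: str = "slideflow*.whl") -> Optional[str]:
    """Return the most recently modified wheel in dist_dir, or None if none match."""
    wheels = glob.glob(os.path.join(dist_dir, pattern))
    if not wheels:
        return None
    # Modification time is used as a stand-in for "latest build".
    return max(wheels, key=os.path.getmtime)
```

The returned path can then be passed to pip install --upgrade in place of the wildcard. Clearing stale wheels out of dist/ before rebuilding avoids the ambiguity altogether.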

sz3029 commented 8 months ago

Thank you! I was able to resolve this with the current update.
