madler / zlib

A massively spiffy yet delicately unobtrusive compression library.
http://zlib.net/

Issue with ZLIB Compressed Files Resulting in Format Error or Corrupted File #880

Closed LudoCorporateShark closed 9 months ago

LudoCorporateShark commented 9 months ago

Problem Description: Dear ZLIB team, I am working on a project that compresses PDF files with the ZLIB library. Compression itself appears to work correctly: decompressing the output succeeds, and the result matches the original PDF content. However, when I try to open the compressed files, I get format errors or corrupted-file errors.

Please help me understand what I did wrong. I have tried many different compression libraries, and yours is by far the best I have come across.

Code Overview

import os
import subprocess
import zlib
import pandas as pd
from config import INPUT_DATAFRAME, SHEET_NAME, INPUT_COLUMN, OUTPUT_CONVERSION

def compress_data(input_data, compression_level=-1, window_bits=zlib.MAX_WBITS):
    try:
        compressor = zlib.compressobj(compression_level, zlib.DEFLATED, window_bits)
        compressed_data = compressor.compress(input_data) + compressor.flush(zlib.Z_FINISH)
        return compressed_data, None
    except zlib.error as e:
        return None, e

def decompress_data(compressed_data):
    try:
        decompressed_data = zlib.decompress(compressed_data, wbits=zlib.MAX_WBITS)
        return decompressed_data, None
    except zlib.error as e:
        return None, e

def convert_tif_to_pdf(input_file_path, output_file_path):
    # Create the output directory if it doesn't exist
    os.makedirs(OUTPUT_CONVERSION, exist_ok=True)

    # Use tiff2pdf for lossless compression
    subprocess.run([r'C:\Program Files (x86)\GnuWin32\bin\tiff2pdf.exe', '-o', output_file_path, input_file_path], check=True)

    # Read the converted (not yet zlib-compressed) PDF content
    with open(output_file_path, 'rb') as pdf_file:
        pdf_content = pdf_file.read()

    # Compress the PDF content using zlib
    compressed_pdf, compress_error = compress_data(pdf_content)

    if compress_error:
        print(f"Compression error: {compress_error}")
    else:
        print(f"Compressed data size: {len(compressed_pdf)} bytes")

        # Save the compressed PDF content
        compressed_output_path = os.path.splitext(output_file_path)[0] + '_compressed.pdf'
        with open(compressed_output_path, 'wb') as compressed_file:
            compressed_file.write(compressed_pdf)

        print(f"Saved compressed PDF to: {compressed_output_path}")

        # Decompress the PDF content
        decompressed_pdf, decompress_error = decompress_data(compressed_pdf)

        if decompress_error:
            print(f"Decompression error: {decompress_error}")
        else:
            print(f"Decompressed data size: {len(decompressed_pdf)} bytes")
            print(f"Decompression successful. Match: {decompressed_pdf == pdf_content}")

            # Save the decompressed PDF content
            decompressed_output_path = os.path.splitext(compressed_output_path)[0] + '_decompressed.pdf'
            with open(decompressed_output_path, 'wb') as decompressed_file:
                decompressed_file.write(decompressed_pdf)

            print(f"Saved decompressed PDF to: {decompressed_output_path}")

if __name__ == "__main__":
    # Load DataFrame from Excel using config values
    df = pd.read_excel(INPUT_DATAFRAME, sheet_name=SHEET_NAME)

    # Ensure the specified column exists in the DataFrame
    if INPUT_COLUMN not in df.columns:
        print(f"Error: Column '{INPUT_COLUMN}' not found in DataFrame.")
    else:
        # Iterate through each row and convert TIF to PDF
        for index, row in df.iterrows():
            file_path = row[INPUT_COLUMN]

            # Check if the file exists
            if os.path.exists(file_path):
                # Assume the PDF will be saved in the OUTPUT_CONVERSION directory
                # splitext handles both .tif and .tiff (chained replace turned .tiff into .pdff)
                pdf_output_path = os.path.join(OUTPUT_CONVERSION, os.path.splitext(os.path.basename(file_path))[0] + '.pdf')

                # Convert TIF to PDF with lossless compression and decompression
                convert_tif_to_pdf(file_path, pdf_output_path)

            else:
                print(f"File not found: {file_path}")
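
The output-name step can be checked in isolation. The sketch below is only an illustration: `pdf_name_for` and `chained_replace_name` are hypothetical helper names, and the sample paths are made up. It compares `os.path.splitext` with replacing `.tif` before `.tiff`, which mangles the longer extension.

```python
import os

def pdf_name_for(tiff_path: str) -> str:
    """Return the PDF file name for a .tif or .tiff path (hypothetical helper)."""
    stem, _ext = os.path.splitext(os.path.basename(tiff_path))
    return stem + ".pdf"

def chained_replace_name(tiff_path: str) -> str:
    """Replace '.tif' first, then '.tiff'; shown only for comparison."""
    return os.path.basename(tiff_path.replace(".tif", ".pdf").replace(".tiff", ".pdf"))

# Made-up sample paths, not taken from the log output.
for sample in ("scans/page.tif", "scans/page.tiff"):
    print(sample, pdf_name_for(sample), chained_replace_name(sample))
```

With the `.tiff` sample, the chained version leaves a trailing `f` because the first replacement already consumed `.tif`.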

Observation: Decompression works without any problems. However, the compressed files themselves are reported as having format errors when I attempt to open them, even though they decompress successfully.
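
A possible way to examine this observation is to compare the first bytes of each file. A PDF viewer generally expects a file to begin with the `%PDF-` signature, while a zlib stream begins with its own two-byte header. The sketch below is a minimal illustration under assumptions: `fake_pdf` is a made-up stand-in for the bytes read from the converted PDF, and `looks_like_pdf` is a hypothetical helper, not part of zlib.

```python
import zlib

# Hypothetical stand-in for the bytes read from the converted PDF.
fake_pdf = b"%PDF-1.4\n" + b"example content " * 100 + b"\n%%EOF\n"

def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF signature."""
    return data.startswith(b"%PDF-")

compressed = zlib.compress(fake_pdf)    # zlib stream, same container as compressobj with MAX_WBITS
restored = zlib.decompress(compressed)  # round trip back to the original bytes

print("original looks like PDF:  ", looks_like_pdf(fake_pdf))
print("compressed looks like PDF:", looks_like_pdf(compressed))
print("restored looks like PDF:  ", looks_like_pdf(restored))
print("first two compressed bytes:", compressed[:2].hex())
```

If the compressed bytes do not start with `%PDF-`, a file holding them is a zlib stream rather than a PDF, whatever its extension says.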

Terminal Output Example:

(.venv) PS C:\Users\test> python "C:\Users\test\Tiff_Pdf_Conversion\tiff_to_pdf_test copy.py"
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
Compressed data size: 5079 bytes
Saved compressed PDF to: S:\TiffConversion\00000001_NEWENVE_compressed.pdf
Decompressed data size: 8017 bytes
Decompression successful. Match: True
Saved decompressed PDF to: S:\TiffConversion\00000001_NEWENVE_compressed_decompressed.pdf
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
Compressed data size: 4952510 bytes
Saved compressed PDF to: S:\TiffConversion\dwsample-tiff-1920_compressed.pdf
Decompressed data size: 7377213 bytes
Decompression successful. Match: True
Saved decompressed PDF to: S:\TiffConversion\dwsample-tiff-1920_compressed_decompressed.pdf

Additional Information: The PDF files are initially generated through a conversion process from TIFF format, and the decompressed files match the original content. I've verified that the issue is not related to file naming or encoding.

Question: Are there any known considerations or limitations with ZLIB when compressing certain types of data or file formats, particularly when the data being compressed is a PDF file? I would appreciate any insights or suggestions on how to troubleshoot and resolve this issue. Thank you for your assistance!
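
For context on the question, zlib treats its input as opaque bytes, so the compressed output is a zlib stream regardless of what was compressed. One possible workflow, sketched below under assumptions, is to store that stream under a separate extension and decompress it before opening. The `.zz` suffix, the helper names `compress_file` and `restore_file`, and the placeholder bytes are all hypothetical choices for illustration.

```python
import os
import tempfile
import zlib

def compress_file(src_path: str) -> str:
    """Write a zlib-compressed copy of src_path and return the new path."""
    dst_path = src_path + ".zz"  # hypothetical extension marking a zlib stream
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(zlib.compress(data))
    return dst_path

def restore_file(compressed_path: str, out_path: str) -> None:
    """Decompress a zlib stream from compressed_path into out_path."""
    with open(compressed_path, "rb") as src:
        data = zlib.decompress(src.read())
    with open(out_path, "wb") as dst:
        dst.write(data)

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes standing in for a real PDF.
    original = b"%PDF-1.4\n" + b"placeholder page data " * 50 + b"%%EOF\n"
    pdf_path = os.path.join(tmp, "sample.pdf")
    with open(pdf_path, "wb") as f:
        f.write(original)

    zz_path = compress_file(pdf_path)
    restored_path = os.path.join(tmp, "sample_restored.pdf")
    restore_file(zz_path, restored_path)

    with open(restored_path, "rb") as f:
        restored = f.read()
    zz_name = os.path.basename(zz_path)

print("compressed file name:", zz_name)
print("round trip matches:", restored == original)
```

Only the restored file would be expected to open in a PDF viewer; the `.zz` file is an intermediate storage form.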

madler commented 9 months ago

This is not a zlib development issue. Please post your question on stackoverflow.com.