Problem Description:
Dear ZLIB team,
I am currently working on a project that involves compressing PDF files using the ZLIB library. The compression itself seems to work correctly: the data decompresses successfully and the result matches the original PDF content. However, when I try to open the compressed files, I get format errors or corrupted-file messages.
Please help me understand what I did wrong. I have tried many different compression approaches, and yours is by far the best I have come across.
Code Overview:
import os
import subprocess
import zlib

import pandas as pd

from config import INPUT_DATAFRAME, SHEET_NAME, INPUT_COLUMN, OUTPUT_CONVERSION


def compress_data(input_data, compression_level=-1, window_bits=zlib.MAX_WBITS):
    try:
        compressor = zlib.compressobj(compression_level, zlib.DEFLATED, window_bits)
        compressed_data = compressor.compress(input_data) + compressor.flush(zlib.Z_FINISH)
        return compressed_data, None
    except zlib.error as e:
        return None, e


def decompress_data(compressed_data):
    try:
        decompressed_data = zlib.decompress(compressed_data, wbits=zlib.MAX_WBITS)
        return decompressed_data, None
    except zlib.error as e:
        return None, e


def convert_tif_to_pdf(input_file_path, output_file_path):
    # Create the output directory if it doesn't exist
    os.makedirs(OUTPUT_CONVERSION, exist_ok=True)
    # Use tiff2pdf for lossless compression
    subprocess.run([r'C:\Program Files (x86)\GnuWin32\bin\tiff2pdf.exe', '-o', output_file_path, input_file_path], check=True)
    # Read the compressed PDF content
    with open(output_file_path, 'rb') as pdf_file:
        pdf_content = pdf_file.read()
    # Compress the PDF content using zlib
    compressed_pdf, compress_error = compress_data(pdf_content)
    if compress_error:
        print(f"Compression error: {compress_error}")
    else:
        print(f"Compressed data size: {len(compressed_pdf)} bytes")
        # Save the compressed PDF content
        compressed_output_path = os.path.splitext(output_file_path)[0] + '_compressed.pdf'
        with open(compressed_output_path, 'wb') as compressed_file:
            compressed_file.write(compressed_pdf)
        print(f"Saved compressed PDF to: {compressed_output_path}")
        # Decompress the PDF content
        decompressed_pdf, decompress_error = decompress_data(compressed_pdf)
        if decompress_error:
            print(f"Decompression error: {decompress_error}")
        else:
            print(f"Decompressed data size: {len(decompressed_pdf)} bytes")
            print(f"Decompression successful. Match: {decompressed_pdf == pdf_content}")
            # Save the decompressed PDF content
            decompressed_output_path = os.path.splitext(compressed_output_path)[0] + '_decompressed.pdf'
            with open(decompressed_output_path, 'wb') as decompressed_file:
                decompressed_file.write(decompressed_pdf)
            print(f"Saved decompressed PDF to: {decompressed_output_path}")


if __name__ == "__main__":
    # Load DataFrame from Excel using config values
    df = pd.read_excel(INPUT_DATAFRAME, sheet_name=SHEET_NAME)
    # Ensure the specified column exists in the DataFrame
    if INPUT_COLUMN not in df.columns:
        print(f"Error: Column '{INPUT_COLUMN}' not found in DataFrame.")
    else:
        # Iterate through each row and convert TIF to PDF
        for index, row in df.iterrows():
            file_path = row[INPUT_COLUMN]
            # Check if the file exists
            if os.path.exists(file_path):
                # Assume the PDF will be saved in the OUTPUT_CONVERSION directory.
                # splitext handles both .tif and .tiff; chained replace() calls
                # would turn "name.tiff" into "name.pdff".
                pdf_name = os.path.splitext(os.path.basename(file_path))[0] + '.pdf'
                pdf_output_path = os.path.join(OUTPUT_CONVERSION, pdf_name)
                # Convert TIF to PDF with lossless compression and decompression
                convert_tif_to_pdf(file_path, pdf_output_path)
            else:
                print(f"File not found: {file_path}")
Observation:
The decompression process works without any problems.
The compressed files (the ones saved as *_compressed.pdf) decompress successfully, but they are reported as having format errors when I try to open them directly.
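To make the observation easier to reproduce, here is a minimal sketch that compares the leading bytes of the data before and after compression. It is an illustration only: `looks_like_pdf` is a hypothetical helper I made up for this post, and `pdf_content` is a small stand-in byte string rather than one of the real tiff2pdf outputs. The compressor settings are the same as in compress_data() above.

```python
import zlib

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with this signature


def looks_like_pdf(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes start with the PDF signature."""
    return data.startswith(PDF_MAGIC)


# Stand-in for the bytes written by tiff2pdf (not a real PDF document).
pdf_content = b"%PDF-1.4\n" + b"placeholder body " * 50 + b"%%EOF\n"

# Same settings as compress_data(): default level, zlib.MAX_WBITS.
compressor = zlib.compressobj(-1, zlib.DEFLATED, zlib.MAX_WBITS)
compressed = compressor.compress(pdf_content) + compressor.flush(zlib.Z_FINISH)

print("original starts with %PDF-:    ", looks_like_pdf(pdf_content))
print("compressed starts with %PDF-:  ", looks_like_pdf(compressed))
print("first two compressed bytes:    ", compressed[:2].hex())
print("decompressed starts with %PDF-:", looks_like_pdf(zlib.decompress(compressed)))
```

With these settings the compressed bytes begin with the zlib stream header instead of the PDF signature, so this check may help show whether the format errors come from the file content itself rather than from the compression round trip.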
Terminal Output Example:
(.venv) PS C:\Users\test> python "C:\Users\test\Tiff_Pdf_Conversion\tiff_to_pdf_test copy.py"
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
TIFFReadDirectory: Warning, P:\Sample TIFF\00000001_NEWENVE.tif: unknown field with tag 33536 (0x8300) encountered.
Compressed data size: 5079 bytes
Saved compressed PDF to: S:\TiffConversion\00000001_NEWENVE_compressed.pdf
Decompressed data size: 8017 bytes
Decompression successful. Match: True
Saved decompressed PDF to: S:\TiffConversion\00000001_NEWENVE_compressed_decompressed.pdf
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, P:\Sample TIFF\dwsample-tiff-1920.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
Compressed data size: 4952510 bytes
Saved compressed PDF to: S:\TiffConversion\dwsample-tiff-1920_compressed.pdf
Decompressed data size: 7377213 bytes
Decompression successful. Match: True
Saved decompressed PDF to: S:\TiffConversion\dwsample-tiff-1920_compressed_decompressed.pdf
Additional Information:
The PDF files are initially generated through a conversion process from TIFF format, and the decompressed files match the original content.
I've verified that the issue is not related to file naming or encoding.
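The kind of check I mean can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the exact script I ran: the file name `sample_compressed.pdf` and the payload are made up, and a temporary directory stands in for the real output folder. It writes the compressed bytes in binary mode, reads them back, and confirms the decompressed result is byte-for-byte identical.

```python
import os
import tempfile
import zlib

# Stand-in payload containing every possible byte value, repeated.
original = bytes(range(256)) * 64

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, mirroring the *_compressed.pdf naming above.
    path = os.path.join(tmp_dir, "sample_compressed.pdf")

    # Binary mode ("wb"/"rb") avoids any newline or text-encoding translation.
    with open(path, "wb") as f:
        f.write(zlib.compress(original))
    with open(path, "rb") as f:
        restored = zlib.decompress(f.read())

round_trip_ok = restored == original
print(f"Round trip through disk matches: {round_trip_ok}")
```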
Question:
Are there any known considerations or limitations with ZLIB when compressing certain types of data or file formats, particularly when the compressed data is a PDF file?
I appreciate any insights or suggestions on how to troubleshoot and resolve this issue. Thank you for your assistance!