File size differs between Ubuntu and Windows

What did you do?

As part of a project I'm downloading a large number of images from the internet with the requests library and saving them with PIL (Pillow). To avoid re-downloading them every time I run this, I hard-coded their MD5 hashes and check beforehand whether they match.

What did you expect to happen?

I was expecting this to work independent of the machine and OS, since the MD5 is invariant to this.

What actually happened?

I developed on Ubuntu and everything works as expected. Today I tried it on Windows, but the MD5 hashes calculated on Ubuntu didn't match. I investigated a little further and found that the images differ slightly in the number of bytes on disk:

Ubuntu output:

6.1.0
148271 d446f0cacf6cca0ba9cc51fbdb128db6
1553685 e067de67d6b923c2c435f7f3f018b0e8
720430 c25cda9fbf4a0a3bd29369a4f8732c49
6218876 d36ed11d5225a7834ceb147259bdb11a

Windows output:

6.1.0
148382 039de88df359492b8b61733c2950ebdd
1553685 e067de67d6b923c2c435f7f3f018b0e8
719013 ca9f9d9956181151c2a4ed45a93c918e
6212286 01bbb41e9f07189fbe45e067ccffdb10

Could this be related to PIL?
If yes, is this expected behaviour?
If yes, what am I doing wrong, or better, what can I do to fix this?

What are your OS, Python and Pillow versions?

OS: Ubuntu 16.04 / Windows 10
Python: python3.5.2 / python3.7.X
Pillow: 6.1.0

I also tried this on https://repl.it/languages/Python3 which runs "Linux" and python3.7.4. This results in the same output as my Ubuntu machine. Unfortunately I don't have any other setups to test this further.

from io import BytesIO
from os import path
import hashlib
import requests
import PIL
from PIL import Image

print(PIL.__version__)

def print_stats(fpath):
    md5 = hashlib.md5()
    with open(fpath, 'rb') as f:
        stream = f.read()
    md5.update(stream)
    num_bytes = len(stream)
    print(num_bytes, md5.hexdigest())

urls = (
    "https://upload.wikimedia.org/wikipedia/commons/0/00/Tuebingen_Neckarfront.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/d/de/Vincent_van_Gogh_Starry_Night.jpg",
)
exts = (".jpg", ".png")

for url in urls:
    content = BytesIO(requests.get(url).content)
    for ext in exts:
        fpath = path.splitext(path.basename(url))[0] + ext
        Image.open(content).save(fpath)
        print_stats(fpath)
    content.close()

To give you a quick reply, there is another issue like this - #3833 - and as in that issue, I would suspect that you have different versions of one of the Pillow dependencies installed on the different machines.

To try and understand your problem better, is there a reason you need to read the images and create hashes of that data, instead of just reading the file contents and comparing those hashes?

I would suspect that you have different versions of one of the Pillow dependencies installed on the different machines.

I followed this advice to get the versions for Ubuntu.

libjpeg-3b10b538.so.9.3.0 => /usr/local/lib/python3.5/dist-packages/PIL/./.libs/libjpeg-3b10b538.so.9.3.0 (0x00007fc86404f000)
libopenjp2-b3d7668a.so.2.3.1 => /usr/local/lib/python3.5/dist-packages/PIL/./.libs/libopenjp2-b3d7668a.so.2.3.1 (0x00007fc863dd8000)
libz-a147dcb0.so.1.2.3 => /usr/local/lib/python3.5/dist-packages/PIL/./.libs/libz-a147dcb0.so.1.2.3 (0x00007fc863bc3000)
libtiff-8267adfe.so.5.4.0 => /usr/local/lib/python3.5/dist-packages/PIL/./.libs/libtiff-8267adfe.so.5.4.0 (0x00007fc863928000)

I didn't find a way to do the same on Windows. Is there one?

is there a reason you need to read the images and create hashes of that data, instead of just reading the file contents and comparing those hashes?

I'm not sure if I got your question right. What is the difference between "read the images" and "reading the file contents"? Ultimately I want the following functionality:

I have an image database containing, among other things, a URL and an MD5 hash for each entry. I have a Python script that does this:

1. Check if the image exists on disk. If yes, go to 2. Otherwise download the image.
2. Check if the MD5 matches the value stored in the database. If yes, go to the next entry. Otherwise download the image.
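
The two checks above could be sketched roughly as follows. This is only a minimal illustration, not the actual script: `md5_of_file`, `ensure_image` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return the raw bytes of the image (for example `lambda: requests.get(url).content`).

```python
import hashlib
from pathlib import Path


def md5_of_file(fpath, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file's raw bytes, read in chunks."""
    md5 = hashlib.md5()
    with open(fpath, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def ensure_image(fpath, expected_md5, fetch):
    """Download via fetch() unless a file with the expected MD5 already exists.

    fetch is a hypothetical zero-argument callable returning the image bytes.
    Returns True if a download happened, False if the file on disk was kept.
    """
    fpath = Path(fpath)
    # Check 1: the file exists. Check 2: its MD5 matches the database value.
    if fpath.is_file() and md5_of_file(fpath) == expected_md5:
        return False
    # Either check failed, so (re-)download and store the bytes unchanged.
    fpath.write_bytes(fetch())
    return True
```

Passing the download step in as a callable keeps the check itself independent of the network, so it can be tried out with fixed byte strings.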

Edit

I think I now know what you are getting at. I changed the download to

with open(fpath, "wb") as fh:
    fh.write(requests.get(url).content)

and this works on both my test platforms. I took the detour through PIL to be able to save all images as JPEG. The disk space I would save by this is not worth the hassle of fiddling with the system libraries. Thus, I'm closing this. Thanks for the swift support.
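
For anyone reading along, here is a small sketch of why writing the response body directly gives a stable hash: no encoder is involved, so the file on disk is byte-for-byte the payload that was received. `save_raw` is a hypothetical helper name, and `content` is assumed to be the downloaded bytes (for example `requests.get(url).content`).

```python
import hashlib
from pathlib import Path


def save_raw(content, fpath):
    """Write the downloaded bytes unchanged and return their MD5 hex digest."""
    # No image decoding or re-encoding happens here, so the result does not
    # depend on which libjpeg or zlib build is installed.
    Path(fpath).write_bytes(content)
    return hashlib.md5(content).hexdigest()
```

The digest returned here should equal the digest of the file read back from disk, whichever OS the script runs on.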

You might want to consider answering the former question about how to find the library versions on Windows if you know an answer to that. It could come in handy for future clueless users like me.
