openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Incorrect image displayed #284

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

Originally reported by @mazen-mardini in https://github.com/openzim/sotoki/issues/283#issuecomment-1489482005 we can see in this mathoverflow question that the image used in the bottom of the question is different from the one online at the time of the Dump (December 2022). Screenshots in #283

It is important to diagnose and understand what happened here

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

sam-masaki commented 1 year ago

I think this is caused by the way image files are named. File names are currently made from the Adler32 checksum of the image URL, calculated here, and this is liable to produce lots of collisions: the URLs are mostly of the form "https://i.stack.imgur.com/XXXXX.jpg", and Adler32 doesn't distribute checksums very evenly, especially with small inputs. In a quick test generating random stack imgur URLs, about 10 out of 1,000 collided, and almost 1,000 out of 10,000, which fits with how often I encounter obviously wrong images on larger offline SE sites.

I think it makes sense to switch to any common hash algorithm like MD5 or SHA1/2 since they're better suited to this kind of use and aren't much slower.

Even doing a one-line change and switching to zlib's crc32 would massively reduce the problem (down to about 40 collisions per million), but given the minimal performance cost I think the above solution is better.

Here's a quick demo of the issue:

import hashlib
import random
import string
import zlib
from datetime import datetime

letters = string.ascii_lowercase + string.ascii_uppercase + string.digits
sums = set()
urls = set()
collisions = 0

before = datetime.now()

for i in range(1_000_000):
    # generate a unique random imgur-style URL
    url = "https://i.stack.imgur.com/" + ''.join(random.choice(letters) for _ in range(5)) + ".jpg"
    while url in urls:
        url = "https://i.stack.imgur.com/" + ''.join(random.choice(letters) for _ in range(5)) + ".jpg"

    #m = hashlib.md5(url.encode("utf8"), usedforsecurity=False)
    #m = hashlib.sha1(url.encode("utf8"), usedforsecurity=False)
    #m = hashlib.sha256(url.encode("utf8"), usedforsecurity=False)

    #checksum = m.digest()
    #checksum = zlib.crc32(url.encode("utf8"))
    checksum = zlib.adler32(url.encode("utf8"))

    if checksum in sums:
        collisions += 1

    sums.add(checksum)
    urls.add(url)

after = datetime.now()

print(collisions)
print(after - before)

benoit74 commented 1 year ago

adler32 is definitely not a good idea.

I would suggest using md5; pulling in hashlib is not an issue, and the performance cost is most probably negligible compared to other operations performed by the scraper, like parsing tons of XML files.

@rgaudin @kelson42 : does it make sense for you as well, since you have more experience on this kind of topic?

@sam-masaki : thank you a lot for the investigation and the valuable quick demo (even if I would have been convinced without it, it is way better with the demo 😉 ); would you like to propose a PR (no pressure, just a proposition)?
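As a minimal sketch of what the md5-based naming could look like (the helper name `image_path` is hypothetical; the real scraper would compute this wherever the Adler32 call currently sits):

```python
import hashlib

def image_path(url: str) -> str:
    """Derive a stable file name from an image URL (hypothetical helper).

    The md5 hex digest is 32 chars; accidental collisions across a few
    hundred thousand URLs are practically impossible, unlike with Adler32.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

The same URL always maps to the same name, so re-runs stay deterministic, while distinct URLs get effectively unique names.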

sam-masaki commented 1 year ago

> @sam-masaki : thank you a lot for the investigation and the valuable quick demo ; would you like to propose a PR (no pressure, just a proposition)?

Thanks, I just wanted to be thorough since I'm not familiar with the codebase (and I wasn't familiar with adler32). I can put in a PR soon just not right now since it's late for me.

benoit74 commented 1 year ago

No hurry; the issue has been open since March and we will probably not run sotoki scrapers in the Zimfarm again right now, so we can wait for this PR to be merged. Thank you!

mazen-mardini commented 1 year ago

Moreover, one could concatenate the file size of the image with the checksum to further reduce collisions:

filename = f"{checksum}_{filesize}"

rgaudin commented 1 year ago

Thank you very much. It's not noted anywhere, but I believe adler32 was chosen both because it's fast (although that's not very relevant when hashing a URL) and because it's short. Path length matters in libzim at Creator stage because of its impact on RAM, especially with dozens of millions of entries.

We can give md5 a shot. We'd jump from 10 chars to 32 chars. We only have 190K images in stackoverflow at the moment; I wonder how much we're gonna get with this fix…
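If path length is the concern, the digest doesn't have to be full-width hex; as a hedged sketch, either a URL-safe base64 encoding of the md5 digest or blake2b with a truncated digest size keeps names short while still distributing far better than Adler32:

```python
import base64
import hashlib

url = "https://i.stack.imgur.com/abcde.jpg"

# Full md5 digest (16 bytes), base64url-encoded without padding:
# 22 chars instead of 32 hex chars.
short_md5 = (
    base64.urlsafe_b64encode(hashlib.md5(url.encode("utf-8")).digest())
    .rstrip(b"=")
    .decode("ascii")
)

# blake2b lets the digest be truncated at construction time:
# 8 bytes -> 16 hex chars, still 64 well-distributed bits
# versus Adler32's weak 32.
short_blake = hashlib.blake2b(url.encode("utf-8"), digest_size=8).hexdigest()

print(short_md5, len(short_md5))      # 22 chars
print(short_blake, len(short_blake))  # 16 chars
```

Either variant stays close to the current 10-char names while making collisions a non-issue.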

mazen-mardini commented 1 year ago

> Thank you very much. It's not noted anywhere, but I believe adler32 was chosen both because it's fast (although that's not very relevant when hashing a URL) and because it's short. Path length matters in libzim at Creator stage because of its impact on RAM, especially with dozens of millions of entries.
>
> We can give md5 a shot. We'd jump from 10 chars to 32 chars. We only have 190K images in stackoverflow at the moment; I wonder how much we're gonna get with this fix…

checksums = set()

for filename in filenames:
    checksum = create_hash(filename)
    if checksum in checksums:
        print("Collision detected! Please edit create_hash")
        exit()
    checksums.add(checksum)

print("create_hash is good!")

Something like this should help in developing a good "create_hash" function.

rgaudin commented 1 year ago

You're welcome to do it. You can short-circuit all actual processing, downloads and ZIM creation as long as you render the HTML so the Image downloader receives URLs. You could even maintain various lists and test several in one pass.
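Testing several candidates in one pass, as suggested, could look like this sketch (in a real run the URL list would come from the rendered HTML; here it is simulated with random imgur-style URLs, and the helper name `count_collisions` is hypothetical):

```python
import hashlib
import random
import string
import zlib

def count_collisions(urls, hashers):
    """Count, per algorithm, how many URLs map to an already-seen value."""
    seen = {name: set() for name in hashers}
    collisions = {name: 0 for name in hashers}
    for url in urls:
        data = url.encode("utf-8")
        for name, fn in hashers.items():
            value = fn(data)
            if value in seen[name]:
                collisions[name] += 1
            else:
                seen[name].add(value)
    return collisions

letters = string.ascii_letters + string.digits
urls = {
    "https://i.stack.imgur.com/" + "".join(random.choices(letters, k=5)) + ".jpg"
    for _ in range(100_000)
}

hashers = {
    "adler32": zlib.adler32,
    "crc32": zlib.crc32,
    "md5": lambda d: hashlib.md5(d).digest(),
}
print(count_collisions(urls, hashers))
```

One pass over the URLs evaluates every candidate on identical input, so the collision counts are directly comparable.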