kiwix / container-images


Changing mirrorbrain's check_large_file threshold #257

Closed rgaudin closed 11 months ago

rgaudin commented 11 months ago

The MirrorBrain scanner has a $gig2 variable set to 2GiB. It is used as a threshold on rsync scans: files larger than it trigger a Range-Request HTTP download. If the request fails, a warning is printed.

Those requests are not necessary and create load on mirrors.

This changes the variable's value to 256GiB.
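For readers unfamiliar with the scanner, the behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not MirrorBrain's actual code (the scanner is written in Perl). The names `scan_entry`, `probe` and `DEFAULT_THRESHOLD` are hypothetical, and the HTTP Range request is replaced by an injected callable so the sketch runs without a network.

```python
from typing import Callable

GIB = 1 << 30
DEFAULT_THRESHOLD = 2 * GIB  # the 2GiB value described above


def scan_entry(path: str, size: int, probe: Callable[[str], bool],
               threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Return True if the file would be considered deliverable by the mirror.

    `probe` stands in for the Range-Request HTTP download (hypothetical hook);
    it is only called for files strictly larger than `threshold`.
    """
    if size <= threshold:
        # Small enough: the rsync listing alone is trusted, no HTTP request.
        return True
    if probe(path):
        return True
    # Illustrative warning text, not the scanner's exact message.
    print(f"warning: HTTP range check failed for {path}")
    return False
```

With the threshold raised to 256GiB, a call such as `scan_entry("big.zim", 90 * GIB, probe, threshold=256 * GIB)` returns without ever calling `probe`, which is the load reduction this change is after.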

benoit74 commented 11 months ago

I'm not sure this is done for a good reason.

The code/documentation at https://github.com/poeml/mirrorbrain/blob/76f2909e33004a7f5e0dd52b816881eb9fbd4246/tools/scanner.pl#L1396-L1398 explains why this double-check is done on files larger than 2GB.

If this double-check fails, the file is marked as not available on the mirror (at least it should).

If you search the logs in Grafana for "cannot be delivered via HTTP! Skipping", you will see there are plenty of occurrences.

What is weird is that when I try to download one or two of the files which are supposed to fail, the download actually works. So it looks like the double-check is broken.

In conclusion, I would suggest modifying the value so that the check is never done, since it seems to put load on servers (our scanner + the mirrors) and do more harm than good.

rgaudin commented 11 months ago

> In conclusion, I would suggest modifying the value so that the check is never done, since it seems to put load on servers (our scanner + the mirrors) and do more harm than good.

Well, disabling is not easily feasible, but if you think 256GiB is not enough, you can change the exponent, currently 38 (2^38 bytes = 256GiB). 39 would be 512GiB, 40 1TiB and 41 2TiB. I'll let you make the change with whatever value feels more appropriate.
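As a quick sanity check of those numbers, here is a small Python sketch that converts a power-of-two exponent into a human-readable size. It assumes the threshold is expressed as 2^N bytes, as the values above suggest. `human_size` is a helper name made up for this example.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def human_size(exponent: int) -> str:
    """Format 2**exponent bytes using binary (1024-based) units."""
    value = 1 << exponent
    unit = 0
    # Divide by 1024 until the value fits the current unit.
    while value >= 1024 and unit < len(UNITS) - 1:
        value //= 1024
        unit += 1
    return f"{value}{UNITS[unit]}"


for exp in (38, 39, 40, 41):
    print(exp, human_size(exp))
# prints:
# 38 256GiB
# 39 512GiB
# 40 1TiB
# 41 2TiB
```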

benoit74 commented 11 months ago

In fact I was thinking that we might produce files of up to 1TB, but that is for offspot cards (with many ZIMs), which are not served by the mirrors, not for a single ZIM or file. It is too early in the morning here, I probably need one more coffee. Your value is probably OK, let's keep it.

rgaudin commented 11 months ago

We do have a TB+ ZIM file (in dev, not synced), and increasing the threshold has no consequence, so maybe we should just set it to 2TiB once and forget about it. Hopefully it will last until we get rid of mb (if ever! 😵‍💫)

benoit74 commented 11 months ago

As discussed live, I will set it to 2^63 = 8 EiB so we never come back to this issue again (hopefully).