kiwix / container-images

Replace MirrorBrain by MirrorCache or Mirrorbits #239

Open kelson42 opened 1 year ago

kelson42 commented 1 year ago

Mirrorbrain is deprecated and there is a replacement: http://www.mirrorcache.org/. We should probably migrate our architecture.

rgaudin commented 1 year ago

This is becoming more important now that one of our mirrors (https://mirror.accum.se/mirror/kiwix.org/) hosts our files on multiple servers (it is itself a mirror frontend) and relies on redirections, which mirrorbrain does not support.

I don't know whether mirrorcache supports that, but I know mb is not worth it.

In the meantime, I've duplicated the mirror entry so that we point independently to the two offloaders I've seen files on. This wastes a lot of requests in the scanMirror step but at least we can use the mirror…
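To illustrate the redirection problem described above, here is a minimal sketch of a scanner that follows a frontend's redirects hop by hop until it reaches the server that actually holds the file. It is not how mirrorbrain, MirrorCache or mirrorbits work; `resolve_final_url`, the `head` callable and its `(status, location)` return shape are all assumptions made for this example.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical probe: takes a URL, returns (status code, Location header or None).
HeadFn = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_final_url(url: str, head: HeadFn, max_hops: int = 5) -> Optional[str]:
    """Follow redirects one hop at a time.

    Returns the URL that finally answers 200, or None if the file is missing,
    the chain is broken, or more than `max_hops` redirects are needed.
    """
    for _ in range(max_hops + 1):
        status, location = head(url)
        if status == 200:
            return url
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return None
    return None
```

Keeping the HTTP call behind the `head` parameter means the redirect logic can be exercised without touching a real mirror; a real scanner would plug in an actual HEAD request there.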

rgaudin commented 1 year ago

See https://github.com/etix/mirrorbits as well

kelson42 commented 1 year ago

I tested MirrorCache a long time ago and, without remembering the details, it was really too short on features.

Mirrorbits seems more mature and probably deserves a try.

Here are the features we like or rely on:

@benoit74 @rgaudin Do you see other features which are important to us?

What now needs to be decided is when and how we move forward with this Mirrorbits POC.

benoit74 commented 1 year ago

I have very little experience with this part of the stack.

One thing we currently struggle with is the scanning of mirrors to refresh the status of individual assets. This process currently has to run one mirror at a time; running it in parallel is not possible (at least we failed to). The new solution must be able to run this scan in parallel, otherwise it is not scalable. As the number of mirrors grows, the time to scan all of them grows as well, and our refresh period is getting longer and longer.

The refresh period is already pretty high, at least 2 hours: https://kiwixorg.grafana.net/d/bb0f0990-04c5-4314-8afc-6185ac49c668/mirrorbrain?orgId=1&from=1695625815425&to=1696230615425
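As an illustration of the parallel-scan requirement above, the sketch below scans several mirrors concurrently while never running more than one scan per mirror. It is only a sketch under assumptions: `scan_all` and the `scan_one` callable (which would list the files a given mirror holds) are hypothetical names, not part of any of the tools discussed here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Set


def scan_all(
    mirrors: Iterable[str],
    scan_one: Callable[[str], Set[str]],
    max_workers: int = 8,
) -> Dict[str, Set[str]]:
    """Scan mirrors in parallel, with at most one scan per mirror.

    `scan_one` is a hypothetical callable returning the set of files found
    on one mirror. A failed scan is recorded as an empty set here; a real
    tool would probably keep the previous state instead.
    """
    results: Dict[str, Set[str]] = {}
    # De-duplicate while preserving order so no mirror is scanned twice.
    unique = list(dict.fromkeys(mirrors))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_one, mirror): mirror for mirror in unique}
        for future in as_completed(futures):
            mirror = futures[future]
            try:
                results[mirror] = future.result()
            except Exception:
                results[mirror] = set()
    return results
```

With this shape, the total refresh time is bounded by the slowest mirrors rather than by the sum of all scans, which is the scalability property asked for above.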

rgaudin commented 1 year ago

We've decided that @benoit74 will assess mirrorbits with regard to our needs. What we want to know is:

benoit74 commented 1 year ago

This is my comparison chart so far.

❌ Not Supported, bad thing ✅ Supported ❓Unknown (meaning probably not)

Feature MirrorCache MirrorBits
Metalink (Metalink headers) ❓JSON file mentioned, but not compatible with aria obviously
Bittorrent files
Magnet links
Mirror mgmt via ftp/http/rsync HTTP only❓ (no access to file) FTP and RSYNC only
Prioritization of mirrors
Auto choice of mirrors based on client geo location Geo only Geo + AS number + custom rules
Multiple hashes of files
Easy update of mirrors file database (at file/directory level) ✅ Mirrorcache has been designed to fix Mirrorbrain issues around parallel scans and scans taking ages to update the DB
Support of very large files >100GB ✅ Probably ✅ Probably
IPv6 support ✅
Documentation ❌ Too limited ❌ Too limited
Programming Language Perl GoLang
Database PostgreSQL Redis (with persistence)
Project liveness Project updated regularly; multiple PRs closed on a regular basis, including in the last days / weeks No update for at least one year, no code change since 2020, many very simple pending PRs without responses, still based on Golang 1.13 (Sept 2019)
Developers One main dev (Andrii Nikitin), working at openSUSE (project supporter); another person helped a bit in the past One single dev (etix), based in Paris, former Videolan Ops + developer, no more activity on Github / personal blog / twitter
Usage openSUSE only? Many websites mentioned, including some which have stopped using it

I'm really not convinced by those two solutions. I would probably prefer to stay with MirrorBrain for now until we find a better solution.

If we are forced to choose one now, I will try MirrorCache for:

The effort to implement MirrorCache, given all the missing features, is however probably significant (1 month?). My experience with Bittorrent / Magnet links is too limited to say anything very pertinent on that point. But since it is written in Perl, we would probably need to hire an external developer to do our stuff.

rgaudin commented 1 year ago

Thank you ; very useful 👍

In this case, we're probably better off keeping mirrorbrain until we're forced out. The main concern is obviously security. Our data is not completely safe as we mount the downloads folder read-write in order to write the mirrors.html file in the update-mb-db job. We should find a way around that.

More concerning would be the possibility of altering mirrorbrain's response to inject redirections to our users.

Should we close this ticket for now?

rgaudin commented 1 year ago

A couple of notes:

benoit74 commented 1 year ago

I had not noticed the last issue, about jbkempf keeping mirrorbits alive; that is indeed quite important information. And your other points are important as well. I'm really puzzled by all this information.

kelson42 commented 1 year ago

We should gather the problems/challenges we have with mb to be able to complete the comparison.

benoit74 commented 1 year ago

❌ Not Supported, bad thing ✅ Supported ❓Unknown (meaning probably not)

Feature MirrorCache MirrorBits MirrorBrain
HTML list of mirrors
Metalink (Metalink headers) ❓JSON file mentioned, but not compatible with aria obviously
Bittorrent files ✅ (but only torrent creation, not announced to tracker to validate torrent file - working only thanks to our "custom" tracker)
Magnet links ❌ (supported but buggy)
Mirror mgmt via ftp/http/rsync HTTP only❓ (no access to file) FTP and RSYNC only FTP, RSYNC and HTTP
Prioritization of mirrors
Auto choice of mirrors based on client geo location Geo only Geo + AS number + custom rules Geo + AS number
Multiple hashes of files ✅ (found in JSON file)
Easy update of mirrors file database (at file/directory level) ✅ Mirrorcache has been designed to fix Mirrorbrain issues around parallel scans and scans taking ages to update the DB ✅ mirrorbits supports parallel scan (only one scan per mirror at a time obviously). Both rsync and FTP are efficient : rsync works off the list of files returned by rsync (uses the rsync bin) and FTP recursively CWD and ls in all folders. ❌ (no parallel scan, lock issue)
Support of very large files >100GB ✅ Probably ✅ Probably
IPv6 support ✅
Documentation ❌ Too limited ❌ Too limited
Programming Language Perl GoLang Python (admin/management) + C (runtime HTTP)
Database PostgreSQL Redis (with persistence) PostgreSQL
Project liveness Project updated regularly; multiple PRs closed on a regular basis, including in the last days / weeks No update for at least one year, no code change since 2020, many very simple pending PRs without responses, still based on Golang 1.13 (Sept 2019), but some oversight by jbkempf (VLC) + some potential contributions from Jenkins team Dead
Developers One main dev (Andrii Nikitin), working at openSUSE (project supporter); another person helped a bit in the past One single dev (etix), based in Paris, former Videolan Ops + developer, no more activity on Github / personal blog / twitter No more
Usage openSUSE only? Many websites mentioned, including some which have stopped using it ?

benoit74 commented 1 year ago

Just updated with a MirrorBrain column, fixes to the Mirrorbits details, and a new line regarding the HTML home page.

kelson42 commented 1 year ago

@benoit74 Thank you very much for this analysis. Looking at the results, it tends to confirm my first opinion that the easiest path would be to continue (by fixing a few details) with Mirrorbrain (at least for the moment). @rgaudin What is your analysis and proposal?

rgaudin commented 1 year ago

As discussed with @benoit74, my opinion is to continue with MB until we're forced out. In that case, should the environment be the same, I support patching mirrorbits to add metalink support and hashes on the same paths (both very easy). As for BT, it's relatively easy as well, but whether it would be integrated upstream is another question.
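For reference, computing several hashes of one file does not require reading it once per algorithm, which matters for the very large files (>100GB) mentioned in the comparison. The sketch below is only an illustration in Python using the standard `hashlib` module; it is not mirrorbits code (mirrorbits is written in Go), and `multi_hash` is a name made up for this example.

```python
import hashlib
from typing import Dict, Iterable


def multi_hash(
    path: str,
    algorithms: Iterable[str] = ("md5", "sha1", "sha256"),
    chunk_size: int = 1 << 20,
) -> Dict[str, str]:
    """Compute several digests of a file in a single streaming pass.

    The file is read in fixed-size chunks, so memory use stays constant
    regardless of the file size.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

The resulting digests could then be written next to the file (for example as a `.sha256` companion) or embedded in a Metalink response; how exactly that would be wired into mirrorbits is left open here.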

kelson42 commented 1 year ago

OK, then I guess this ticket is implemented (at least for the short term). We will need to fork Mirrorbrain to fix the most urgent stuff.

rgaudin commented 1 year ago

I think we can just patch a couple of things in our image without adding the burden of a fork. This guy's patch is a line in a perl script.

lemeurherve commented 1 year ago

@benoit74 FWIW, and while https://github.com/etix/mirrorbits/issues/138 is in progress, we're using mirrorbits on the Jenkins infrastructure with our own docker image and helm chart, which might interest you:

benoit74 commented 1 year ago

Thanks a lot @lemeurherve for the pointers

rgaudin commented 1 month ago

@kelson42 please take another look

benoit74 commented 1 month ago

And have a look especially at https://github.com/etix/mirrorbits/issues/138 and https://github.com/etix/mirrorbits/issues/179, which show that maintenance of mirrorbits is "getting better".