openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0
81 stars 25 forks source link

Implement upload solution #38

Closed kelson42 closed 6 years ago

kelson42 commented 6 years ago

As the workers are thought to not run on the same machine as the dispatcher, all generated ZIM files need to be gathered somewhere. That is why we need to be able to upload the files from a worker machine to a dedicated place.

This upload process should match the following requirements:

kelson42 commented 6 years ago

I might have a simple basic idea is to create a docker image/container which simply handles the uploads (on the server side).

The technology of file transfer might be discussed, but basically it's a server which allows to upload a file which has to be able to do two things:

This could be quite easy to do that with the FTP protocoll with for example:

automactic commented 6 years ago

The problem with FTP (same with rsync) is the first time worker uploads, a fingerprint verification prompt will appear

kelson42 commented 6 years ago

@automactic FTP is user/pass based, nothing to do with any crypto proof or fingerprint. You just need the right user/pass.

automactic commented 6 years ago

Mmm... yes you don't need to verify key with FTP.

I don't know if WebDAV could be a good option.

kelson42 commented 6 years ago

@automactic I don't really care about the protocoll as long as it work. WebDAV might work but it's more risky than FTP and I'm not sure you will get such an easy solution like you might have for FTP with twisted or pyftpdlib. In addition they are for more less WebDAV clients than FPT clients.

automactic commented 6 years ago

Yes I will try to do this with pyftpdlib, looks promising. Since FTP does not encrypt password and everyone can see it. But if we use FTPS I believe the credentials should also be encrypted.

kelson42 commented 6 years ago

@automactic good... I would recommend to somehow use password with a short life time (maybe only one usage?). We can do that as the dispatcher distribute them and do the check.

automactic commented 6 years ago

It could also be a good idea to have the worker request a file transfer token from dispatcher and send this token to ftp server when uploading the file. Upon receiving the token, the ftp server verify it with dispatcher.

This way, the worker only need to transfer the password once (with dispatcher) during its lifetime. But user also cannot login to the ftp server with other tools (since users cannot see the token)

automactic commented 6 years ago

Due to some reasons I cannot connect to the ftp server once they are in the container, but it works fine outside the container. The error log is below

INFO:pyftpdlib:196.52.2.24:59014-[] FTP session opened (connect)
DEBUG:pyftpdlib:196.52.2.24:59014-[] -> 220 Welcome to Zimfarm Warehouse.
DEBUG:pyftpdlib:196.52.2.24:59014-[] <- USER admin
DEBUG:pyftpdlib:196.52.2.24:59014-[] -> 331 Username ok, send password.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- PASS ******
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 230 Hi, there!
INFO:pyftpdlib:196.52.2.24:59014-[admin] USER 'admin' logged in.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- FEAT
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 211 End FEAT.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- OPTS UTF8 ON
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 501 Invalid argument.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- SYST
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 215 UNIX Type: L8
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- PWD
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 257 "/" is the current directory.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- CWD /
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 250 "/" is the current directory.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- TYPE A
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 200 Type set to: ASCII.
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] <- PASV
DEBUG:pyftpdlib:196.52.2.24:59014-[admin] -> 227 Entering passive mode (172,17,0,3,200,212).
DEBUG:pyftpdlib:[debug] call: close() (<FTPHandler(id=140131051023048, addr='196.52.2.24:59014', user='admin')>)
DEBUG:pyftpdlib:[debug] call: close() (<pyftpdlib.handlers.PassiveDTP listening 172.17.0.3:0 at 0x7f72cd171710>)
INFO:pyftpdlib:196.52.2.24:59014-[admin] FTP session closed (disconnect).

Any ideas?

kelson42 commented 6 years ago

@automactic I fully support https://github.com/openzim/zimfarm/issues/38#issuecomment-348602315. Regarding the transfer problem, you probably do not support the FTP passiv mode properly. Look a bit how FTP passiv works (you need basically to open other ports).

kelson42 commented 6 years ago

Here is for example an example fix https://github.com/stilliard/docker-pure-ftpd/pull/9/commits/da4dee5a1412bea0f5a40a4768561297145ecdc5

automactic commented 6 years ago

I took a further look into the issue and read more about passive FTP. It looks like after receiving PASV, the server is returning 227 Entering passive mode (172,17,0,3,109,116). The problem is when worker is uploading files, the data transfer happens on 172.17.0.0.3:28020, which is internal to docker network. We need a way to have the FTP server return the correct address.

automactic commented 6 years ago

I need to set masquerade_address on the FTP server

kelson42 commented 6 years ago

@automactic OK, so does it work ?

automactic commented 6 years ago

Yes it does, but when you need to run ftp server you need to set external ip in environment

kelson42 commented 6 years ago

@ dispatcher/warehouse should be moved to /. Considering that the warehouse does not probably run on the same serveur and that in any case is not depending from the dispatcher, it should not be part of it IMO.

kelson42 commented 6 years ago

@automactic Could you also merge your code to master so we can close that ticket?

automactic commented 6 years ago

The plain FTP server approach discussed in this issue is already implemented.