giampaolo / pyftpdlib

Extremely fast and scalable Python FTP server library
MIT License

Implementing a fully virtual filesystem for pyftpdlib #464

Open DannyZB opened 6 years ago

DannyZB commented 6 years ago

Hello,

I am looking for a way to implement a filesystem virtually over pyftpdlib:

The files do not exist on the system and exist on other servers, and their information and location are stored in the database.

What is needed is something that can extend somehow the filesystem to read custom data, where I would implement the data read and return routines.

How can this be done?

As an example of what I mean, SabreDAV implements this for WebDAV over PHP: http://sabre.io/dav/virtual-filesystems/

Is there any way to achieve this?

giampaolo commented 6 years ago

Well, you basically have to override all the methods of the AbstractedFS class. "Write" methods like mkdir() or rename() are easy (you just do an INSERT or UPDATE). Same for "read" methods like getsize() or isdir(). The real challenge is listdir() (well, actually format_list() and format_mlsx()) and open(). open(), in particular, should return an object which read()s and/or write()s from/to the db a chunk at a time, which is probably gonna be a bit tricky if the "file" lives in the db. I know how to do it (and I did something similar for a couple of closed source projects) but I'm not sure how to help you exactly as it's a kinda broad subject.
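To make the "easy" half concrete, here is a minimal sketch of metadata methods backed by a database. Everything in it is an assumption for illustration: the `nodes` table, the `make_db()` and `split_path()` helpers, and the `DbMetadataFS` class, which is a standalone stand-in using the same method names as AbstractedFS (a real version would subclass `pyftpdlib.filesystems.AbstractedFS`). It uses an in-memory SQLite database only so it runs anywhere.

```python
import sqlite3


def make_db():
    """Create an in-memory catalogue; the schema is purely illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        " path TEXT PRIMARY KEY, parent TEXT, name TEXT,"
        " is_dir INTEGER, size INTEGER)"
    )
    conn.execute("INSERT INTO nodes VALUES ('/', NULL, '', 1, 0)")
    return conn


def split_path(path):
    """Return (parent, name) for an absolute virtual path such as '/a/b'."""
    parent, _, name = path.rstrip("/").rpartition("/")
    return parent or "/", name


class DbMetadataFS:
    """Hypothetical stand-in for an AbstractedFS subclass (metadata only)."""

    def __init__(self, conn):
        self.conn = conn

    def _row(self, path):
        cur = self.conn.execute(
            "SELECT is_dir, size FROM nodes WHERE path = ?", (path,))
        return cur.fetchone()

    def isdir(self, path):
        row = self._row(path)
        return bool(row and row[0])

    def getsize(self, path):
        row = self._row(path)
        if row is None:
            raise FileNotFoundError(path)
        return row[1]

    def listdir(self, path):
        cur = self.conn.execute(
            "SELECT name FROM nodes WHERE parent = ? ORDER BY name", (path,))
        return [name for (name,) in cur]

    def mkdir(self, path):
        parent, name = split_path(path)
        self.conn.execute(
            "INSERT INTO nodes VALUES (?, ?, ?, 1, 0)", (path, parent, name))

    def rename(self, src, dst):
        # Simplification: only the entry itself moves, not its children.
        parent, name = split_path(dst)
        self.conn.execute(
            "UPDATE nodes SET path = ?, parent = ?, name = ? WHERE path = ?",
            (dst, parent, name, src))
```

The sketch leaves out format_list(), format_mlsx() and open(), which are the harder parts mentioned above.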

DannyZB commented 6 years ago

Well thanks for the help so far (-:

As for reading in chunks: the cost can be reduced by buffering ahead, for example reading 64 MB in advance for large files.

Any pointers as to issues I will likely run into?

giampaolo commented 6 years ago

Buffering is risky as the file content gets into RAM, and if the file is too big or you have multiple downloads at a time you may run out of RAM pretty quickly. The "right way" to do it is to have open() return an object with a read() and a write() method which internally will do SELECT and UPDATE respectively. For read() you should also have an index to keep track of the file position (e.g. read a chunk of 4096 bytes, then move the index so that the subsequent call to read() will collect the next chunk, and so on). In practice it's not so complex after all if you know how to interact with the db.
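The read() side of that idea can be sketched as follows. This is only an illustration under stated assumptions: the `files` table, the `make_files_db()` helper and the `DbFileReader` class are hypothetical, and SQLite's `substr()` is used as the range query; with another database the query would differ.

```python
import sqlite3


def make_files_db(payloads):
    """Build an in-memory table of path -> BLOB (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")
    conn.executemany("INSERT INTO files VALUES (?, ?)", payloads.items())
    return conn


class DbFileReader:
    """Hypothetical read-only file object backed by a BLOB column."""

    def __init__(self, conn, path):
        self.conn = conn
        self.name = path      # regular file objects expose .name too
        self.closed = False
        self._pos = 0         # index of the next byte to return

    def read(self, size=4096):
        # SQLite's substr() is 1-based and counts bytes when given a BLOB.
        row = self.conn.execute(
            "SELECT substr(data, ?, ?) FROM files WHERE path = ?",
            (self._pos + 1, size, self.name)).fetchone()
        chunk = bytes(row[0]) if row and row[0] else b""
        self._pos += len(chunk)   # move the index past what was returned
        return chunk              # b"" signals end of file

    def close(self):
        self.closed = True
```

Only one chunk is held in Python memory per call, so the cost per transfer stays bounded regardless of file size.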

DannyZB commented 6 years ago

I think the RAM issue can easily be resolved with a total buffer limit. My servers have 64 GB of RAM, so it's not an issue. But in general, I have adaptive buffering in mind: read ahead more the bigger the file is and the more the user tends to read large chunks (averaged over the last 5 minutes), and stop buffering large chunks when approaching the limit.

Thanks!

DannyZB commented 6 years ago

One last question:

Do writes and reads sometimes start in the middle of a file? (I'm especially worried about writes.)

giampaolo commented 6 years ago

Yes, both in upload and download. In the FTP protocol the client can send a "REST" command to specify the file position, then use the STOR or RETR commands to upload or download a file starting at that position (basically it's a "resume transfer" feature). Internally file.seek(position) is called; as such, the object returned by your open() should also expose a seek(position) method in addition to read(size), write(chunk) and close().

Also, there is the "APPE" command, which on a standard filesystem basically open()s the file in append ("ab") mode (then data is sent from client to server).

Basically that's what you have to take into account.
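A rough sketch of a file object exposing that interface (seek, read, write, close) is shown here. The `VirtualFile` class and its dict-of-bytearrays backing store are hypothetical and stand in for whatever remote storage is used. The "ab" mode for APPE comes from the comment above; the "r+b" mode for a resumed STOR is an assumption about how the handler opens the file and should be checked against the pyftpdlib source.

```python
class VirtualFile:
    """Hypothetical file object over an in-memory store (dict of bytearrays)."""

    def __init__(self, store, name, mode):
        if "w" in mode:
            store[name] = bytearray()        # plain STOR: start from scratch
        elif name not in store:
            if "a" not in mode:
                raise FileNotFoundError(name)
            store[name] = bytearray()        # APPE creates a missing file
        self.name = name
        self.closed = False
        self._buf = store[name]
        # Append mode starts at the end; every other mode starts at 0.
        self._pos = len(self._buf) if "a" in mode else 0

    def seek(self, position):
        self._pos = position                 # called after a REST command

    def read(self, size=-1):
        end = len(self._buf) if size < 0 else self._pos + size
        chunk = bytes(self._buf[self._pos:end])
        self._pos += len(chunk)
        return chunk

    def write(self, chunk):
        if self._pos > len(self._buf):       # pad a gap with zero bytes
            self._buf.extend(bytes(self._pos - len(self._buf)))
        self._buf[self._pos:self._pos + len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

    def close(self):
        self.closed = True
```

With this shape, an interrupted upload followed by a resume amounts to opening the file again, calling seek() with the resume position, and continuing to write() from there.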

DannyZB commented 6 years ago

Thank you!

In reality, though, do you know roughly how often it's used for anything but upload resume? "(basically it's a "resume transfer" feature)."

In my system there are actions done on files once they are fully uploaded. How would you know a file is fully uploaded? Any indicators? Not talking 100% of the time, but I don't imagine editing files in the middle being a common FTP use-case -> am I wrong?

DannyZB commented 6 years ago

"(basically it's a "resume transfer" feature)."

I've read the latest FTP specification, and the only commands I've seen for writing data are STOR and APPE.

  1. APPE appends at the end of a file and STOR completely replaces. So there is no command for replacing the -middle- of the file? (replacing content partially -> not just appending)

  2. I've found the "mkstemp" method under AbstractedFS. It's not part of an FTP command, so when does the library request a temporary file? Does it store temporary files until a file is completely uploaded?

giampaolo commented 6 years ago

> there is no command for replacing the -middle- of the file? (replacing content partially -> not just appending)

There is not. In general, and AFAIK, this is not possible at filesystem level, not only FTP.

> I've found the "mkstemp" method under AbstractedFS. It's not part of an FTP command, so when does the library request a temporary file?

"mkstemp" is used by STOU command, which uploads a file with a unique name (allowing a prefix).

giampaolo commented 6 years ago

> In reality, though, do you know roughly how often it's used for anything but upload resume?

Not sure what you mean.

> Not talking 100% of the time, but I don't imagine editing files in the middle being a common FTP use-case -> am I wrong?

If a file is edited while being transferred, that usually means it's either deleted or some data is appended to it (unless you're on Windows, in which case the operation will error out with the file "being in use"). AFAIK there is no way to edit only a part of the file, and if there is I would consider it a corner case. In general, on RETR the server reads the file sequentially, from position 0 onward, a chunk at a time until EOF (end of file) is reached. What happens in between (the file is "appended" or "edited") is "not the server's problem", if you know what I mean. It will just keep reading/transmitting the file sequentially. The same goes if it's the client which sends data (STOR). Personally I would not worry about these kinds of corner cases unless you really have to for some reason, in which case you may, e.g., use file locking (https://stackoverflow.com/questions/489861/locking-a-file-in-python).

> how would you know a file is fully uploaded? any indicators?

Check out the on_file_received and on_incomplete_file_received callbacks:
http://pyftpdlib.readthedocs.io/en/latest/tutorial.html#event-callbacks
http://pyftpdlib.readthedocs.io/en/latest/api.html#pyftpdlib.handlers.FTPHandler.on_file_received
http://pyftpdlib.readthedocs.io/en/latest/api.html#pyftpdlib.handlers.FTPHandler.on_incomplete_file_received
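A minimal sketch of how those two callbacks might be wired to post-upload processing is given here. The `completed` and `incomplete` lists and the `PostUploadHandler` name are hypothetical; a real system would likely enqueue a job instead. The import fallback exists only so the snippet can be exercised without pyftpdlib installed.

```python
try:
    from pyftpdlib.handlers import FTPHandler
except ImportError:
    # Fallback so the sketch can be read and exercised without pyftpdlib.
    FTPHandler = object

completed = []    # hypothetical queue of fully uploaded paths
incomplete = []   # hypothetical record of interrupted uploads


class PostUploadHandler(FTPHandler):
    def on_file_received(self, file):
        # Called after an upload finished successfully; `file` is its path.
        completed.append(file)

    def on_incomplete_file_received(self, file):
        # Called when the transfer was interrupted before completion.
        incomplete.append(file)
```

Paths that land in `incomplete` may later be resumed by the client, so it is probably safer to trigger processing only from on_file_received.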

Hope this helps.

DannyZB commented 6 years ago

Last (probably) question: How do I force FTP clients to only log in with TLS? I've tried the flags and I get errors in FileZilla: "unexpected characters in TLS stream" or something like that.

And: Thanks a lot for the help!

I have followed your advice and implemented a layer over a cloud filesystem. To make the file opening smoother and better managed I used a localhost nginx proxy with locations for local and remote files. Nginx has very fine-grained control of how and when to read files, which overcame the queue/delay issues :-)

The code is far too complex to release as open source right now; it would take too much cleanup. I can let you have a look if you are interested in the methods used; maybe some can find their way into pyftpdlib itself :-) One thing that became an issue with the database is how fast and aggressive some clients are about scanning -everything-: we are talking 50 db queries per second per user! So I've implemented a caching layer over the DB to fetch all data at once and only check for updates every minute.
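The caching idea can be illustrated with a small time-based cache. This is not the code described above, just a sketch under assumptions: the `TTLCache` class, its `loader` callback (standing in for the database query) and the injectable `clock` are all hypothetical names.

```python
import time


class TTLCache:
    """Cache loader(key) results and refresh them after `ttl` seconds."""

    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self._loader = loader    # e.g. a function that queries the database
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expiry time, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: no database round trip
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Drop an entry early, e.g. after the server itself changed it.
        self._entries.pop(key, None)
```

Calling invalidate() from the write paths (upload, rename, delete) would keep a client from seeing a stale listing of its own changes for up to a minute.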

> there is no command for replacing the -middle- of the file? (replacing content partially -> not just appending)

> There is not. In general, and AFAIK, this is not possible at filesystem level, not only FTP.

Filesystems obviously allow editing the middle of files; otherwise databases would not be possible. But I get your point: it's not possible here.

giampaolo commented 6 years ago

> How do I force FTP clients to only log in with TLS?

TLS_FTPHandler.tls_control_required = True
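In context, that setting might be applied as sketched here. The `require_tls()` helper and the `StrictTLSHandler` name are hypothetical; `certfile`, `tls_control_required` and `tls_data_required` are attributes of TLS_FTPHandler, and the certificate file name is a placeholder. The helper takes the handler class as a parameter so it does not need pyftpdlib to be importable.

```python
def require_tls(handler_cls, certfile):
    """Return a subclass of handler_cls that refuses plain-text sessions."""

    class StrictTLSHandler(handler_cls):
        pass

    StrictTLSHandler.certfile = certfile
    # Reject login until the client has secured the control connection.
    StrictTLSHandler.tls_control_required = True
    # Also require the data connection (listings, transfers) to be secured.
    StrictTLSHandler.tls_data_required = True
    return StrictTLSHandler


# Intended use, assuming pyftpdlib and PyOpenSSL are installed:
#   from pyftpdlib.handlers import TLS_FTPHandler
#   handler = require_tls(TLS_FTPHandler, "keycert.pem")
#   handler.authorizer = ...  # your authorizer instance
```

On the client side, the connection then has to be configured for explicit FTP over TLS; a client that connects in plain mode will be refused at login.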

> I have followed your advice and implemented a layer over a cloud filesystem.

Cool!

> I can let you have a look if you are interested in the methods used; maybe some can find their way into pyftpdlib itself :-)

Sure, feel free to paste it here or mail me at g.rodola@gmail.com.