masneyb / gftp

gFTP is a free multithreaded file transfer client for *NIX based machines. 56 language translations available.
http://www.gftp.org
MIT License

Ascii vs Binary #74

Open wdlkmpx opened 4 years ago

wdlkmpx commented 4 years ago

https://www.jscape.com/blog/ftp-binary-and-ascii-transfer-types-and-the-case-of-corrupt-files

When I'm using Windows, I notice Notepad can't correctly display text files created on a Linux distro. https://www.sciencehq.com/infos/info-why-cant-windows-notepad-display-linux-text-files-properly.html

But when using Linux I find that all (simple and complex) text editors can handle text files created on Windows. And Geany for example can change the line endings to CR / CR-LF and LF.

Not even Windows apps can handle files created with any other charsets than UTF-8 properly. I downloaded warez, I have to admit, and I opened files that displayed garbled text. I don't think FTP's ascii mode can fix that. It's a matter of opening them with the right apps.

Binary or Ascii? When it comes to Linux distros, the only reasonable download mode is: Binary.

In this day and age, some problems of the past no longer apply.

FTP and SSH servers usually run on *nix machines and files are mostly downloaded by *nix users. Files should not be modified by the (*nix) FTP client.

https://github.com/masneyb/gftp/#i-cant-transfer-certain-file-types-in-binary-mode-using-the-ftp-protocol

Possible actions:

Extreme actions:

wdlkmpx commented 4 years ago

gFTP's builtin text viewer can display text files created on Windows properly... GTK.

There's no builtin text editor.

If I change the text editor to Geany, the result is not good. Geany behaves like a web browser: it passes the file to the currently running instance and then exits. This is fatal for files hosted on remote hosts, a bad bug. Geany can identify charsets and can easily handle many complex cases.

But there's a very simple text editor: leafpad, a GTK2 app. GTK3 version: https://github.com/stevenhoneyman/l3afpad

Perhaps leafpad should be integrated into gftp as the text viewer/editor. In a subdirectory, enabled by default, disabled with --disable-text-editor

Or maybe the text viewer/editor value should default to leafpad. Making it clear that it's a dependency.

wdlkmpx commented 4 years ago

The nano and joe cli text editors can also handle CR-LF line endings.

Note: Some text files, like those using UTF-8 character encoding, may contain 
characters not supported by ASCII. For example, Japanese, Chinese or Korean 
characters. These text files should be transferred using binary mode. 

In GTK text editors you can choose the output charset.

And I get the feeling some interpreters don't care about line endings; Python, for one, does not seem to care.

It's becoming quite clear that the ASCII transfer mode is a thing of the distant past. It's an obsolete extension just like many parts of the X Window System Protocol.

swellhunter commented 3 years ago

Not quite obsolete, and still necessary for Servers that do not store files as ASCII. While FTP is not obsolete as a protocol, it should still support ASCII. Users may rarely use it if they want to preserve line endings.

It is more than just Windows/Mac/Unix. Which of the following looks cleaner to you?

curl -snB ftp://host:2100/hlq.PDS.JCL/SORT

wget -qO - ftp://host:2100/hlq.PDS.JCL/SORT|dd conv=ascii 2> /dev/null|gawk '{$1=$1}1' FPAT='.{80}' OFS='\n'
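
The second pipeline does two jobs by hand: it translates EBCDIC to ASCII and cuts the byte stream into 80-column records. A rough Python sketch of the same idea follows. It assumes the data set is stored in code page IBM037 (dd's conv=ascii uses its own table, so some punctuation may come out differently), and `split_fixed_records` is a hypothetical helper name, not part of any tool mentioned here.

```python
def split_fixed_records(raw: bytes, record_len: int = 80, codec: str = "cp037") -> list[str]:
    """Decode an EBCDIC byte stream and cut it into fixed-length records."""
    text = raw.decode(codec)
    # One EBCDIC byte maps to one character, so slicing the decoded text is safe.
    return [text[i:i + record_len] for i in range(0, len(text), record_len)]


# Two made-up 80-column card images, encoded the way a binary transfer would deliver them.
cards = ["//SORT    JOB (ACCT),'DEMO'".ljust(80), "//STEP1   EXEC PGM=SORT".ljust(80)]
raw = "".join(cards).encode("cp037")

records = split_fixed_records(raw)
print("\n".join(r.rstrip() for r in records))
```

With an ASCII transfer the server does both steps itself, which is why the curl one-liner needs no post-processing.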

wdlkmpx commented 3 years ago

> Not quite obsolete, and still necessary for Servers that do not store files as ASCII. While FTP is not obsolete as a protocol, it should still support ASCII. Users may rarely use it if they want to preserve line endings.

The ASCII mode should not be removed, that's correct. But it can be removed from the FTP Menu (main Window), to reduce its visibility. It's in the Preferences->FTP dialog.

The ASCII mode can corrupt text files, see: https://superuser.com/questions/1579299/clarification-on-ftp-ascii-vs-binary-transfer-mode-after-having-corrupted-fil

It may not play nice with languages other than English: https://stackoverflow.com/questions/32759621/is-there-any-reason-to-ftp-files-in-ascii-and-not-use-binary-by-default

It's an obscure feature nowadays, now that most apps can handle different line endings smoothly. It's certainly not a problem in the Linux world at least, and I think even Windows has fixed this issue.

The real problem here is text encoding conversion, and I think an FTP client is not qualified to deal with that. Text editors like Geany or Notepad++ let you convert text encodings (and line endings) in a professional way... just by opening the file you see the text encoding and line ending, you can adjust the source encoding to fix corrupted text, what you see is what you get.

wdlkmpx commented 2 years ago

The problem here is that people might only want to download a certain file in ASCII mode and then forget about it. Some people may forget to change the transfer mode back to binary, and stuff will happen in the next downloads and sessions. ASCII mode should not be permanent: it should be specified for a download or a batch of downloads and automatically deactivated after that.
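
A minimal sketch of that "ASCII for one batch only" behaviour. `TransferSettings` and its methods are hypothetical names used for illustration; nothing here reflects how gFTP stores its settings.

```python
from dataclasses import dataclass


@dataclass
class TransferSettings:
    """Hypothetical per-session settings; binary is the resting state."""
    mode: str = "binary"
    _ascii_batches_left: int = 0

    def request_ascii(self, batches: int = 1) -> None:
        # ASCII applies only to the next `batches` batches of transfers.
        self.mode = "ascii"
        self._ascii_batches_left = batches

    def batch_finished(self) -> None:
        # Called once a queued batch completes; falls back to binary.
        if self.mode == "ascii":
            self._ascii_batches_left -= 1
            if self._ascii_batches_left <= 0:
                self.mode = "binary"


settings = TransferSettings()
settings.request_ascii()
mode_during_batch = settings.mode   # "ascii" while the batch runs
settings.batch_finished()
mode_after_batch = settings.mode    # back to "binary"
```

The point of the design is that forgetting to switch back is no longer possible: the reset is tied to the end of the batch, not to a user action.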

It's a different situation if you're using a CLI tool that is not interactive. You always have to specify the right params.

I was reading this the other day, and it seems to me that back in the day FTP was dominant or something, nowadays it's just a protocol some people want to die, but it's quite useful for LANs and public servers, and with TLS encryption it is s.e.c.u.r.e https://mywiki.wooledge.org/FtpMustDie#Yes.2C_Let.27s_Mangle_The_Data_By_Default.21

SSH transfers are slow as hell, so basically the best approach for speed is FTP with encrypted control connection and unencrypted file transfers... to be implemented as an option in gFTP..
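
For comparison, Python's ftplib already exposes that combination. This is only a sketch of the idea, not gFTP code; the host and credentials are placeholders and `open_fast_ftps` is a hypothetical helper name.

```python
from ftplib import FTP_TLS


def open_fast_ftps(host: str, user: str, password: str) -> FTP_TLS:
    """Log in over TLS, then keep the data channel in clear text."""
    ftps = FTP_TLS(host)          # connects to the default FTP port
    ftps.login(user, password)    # secures the control connection before sending credentials
    ftps.prot_c()                 # data connections stay unencrypted (prot_p() would encrypt them)
    return ftps
```

Credentials and commands are protected this way, but file contents and listings still travel in the clear, so it only fits data that is not sensitive.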

swellhunter commented 2 years ago

It would be a shame to remove it completely. And it is not just the line endings, nowadays you are converting character widths as well. So single byte. Node.js now has a Uint8Array for this, and the Python ftplib does not lend itself to file write() that includes the EOL. My obvious use case is EBCDIC, which needs to be remapped. I agree users can go unix2dos or dos2unix as they see fit, and 'Code/NP++ handle anything (well VSCode doesn't have a working FTP extension yet), but if it is in the RFC we should at least try to support it.

wdlkmpx commented 2 years ago

Of course I will not remove it, I want to reduce its visibility to the minimum, and change the tooltip to a warning (avoid like the plague if you're not aware of the possible side effects). I personally will never test that feature, who knows the state of the feature. Testing and feedback is somewhat missing

The option lives in three places: the main window, the FTP options tab in the preferences dialog, and the config file.

Only the option in the preferences dialog should remain: gFTP is a multiprotocol client, and other protocols don't have an ASCII mode.

Nowadays people mostly use their own servers and usually don't need the line ending conversion, because all Linux apps handle different line endings gracefully. And well, it's not like there are many public FTP servers running on Windows, so it's a UNIX thing, basically.

wdlkmpx commented 2 years ago

I've been testing random public ftp servers from around the world and sometimes I see issues related to text encoding or something. Directories with Russian characters can't be opened

I always use binary mode, some servers choose ASCII to transfer file lists

 <<<
Error converting string ' Para sugerencias, consultas o información adicional nuestros correos 
' from character set (null) to character set UTF-8: Invalid byte sequence in conversion input
Error converting string ' electrónicos son:
' from character set (null) to character set UTF-8: Invalid byte sequence in conversion input
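
A sketch of one way a client could avoid this error: try UTF-8 first and fall back to a legacy charset. The assumption that this server sent ISO-8859-1 is a guess based on the Spanish text, and `decode_listing_line` is a hypothetical helper, not a gFTP function.

```python
def decode_listing_line(raw: bytes, candidates=("utf-8", "iso-8859-1")) -> tuple[str, str]:
    """Return (text, codec) using the first candidate that decodes cleanly."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the line readable rather than failing outright.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# The banner text from the log, as a Latin-1 server would send it.
raw_line = " electrónicos son:".encode("iso-8859-1")
text, codec = decode_listing_line(raw_line)
print(codec, repr(text))
```

ISO-8859-1 accepts any byte sequence, so it must come last in the candidate list; it will happily mis-decode Cyrillic or CJK text without raising an error.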

The gFTP code itself has a TODO, something to fix regarding the ascii mode.

Conversion happens on the fly, I guess, and that's wrong: the bytes written to disk no longer match the bytes sent, so unexpected things happen and it's not possible to resume transfers.

The only reasonable course of action is: download the file, test if it's a text file, test if it has a different line ending than expected, and only then perform the line ending conversion.

So when it comes to file transfers, the only reasonable mode is Binary, the ASCII conversion should happen after downloading the file

swellhunter commented 2 years ago

That seems sound if the client side knows the server encoding. Fine if it is IBM037, not so fine if it is IBM500. The server side knows how to convert to ASCII. In this case even if we transmit down as binary it must still use line endings to terminate logical records. So you get properly indented and formatted characters, just not the correct ones. Our issue on the client side would just be limited to which line ending to use? The mess in python where it is stripped and added back is not a good example.
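
The IBM037 versus IBM500 point can be shown with Python's built-in `cp037` and `cp500` codecs. The sample line is made up; the sketch only illustrates that a wrong guess keeps the layout but swaps some punctuation.

```python
line = "IF (A[1] != 0) THEN"

as_037 = line.encode("cp037")   # what an IBM037 host would store
as_500 = line.encode("cp500")   # what an IBM500 host would store

# Letters, digits and spaces share code points, so layout survives a wrong guess...
same_letters = "SORT FIELDS".encode("cp037") == "SORT FIELDS".encode("cp500")

# ...but some punctuation does not, so a wrong guess swaps characters.
right_guess = as_500.decode("cp500")
wrong_guess = as_500.decode("cp037")
print(right_guess)
print(wrong_guess)
```

A client that converts after a binary download has to be told which code page the host uses; with an ASCII transfer the server applies the table it already knows.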

wdlkmpx commented 2 years ago

I wonder how an app can detect a text file, maybe by checking that it doesn't contain some specific escape codes? A CR or \r character can be in any binary file, executable, or whatever; if it gets stripped out, the resulting file becomes corrupted.
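
One common heuristic, sketched here under stated assumptions: treat a sample as text if it has no NUL bytes and only a small share of other control characters. The 5% threshold and the name `looks_like_text` are arbitrary choices for illustration, and UTF-16 text would be misclassified as binary.

```python
def looks_like_text(sample: bytes) -> bool:
    """Heuristic: no NUL bytes and few other control characters."""
    if not sample:
        return True
    if b"\x00" in sample:
        return False
    allowed = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}   # tab, LF, FF, CR, ESC
    control = sum(1 for b in sample if b < 0x20 and b not in allowed)
    return control / len(sample) < 0.05


text_sample = b"line one\r\nline two\r\n"
# An ELF-like header that happens to contain CR LF, as binary data can.
binary_sample = b"\x7fELF\x02\x01\x01\x00" + b"\r\n" + bytes(range(16))

print(looks_like_text(text_sample), looks_like_text(binary_sample))
```

Both samples contain CR LF, which is why the presence of \r alone says nothing about whether conversion is safe.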

And how do the client and server detect the operating system of each other. huh? Many Windows servers send UNIX as the supported file list. And the client? Well, it doesn't have to provide any info

In the config file there is a way to specify the ascii transfer for specific file extensions, something like this but without a GUI http://www.coreftp.com/docs/web1/Ascii_vs_Binary_transfers.htm
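
The extension-based approach boils down to a lookup like the one below. The extension list and the function name are hypothetical, and this is not gFTP's actual config syntax.

```python
from pathlib import PurePosixPath

# Hypothetical extension list; a real client would load this from the user's settings.
ASCII_EXTENSIONS = {".txt", ".html", ".pl", ".sh"}


def transfer_mode_for(filename: str, ascii_extensions=ASCII_EXTENSIONS) -> str:
    """Pick "ascii" only for listed extensions; everything else is binary."""
    suffix = PurePosixPath(filename).suffix.lower()
    return "ascii" if suffix in ascii_extensions else "binary"


print(transfer_mode_for("README.TXT"), transfer_mode_for("backup.tar.gz"))
```

The weakness is that an extension is only a name: a file called notes.txt can still hold binary data.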

It's clear to me that gFTP must always request a Binary transfer. The ASCII mode should be just a hint to post-process the downloaded file: detect the forbidden line ending and change it if necessary, or just call dos2unix.

If the specs say you should jump from the 2nd floor, you can do it, but you can also use a stairway, disrespecting the specs.

wdlkmpx commented 2 years ago

I remember downloading a text file from a Russian ftp server and the text was garbled, I recognize Cyrillic characters when I see them. Geany was unable to display the file correctly

I'll try to find the server and file again and download it in ASCII mode. If it's readable then the ASCII mode might make sense (it never makes sense for CJK chars), otherwise it's unnecessarily problematic.

wdlkmpx commented 2 years ago

So it seems that the server doesn't perform conversions or something, and the client must remove or add \r

This is something that really should happen after downloading the file, maybe read the first 4kb and detect if the file has the proper line endings otherwise perform the conversion

That way gFTP will no longer be a source of grief and sorrow to unsuspecting users who choose the ASCII mode and download both binary data and text files.
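
A minimal sketch of that post-download step, assuming the file was fetched in binary mode and sits on local disk. `dominant_eol` and `convert_after_download` are hypothetical helpers; the 4 KB probe comes from the suggestion above, and the whole file is read into memory for simplicity.

```python
import os
import tempfile


def dominant_eol(sample: bytes) -> bytes:
    """Return the most common line ending in the sample, or b"" if none."""
    crlf = sample.count(b"\r\n")
    lf = sample.count(b"\n") - crlf
    cr = sample.count(b"\r") - crlf
    counts = {b"\r\n": crlf, b"\n": lf, b"\r": cr}
    best = max(counts, key=counts.get)
    return best if counts[best] else b""


def convert_after_download(path: str, wanted: bytes = b"\n", probe: int = 4096) -> bool:
    """Rewrite the file with `wanted` line endings; return True if it changed."""
    with open(path, "rb") as fh:
        head = fh.read(probe)
        if b"\x00" in head or dominant_eol(head) in (b"", wanted):
            return False                      # binary-looking or already fine
        data = head + fh.read()
    # Collapse every style to LF first, then expand to the wanted ending.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(path, "wb") as fh:
        fh.write(unified.replace(b"\n", wanted))
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "wb") as fh:
        fh.write(b"one\r\ntwo\r\n")
    changed = convert_after_download(path)
    with open(path, "rb") as fh:
        result = fh.read()
```

Because the conversion runs on a complete local file, the transfer itself stays byte-exact and can still be resumed.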

They killed Xorg instead of fixing and simplifying the protocol or the binary itself, but the FTP protocol can live on, the clients and servers can be adapted to fix what doesn't make sense

mckenzm commented 2 years ago

That's because the server cannot know what the client considers an EOL. DOS is CRLF and Unix (Android, Mac, Linux) is just LF.
So it is a localisation on the client side. It might also just be CR. But it generally defaults on the server.

mckenzm commented 2 years ago

NB. a z/OS or MVS server does not know a file is text, but will translate it if you use ASCII, otherwise it passes EBCDIC as binary, record by record (inserts EOL) if it is "blocked" data. If it is unblocked data it just comes as is.

wdlkmpx commented 2 years ago

I wonder if those exotic servers can be compiled on Linux? gFTP can handle all sorts of ftp servers but so far I've found only servers that identify as UNIX or Windows_NT (DOS).. file listings (not operating systems). 2 out of 7

   FTP_DIRTYPE_UNIX   = 1,
   FTP_DIRTYPE_EPLF   = 2,
   FTP_DIRTYPE_CRAY   = 3,
   FTP_DIRTYPE_NOVELL = 4,
   FTP_DIRTYPE_DOS    = 5,
   FTP_DIRTYPE_VMS    = 6,
   FTP_DIRTYPE_MVS    = 7,

I think the FTP protocol should have been updated in 1999 to deprecate the ASCII mode and make it optional, suggesting something like the implementation I'm proposing which has nothing to do with the FTP protocol and can work with any other protocol

When they saw the need for IPv6 support it was probably evident that UTF-8 or something other than ASCII will take over the world

Talking about file listings, one day I was about to make gFTP recognize Windows_NT as a valid keyword to switch to a DOS file listing, when I found a Windows server that reported Windows_NT but the file listing was UNIX. Some servers may also allow you to change the response to the SYST command, and that may break gFTP.
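
Since the SYST reply can't be trusted, a client can look at the listing itself and use SYST only as a fallback. The sketch below covers just the UNIX and DOS styles; the regular expressions are simplified assumptions about what those listings look like, and `sniff_listing_style` is a hypothetical name, not gFTP's parser.

```python
import re

# Simplified patterns for the two listing styles mentioned above; real servers vary.
UNIX_LINE = re.compile(r"^[-dlbcps][-rwxsStT]{9}\s")
DOS_LINE = re.compile(r"^\d{2}-\d{2}-\d{2,4}\s+\d{1,2}:\d{2}\s*(AM|PM)?\s+(<DIR>|\d+)\s", re.I)


def sniff_listing_style(lines: list[str], syst_hint: str = "") -> str:
    """Classify a listing by its content, using SYST only as a fallback."""
    for line in lines:
        if UNIX_LINE.match(line):
            return "unix"
        if DOS_LINE.match(line):
            return "dos"
    # Empty or unrecognised listing: fall back to whatever SYST claimed.
    return "dos" if "windows" in syst_hint.lower() else "unix"


unix_listing = ["total 12", "drwxr-xr-x   2 ftp      ftp          4096 Jan 01  2020 pub"]
dos_listing = ["01-01-20  10:30AM       <DIR>          pub"]

print(sniff_listing_style(unix_listing, syst_hint="Windows_NT"))
print(sniff_listing_style(dos_listing, syst_hint="UNIX Type: L8"))
```

The first call is the case described above: a server that reports Windows_NT but sends a UNIX-style listing.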

mckenzm commented 2 years ago

Possibly, but conventions we have had for 30 or 40 years should be preserved. We end up with maintainers breaking Wi-Fi rather than hackers. The directory types above are another issue altogether. MVS is not hierarchical but qualified. VMS has multiple file versions. How on earth a developer today is going to test all these without a harness or an emulator is a worry. Anyway, it is the server side that does the translation to ASCII in the case of z/OS or MVS; we just need to tell it to. Not hard.