haraka / Haraka

A fast, highly extensible, and event driven SMTP server
https://haraka.github.io
MIT License

Haraka and non-latin characters in DATA (non latin email encodings) #198

Closed fchenkov closed 12 years ago

fchenkov commented 12 years ago

When I send an email containing Russian characters to Haraka (which is configured to deliver received emails), every Russian character (in Windows-1251 encoding) is replaced with a strange character sequence.

I have tried to fix this problem myself, but I couldn't - it looks like the problem is in the top-level code which splits the incoming data into lines and converts it to strings (line_socket.js):

function setup_line_processor (self) {
    var current_data = '';
    self.process_data = function (data) {
        current_data += data; // <---- as I understand, the buffer is converted to a string here and it goes wrong
        var results;
        while (results = line_regexp.exec(current_data)) {
            var this_line = results[1];
            current_data = current_data.slice(this_line.length);
            self.emit('line', this_line);
        }
    };
}

As I understand it, there is no way to fix this via plugins, because they receive not raw data (buffers) but strings which have already been converted incorrectly. I believe that by default Haraka should transfer the email DATA exactly as it is, using buffers, without converting it to strings, because the conversion may lead to numerous bugs with different encodings. Only optionally should it parse DATA and provide access to the email body or its parts for analysis or modification. What do you think is the best way to solve this problem?

baudehlo commented 12 years ago

So the problem that causes this bug is that the += data is adding a Buffer object, which does a toString() without an encoding parameter. The default "encoding" is UTF-8.
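To make the failure mode concrete, here is a minimal sketch (illustrative only, with assumed Windows-1251 byte values, not Haraka code) of what happens when a Buffer holding non-UTF-8 bytes is appended to a string:

var win1251 = new Buffer([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]); // "Привет" encoded in Windows-1251
var current_data = '';
current_data += win1251;   // implicit win1251.toString() with the default UTF-8 encoding
console.log(current_data); // the original bytes are lost, typically shown as replacement characters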

There's a "binary" encoding, which would sort of "work", but it is possibly subject to removal. See this issue in node's bug tracker for more details: https://github.com/joyent/node/issues/3246

The alternative is to just move to using Buffers everywhere. That might be faster (performance-wise), but it's a lot of work.

A quick fix would be to change the line to: current_data += data.toString('binary');

I'd need to look more into what converting to Buffer objects everywhere would entail.

fchenkov commented 12 years ago

First, I was wrong, the problem was here (connection.js):

Connection.prototype.process_data = function (data) {
    if (this.disconnected) {
        this.logwarn("data after disconnect from " + this.remote_ip);
        return;
    }

    this.current_data += data;  // !!! buffer is converted to string here   
    this._process_data();
};

Second, using data.toString('binary') doesn't help on its own, because the resulting string is just a malformed UTF-8 string, and when we then send this string on to the SMTP server it goes wrong. Third, we don't have to rewrite EVERYTHING using buffers - all we need to rewrite is the part which processes the DATA section. I have hacked it this way:

1)

Connection.prototype.process_data = function (data) {
    if (this.disconnected) {
        this.logwarn("data after disconnect from " + this.remote_ip);
        return;
    }

    this.current_data += data.toString('binary');
    if (this.state === STATE_DATA) {
       this.transaction.add_raw_data(data); // save raw data as buffer, including end-of-data marker
    }
    this._process_data();
};
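(For reference, add_raw_data itself is not shown in this thread; a simplified, hypothetical sketch of such a helper - assuming it only appends each incoming chunk to transaction.raw_data on the Transaction from transaction.js - could look like the following. The real version may differ.)

// Hypothetical sketch of the add_raw_data helper referenced above, not the actual code.
// It appends each incoming chunk to a single raw_data Buffer using only core Buffer APIs.
Transaction.prototype.add_raw_data = function (data) {
    if (!this.raw_data) {
        this.raw_data = data;
        return;
    }
    var combined = new Buffer(this.raw_data.length + data.length);
    this.raw_data.copy(combined, 0);          // copy the bytes accumulated so far
    data.copy(combined, this.raw_data.length); // append the new chunk after them
    this.raw_data = combined;
};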

Next, we modify process_domain in outbound.js to write transaction.raw_data to the queue file in one go instead of writing the data line by line.

exports.process_domain = function (dom, recips, from, data_lines, hmails, notes, cb, raw_data) {

    var plugin = this;
    this.loginfo("Processing domain: " + dom);
    var fname = _fname();
    var tmp_path = path.join(queue_dir, '.' + fname);
    var ws = fs.createWriteStream(tmp_path, { flags: WRITE_EXCL });
    //console.log(raw_data);
    var data_pos = 0;
    var write_more = function () {
        var raw_data_without_end_of_data_marker = raw_data.slice(0, raw_data.length - 3); // strip .<cr><lf>; !!! assuming that raw_data does contain at least 3 bytes TODO
        ws.write(raw_data_without_end_of_data_marker); // save raw data to queue file
        ws.on('close', function () {
          var dest_path = path.join(queue_dir, fname);
          fs.rename(tmp_path, dest_path, function (err) {
              if (err) {
                   plugin.logerror("Unable to rename tmp file!: " + err);
                   cb(tmp_path, DENY, "Queue error");
               }
               else {
                   hmails.push(new HMailItem (fname, dest_path, notes));
                   cb(tmp_path, OK, "Queued!"); 
               }
          });           

        }); 
        ws.destroySoon();
    };     
    ws.on('error', function (err) {
        plugin.logerror("Unable to write queue file (" + fname + "): " + err);
        ws.destroy();
        cb(tmp_path, DENY, "Queueing failed");
    });

    ws.on('drain', write_more);
    plugin.build_todo(dom, recips, from, notes, ws, write_more);
}

What do you think about this fix? Should I commit it? I know it is not optimal because the DATA section is stored in two places - as raw data and as strings. What should we do about that?

fchenkov commented 12 years ago

The code needs cleanup, I know :)

fchenkov commented 12 years ago

Some problems caused by my fix:

1) send_trans_email in outbound.js adds some headers:

  // add in potentially missing headers
    if (!transaction.header.get_all('Message-Id').length) {
        this.loginfo("Adding missing Message-Id header");
        transaction.add_header('Message-Id', '<' + transaction.uuid + '@' + config.get('me') + '>');
    }
    if (!transaction.header.get_all('Date').length) {
        this.loginfo("Adding missing Date header");
        transaction.add_header('Date', date_to_str(new Date()));
    }

    transaction.add_header('Received', 'via haraka outbound.js at ' + date_to_str(new Date()));    

We lose these modifications if we transfer the DATA section as-is using transaction.raw_data. Are these headers necessary? A more general question: should we allow Haraka to transfer the DATA section as-is, controlled by some setting, or is that dangerous for some reason?

2) Plugins like spamassassin, dkim_sign and all other plugins which add headers will stop working, for the same reason.

Still, I believe Haraka should have an optional mode to transfer the DATA section as-is, without parsing - it is very useful when Haraka is used as a front end for an MTA like Postfix, which will check the DATA section anyway.

smfreegard commented 12 years ago

This might be a stupid question, but what happens if you simply comment out this line:

https://github.com/baudehlo/Haraka/blob/master/connection.js#L448

Given this issue, Haraka shouldn't really be advertising 8BITMIME; with that line commented out, the sending MTA/MUA should encode any 8-bit characters accordingly.
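For context, this is generic SMTP rather than anything Haraka-specific: 8BITMIME is just a keyword in the EHLO response, and a conforming client only sends raw 8-bit message bodies when it sees it advertised; otherwise it should fall back to a 7-bit content transfer encoding such as quoted-printable or base64. Roughly (hostnames are made up):

S: 220 mail.example.com ESMTP
C: EHLO client.example.com
S: 250-mail.example.com Hello
S: 250-SIZE 10240000
S: 250 8BITMIME                               <-- the advertisement in question
C: MAIL FROM:<user@example.com> BODY=8BITMIME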

smfreegard commented 12 years ago

If we do end up converting everything to buffers, then https://github.com/bnoordhuis/node-buffertools would seem to be pretty useful here.

fchenkov commented 12 years ago

As with this issue Haraka shouldn't really be advertising 8BITMIME and therefore with this commented out the sending MTA/MUA should be encoding any 8bit characters accordingly.

The problem is that our customers' SMTP clients mostly rely on 8BITMIME, and I think it is better for Haraka to support it anyway.

https://github.com/bnoordhuis/node-buffertools

Yes, I already use it in my fix in transaction.add_raw_data.

baudehlo commented 12 years ago

Yeah I don't think storing the buffer twice is going to fly - we're trying to get away from that.

Yes the headers are necessary.

I still don't understand why just using the binary encoding doesn't work. It should just keep it as bytes (albeit they have to be some sort of 8 bit character set, and \n has to be \n).

Though according to the mailing list the binary encoding is going to raise a warning in the next version of node, so we probably should work on a more complete solution.

fchenkov commented 12 years ago

I still don't understand why just using the binary encoding doesn't work. It should just keep it as bytes (albeit they have to be some sort of 8 bit character set, and \n has to be \n).

In JavaScript, strings are Unicode (each character is stored as a 16-bit code unit, i.e. 2 bytes). Here is how the 'binary' conversion works:

'binary' - A way of encoding raw binary data into strings by using only the first 8 bits of each character. 

For characters with codes 0 to 127 it works correctly, because for those characters the ASCII and Unicode code points are the same. But characters with codes above 127 must be interpreted according to their encoding: for example, the byte 128 means one symbol in Windows-1251 and a different symbol in Windows-1252. If we just take a byte with code, say, 128 and convert it to a string using the 'binary' encoding, we get a meaningless Unicode character (one of these: http://en.wikipedia.org/wiki/List_of_Unicode_characters#C1_Controls_and_Latin-1_Supplement), and from that point on we are operating on meaningless characters. However, I've just figured out a way to fix this: when writing the body lines to the file, we need to encode the strings back to a buffer using the 'binary' encoding, and then it should work. How do you like this fix?
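To make the round trip concrete, here is a minimal sketch with some assumed Windows-1251 bytes (illustrative only, not Haraka code): decoding with 'binary' and re-encoding with 'binary' preserves the original bytes, whereas re-encoding the same string with the default UTF-8 does not.

var original   = new Buffer([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]); // "Привет" in Windows-1251
var as_string  = original.toString('binary');      // each byte becomes a code point in the 0-255 range
var round_trip = new Buffer(as_string, 'binary');  // identical to the original bytes
var utf8_trip  = new Buffer(as_string);            // default UTF-8 re-encoding: a different, longer byte sequence
console.log(round_trip.length, utf8_trip.length);  // 6 vs 12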

baudehlo commented 12 years ago

And from this point we operate with senseless symbols.

Right, but we don't care about the symbols - it's just a binary blob. The parse_body stuff takes care of parsing it into the correct character set (though admittedly that may need fixing for this too). And the headers pretty much have to be 7-bit clean, but beyond that we don't care.

Frankly, though, I'm worried about the binary encoding going away, so we probably need to just keep everything as a Buffer. That'll speed Haraka up anyway, as it's one less malloc/copy.
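As a rough illustration of what that could look like, here is a sketch of a line processor that keeps everything as Buffers and never decodes to a string. It is illustrative only (the name BufferLineReader is made up, not Haraka's eventual code) and uses nothing but core Buffer APIs; the helpers in node-buffertools could make the scanning more convenient.

// Illustrative sketch only: split incoming chunks into newline-terminated lines
// without ever decoding to a string, so the original bytes are preserved.
function BufferLineReader(on_line) {
    this.pending = new Buffer(0);
    this.on_line = on_line;
}

BufferLineReader.prototype.feed = function (chunk) {
    // append the new chunk to whatever is left over from the previous call
    var combined = new Buffer(this.pending.length + chunk.length);
    this.pending.copy(combined, 0);
    chunk.copy(combined, this.pending.length);
    this.pending = combined;

    // scan for "\n" (0x0a) and emit each complete line as a Buffer slice
    var start = 0;
    for (var i = 0; i < this.pending.length; i++) {
        if (this.pending[i] === 0x0a) {
            this.on_line(this.pending.slice(start, i + 1));
            start = i + 1;
        }
    }
    // keep any trailing partial line for the next chunk
    this.pending = this.pending.slice(start);
};

Each emitted line is still the original bytes, so it could be written straight to a queue file or handed to a downstream MTA untouched.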

fchenkov commented 12 years ago

If anyone is interested, here is the fix which I've tested and which works:

1)

connection.js:
Connection.prototype.process_data = function (data) {
    ...
    this.current_data += data.toString('binary'); // first fix
    ...
};

2)

outbound.js:
exports.process_domain = function (dom, recips, from, data_lines, hmails, notes, cb) {
    ...
    if (ws.write( new Buffer(data_lines[data_pos++].replace(/^\./m, '..').replace(/\r?\n/g, "\r\n"), 'binary') )) { // second fix
        write_more();
    }

Should I commit this fix, close this issue, and create a new one about rewriting Haraka to work with buffers? Sorry for the possibly stupid question - this is my first experience with GitHub and with fixing open-source software.

baudehlo commented 12 years ago

Yes, you'll need to fork the project, commit and push to your repository, and submit a pull request.

And yes, we'll open a new issue regarding converting the internals to use Buffers.

baudehlo commented 12 years ago

I've now fixed this in 29d1dc3