Open wietse-postfix opened 8 months ago
Preliminary tests have shown that commit operations are expensive in SQLite. Performing commit operations only every hundred upsert operations gave a performance of over 4500 records per second in a single threaded daemon on an idle system and still over 1500 records per second on a system under load. The tests were done with a loop of 10.000 records.
As we do not need transactional safety for every single record, the expected load should not pose a problem.
We envision two tuneable parameters on the daemon side: a maximum number of records between commits, and a maximum time between commits.
These configuration parameters will not be exclusive but will act in combination. E.g. for a system that commits every 100 records and every five seconds, if 99 records have not yet been committed after a mail burst because no additional mail is received to cause that hundredth record, the data will still be saved to disk after at most five seconds when the timed commit kicks in.
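The combined count-and-time commit policy described above could look roughly like the following. This is a minimal sketch, not the daemon's actual code: the table layout, the class name BatchCommitter, and the parameter names max_records and max_seconds are all assumptions made for illustration.

```python
import sqlite3
import time


def open_db(path):
    """Open a database with a hypothetical per-domain counter table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tls_stats("
        "domain TEXT PRIMARY KEY, "
        "ok INTEGER NOT NULL DEFAULT 0, "
        "failed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


class BatchCommitter:
    """Commit after max_records upserts or max_seconds, whichever comes first."""

    def __init__(self, conn, max_records=100, max_seconds=5.0,
                 clock=time.monotonic):
        self.conn = conn
        self.max_records = max_records
        self.max_seconds = max_seconds
        self.clock = clock
        self.pending = 0      # upserts since the last commit
        self.oldest = None    # clock value of the oldest uncommitted upsert
        self.commits = 0      # number of commits issued so far

    def upsert(self, domain, success):
        # The column name comes from a fixed set, so interpolation is safe.
        column = "ok" if success else "failed"
        self.conn.execute(
            "INSERT OR IGNORE INTO tls_stats(domain) VALUES (?)", (domain,))
        self.conn.execute(
            f"UPDATE tls_stats SET {column} = {column} + 1 WHERE domain = ?",
            (domain,))
        if self.pending == 0:
            self.oldest = self.clock()
        self.pending += 1
        if self.pending >= self.max_records:
            self.commit()

    def tick(self):
        """Call periodically; commits a partial batch that has waited too long."""
        if self.pending and self.clock() - self.oldest >= self.max_seconds:
            self.commit()

    def commit(self):
        self.conn.commit()
        self.pending = 0
        self.commits += 1
```

With max_records=100 and max_seconds=5.0, 99 uncommitted records left over after a burst are committed by the first tick() call that happens five or more seconds after the oldest of them arrived. The receive loop would have to call tick() from a timer or a receive timeout, otherwise an idle daemon never notices that the deadline has passed.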
In my comments below I assume batches with up to 100 updates, and an MTA sending 1500 updates/s.
I suppose that the simulation involved a loop around blocking database update calls.
In one TLSRPT design, the MTA sends datagrams to the TLSRPT receiver, so that the MTA will not be blocked by the flow control that is part of a connection-oriented protocol.
Perhaps the TLSRPT receiver implementation can use distinct threads for flushing the database and for receiving updates from the MTA, so that the receiver won't miss too many updates during the database flush every 1/15th second?
Unlike an update-generating loop that blocks when a database flushes buffers, the MTA's updates will arrive stochastically in time. If a single-threaded receiver can handle 1500 updates/s in a blocking flow, then I expect that it will start to miss updates above 500/s with a stochastic flow. What happens in the real world will depend on kernel buffer capacity.
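To make the two-thread idea concrete, here is a minimal sketch under stated assumptions: one thread only drains the datagram socket into a bounded in-memory queue, and a second thread removes batches from that queue and hands them to a flush callback (which would wrap the database commit). The function names, the queue, and the batch_size and max_wait parameters are hypothetical and not taken from any TLSRPT implementation.

```python
import queue
import socket
import threading


def receive_loop(sock, q, stats, stop):
    """Drain datagrams quickly; never wait for the database."""
    sock.settimeout(0.1)              # wake up regularly to check 'stop'
    while not stop.is_set():
        try:
            data = sock.recv(2048)
        except socket.timeout:
            continue
        try:
            q.put_nowait(data)
        except queue.Full:
            stats["dropped"] += 1     # count the loss instead of blocking


def flush_loop(q, flush, stop, batch_size=100, max_wait=0.2):
    """Collect up to batch_size updates, then hand them to flush()."""
    while not (stop.is_set() and q.empty()):
        batch = []
        try:
            batch.append(q.get(timeout=max_wait))
            while len(batch) < batch_size:
                batch.append(q.get_nowait())
        except queue.Empty:
            pass
        if batch:
            flush(batch)              # e.g. apply the updates, then commit
```

In this arrangement a slow flush delays only the second thread. Updates are lost when the in-memory queue fills up, or when the kernel's socket buffer overflows because the receiving thread itself cannot keep up. Whether that is good enough in practice would have to be measured.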
This note is based on "TLSRPT for MTAs" Version 0.01. I summarize my understanding of the global architecture, present ball-park performance numbers, and make suggestions for the internal storage.
Overall architecture
Client-side library. Each call reports the status of one TLS session (success, or one of the specified failure modes). The library is written in C and may be called from MTAs written in C or any language that can call into C (examples: C++, Go, Java, and many dynamically-compiled languages).
TLSRPT receiver. This receives one report from a client library over some IPC channel and updates an internal log. There may be one TLSRPT receiver per MTA instance, or a shared receiver for a group of MTA instances. But see performance/reliability considerations below.
Storage layer. This persists status information until it is needed to generate a TLSRPT report.
TLSRPT reporter. This generates a daily aggregate report on behalf of one or more MTA instances, and submits the result according to a policy published by a mail sending domain.
Performance and reliability considerations
A high-performance MTA such as Postfix manages multiple concurrent SMTP connections (up to 100 by default). Each SMTP protocol engine and associated TLS engine are managed by one SMTP client process. Updates through the TLSRPT client library will therefore be made concurrently.
Depending on destinations and configuration, one can expect that a typical Postfix MTA will max out at ~300 outbound connections/second. That figure dates from 2012, when TLS was not as universal as it is now (STARTTLS adds ~three TCP round-trip times), and when computers and networks were a bit slower (but not by a lot). See Viktor Dukhovni's post in https://groups.google.com/g/mailing.postfix.users/c/pPcRJFJmdeA
The C client library does not guarantee that a status update will reach a TLSRPT receiver. A status that cannot be sent will be dropped without blocking progress in an MTA. It is therefore OK if the persistence layer cannot accept every status update; however, it should not lose updates under foreseeable loads.
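The drop-instead-of-block behavior can be illustrated with a non-blocking datagram socket. This is only a sketch in Python of the general technique; the real client library is written in C, and the function names open_reporter and report, as well as the use of a UNIX-domain socket path, are assumptions.

```python
import socket


def open_reporter(path):
    """Connect a non-blocking UNIX-domain datagram socket to the receiver.

    'path' is a hypothetical socket location chosen by the administrator.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.setblocking(False)
    sock.connect(path)
    return sock


def report(sock, payload):
    """Try to send one status update; return False if it had to be dropped."""
    try:
        sock.send(payload)
        return True
    except OSError:
        # Receiver queue full, receiver gone, and so on: give up on this
        # update immediately so that mail delivery is never delayed.
        return False
```

The caller ignores a False result, apart from perhaps counting it. The MTA never waits for the receiver, so a stalled or missing receiver costs lost updates rather than delayed mail.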
The design considers using SQLite for storage. By default the SQLite update latency is measured in hundreds of milliseconds, i.e. at most ~10 updates/second, where a single Postfix instance needs up to ~300 updates/second. Part of this latency is caused by SQLite invoking fsync() for every update. These fsync() calls would not just slow down SQLite, but they would also hurt MTA performance, especially when a message has multiple SMTP destinations. Postfix is careful to call fsync() only once during the entire lifetime of a message; I had to convince Linux distributions to NOT fsync() the maillog file after every record, because their syslogd daemon was consuming more resources than all Postfix processes combined.
The SQLite update latency can be reduced by 'batching' database updates in a write-ahead log (for example, PRAGMA journal_mode = WAL; PRAGMA wal_autocheckpoint = 0; PRAGMA synchronous = NORMAL;), but then you need to flush the write-ahead log periodically, or turn on wal_autocheckpoint. For examples, see https://stackoverflow.com/questions/21590824/sqlite-updating-one-record-is-very-relatively-slow
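For reference, the PRAGMA settings above and the periodic flush could be applied from Python's sqlite3 module as sketched here. The function names are hypothetical, and the choice of a TRUNCATE checkpoint is an assumption; a PASSIVE checkpoint may be preferable when other connections are reading.

```python
import sqlite3


def open_wal_db(path):
    """Open a database in WAL mode with automatic checkpoints disabled."""
    conn = sqlite3.connect(path)
    # journal_mode returns the mode actually in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    conn.execute("PRAGMA wal_autocheckpoint = 0")
    conn.execute("PRAGMA synchronous = NORMAL")
    return conn, mode


def checkpoint(conn):
    """Flush the write-ahead log into the main database file.

    Returns the (busy, log_frames, checkpointed_frames) row from SQLite.
    """
    conn.commit()   # a checkpoint cannot run inside an open write transaction
    return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
```

With wal_autocheckpoint = 0 the write-ahead log grows until checkpoint() is called, so the caller has to schedule that call itself, for example from a timer.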
Observations and suggestions
I am not convinced that batching SQLite updates will be sufficient to handle foreseeable status update rates from even a single Postfix MTA instance.
To handle foreseeable update rates, perhaps TLSRPT internal storage can be implemented as a collection of sequential append-only files with names that identify the reporting time window.
As long as a write to a (local) file is no larger than PIPE_BUF bytes (see below) the POSIX spec guarantees that the write is atomic. Combined with O_APPEND (see below), this guarantees that write-append operations will be serialized. My expectation is that the size of a status update will be well under the minimum PIPE_BUF value.
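A minimal sketch of such append-only storage follows, assuming one file per UTC day and one newline-terminated record per write() call. The file naming scheme, the record format, and the function names are hypothetical.

```python
import os
import time

PIPE_BUF_MIN = 512   # minimum PIPE_BUF value (_POSIX_PIPE_BUF)


def window_path(directory, when=None):
    """Return the file for the reporting window (one UTC day) of 'when'."""
    day = time.strftime("%Y-%m-%d", time.gmtime(when))
    return os.path.join(directory, f"tlsrpt-{day}.log")


def append_record(directory, record, when=None):
    """Append one status record with a single write() on an O_APPEND fd."""
    line = record.encode("utf-8") + b"\n"
    if len(line) > PIPE_BUF_MIN:
        raise ValueError("record exceeds the minimum PIPE_BUF size")
    fd = os.open(window_path(directory, when),
                 os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, line)   # exactly one write() call per record
    finally:
        os.close(fd)
    return written == len(line)
```

For example, append_record("/var/lib/tlsrpt", "domain=example.com result=success") would add one line to that day's file (the directory is only an illustration). A long-running receiver would likely keep the descriptor open and reopen it when the window changes, rather than opening the file for every record as this sketch does.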
Background
Semantics of O_APPEND and atomic writes <= PIPE_BUF. https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
PIPE_BUF (_POSIX_PIPE_BUF) is not smaller than 512 bytes. https://pubs.opengroup.org/onlinepubs/7908799/xsh/limits.h.html