wietse-postfix opened this issue 1 year ago
Preliminary tests have shown that commit operations are expensive in SQLite. Committing only every hundred upsert operations gave a throughput of over 4,500 records per second in a single-threaded daemon on an idle system, and still over 1,500 records per second on a system under load. The tests were done with a loop of 10,000 records.
As we do not need transactional safety for every single record, the expected load should not pose a problem.
We envision two tuneable parameters on the daemon side:
- the number of records after which a commit is forced, and
- the maximum time interval after which a commit is forced even if fewer records have accumulated.
These configuration parameters are not mutually exclusive but act in combination. For example, on a system that commits every 100 records and every five seconds: if 99 records remain uncommitted after a mail burst because no additional mail arrives to trigger the hundredth record, the data will still be saved to disk after at most five seconds, when the timed commit kicks in.
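As a rough illustration of how the two thresholds could combine, here is a minimal sketch using the SQLite C API; the variable names, thresholds, table schema, and the way the timer is driven are placeholders, not the daemon's actual code:

```c
/* Sketch: combine a record-count threshold with a time threshold for
 * SQLite commits. Names, thresholds and table are illustrative only.
 * Build with: cc batch.c -lsqlite3 */
#include <sqlite3.h>
#include <time.h>

static int    commit_every_n    = 100;  /* commit after this many upserts */
static double commit_every_secs = 5.0;  /* ... or after this many seconds */
static int    pending;                  /* upserts since the last COMMIT */
static time_t last_commit;

static void maybe_commit(sqlite3 *db)
{
    if (pending > 0
        && (pending >= commit_every_n
            || difftime(time(NULL), last_commit) >= commit_every_secs)) {
        sqlite3_exec(db, "COMMIT; BEGIN", NULL, NULL, NULL);
        pending = 0;
        last_commit = time(NULL);
    }
}

static void record_update(sqlite3 *db, const char *domain)
{
    char *sql = sqlite3_mprintf(
        "INSERT INTO counters(domain, n) VALUES(%Q, 1) "
        "ON CONFLICT(domain) DO UPDATE SET n = n + 1", domain);

    sqlite3_exec(db, sql, NULL, NULL, NULL);
    sqlite3_free(sql);
    pending++;
    maybe_commit(db);           /* record-count threshold */
    /* an event loop or timer would also call maybe_commit() periodically,
     * so that the time threshold fires even when no new records arrive */
}
```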
In my comments below I assume batches of up to 100 updates, and an MTA sending 1,500 updates/s.
I suppose that the simulation involved a loop around blocking database update calls.
In the TLSRPT design, the MTA sends datagrams to the TLSRPT receiver, so that the MTA will not be blocked by the flow control that is part of a connection-oriented protocol.
Perhaps the TLSRPT receiver implementation can use distinct threads for flushing the database and for receiving updates from the MTA, so that the receiver won't miss too many updates during the database flush every 1/15th second?
Unlike an update-generating loop that blocks while the database flushes buffers, the MTA's updates will arrive stochastically in time. If a single-threaded receiver can handle 1,500 updates/s in a blocking flow, then I expect that it will start to miss updates above 500/s with a stochastic flow. What happens in the real world will depend on kernel buffer capacity.
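As an illustration of the thread split suggested above, a minimal sketch in which one thread only drains the datagram socket into a RAM buffer while a second thread flushes batches on its own schedule; the queue layout, record size, and the flush_to_database() stub are assumptions, not the receiver's actual code:

```c
/* Sketch: one thread drains the datagram socket, a second thread flushes
 * batches to storage, so that reception is never blocked by a flush.
 * Queue layout, record size, and flush_to_database() are placeholders. */
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define QLEN    10000
#define RECSZ   1024

static char            queue[QLEN][RECSZ];
static int             qcount;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

static void flush_to_database(char (*batch)[RECSZ], int n)
{
    /* placeholder for the batched database upserts */
    (void) batch;
    (void) n;
}

static void *reader_thread(void *arg)
{
    int     sock = *(int *) arg;
    char    buf[RECSZ];
    ssize_t len;

    for (;;) {
        len = recv(sock, buf, sizeof(buf), 0);  /* no database work here */
        if (len <= 0)
            continue;
        pthread_mutex_lock(&qlock);
        if (qcount < QLEN)                      /* drop when the buffer is full */
            memcpy(queue[qcount++], buf, (size_t) len);
        pthread_mutex_unlock(&qlock);           /* lock is held only briefly */
    }
    return NULL;
}

static void *flush_thread(void *arg)
{
    static char batch[QLEN][RECSZ];
    int     n;

    (void) arg;
    for (;;) {
        usleep(66000);                          /* roughly 1/15th of a second */
        pthread_mutex_lock(&qlock);
        n = qcount;
        memcpy(batch, queue, (size_t) n * RECSZ);
        qcount = 0;
        pthread_mutex_unlock(&qlock);
        if (n > 0)
            flush_to_database(batch, n);        /* the slow part runs unlocked */
    }
    return NULL;
}
```

With this split, the reader holds the lock only for a memcpy(), so the kernel's socket buffer only needs to cover that brief window rather than the whole database flush.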
I wrote the following in November 2023:
- To handle foreseeable update rates, perhaps TLSRPT internal storage can be implemented as a collection of sequential append-only files with names that correspond to the reporting time window.
- As long as a write to a (local) file is smaller than PIPE_BUF bytes (see below), the POSIX spec guarantees that the write is atomic. Combined with O_APPEND (see below), this guarantees that write-append operations will be serialized. My expectation is that the size of a status update will be well under the minimum PIPE_BUF value.
The above idea based on atomic appends will not work for a multi-writer implementation (multiple writers per file). On many BSD-based systems, PIPE_BUF is the required minimum of 512 bytes. That is already smaller than typical status updates observed with an actual implementation, and a design should be able to handle updates that have 3x the typical size. Gzip compression would reduce the size of an update to 60%, and would not solve the problem.
The idea can still work for a single-writer implementation (i.e. one writer per file), because that does not need atomic appends.
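For reference, a minimal sketch of the append-only, single-write-per-record idea discussed above; the PIPE_BUF check mirrors the atomicity limit from the note (a single-writer setup would not strictly need it), and the file-per-reporting-window naming is only assumed:

```c
/* Sketch of the append-only idea: one write() per status update to a file
 * opened with O_APPEND, with the file name derived from the reporting
 * window. The PIPE_BUF check mirrors the atomicity limit discussed above. */
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF _POSIX_PIPE_BUF        /* fall back to the POSIX minimum (512) */
#endif

static int append_update(const char *window_file, const char *record)
{
    size_t  len = strlen(record);
    int     fd;
    ssize_t n;

    if (len > PIPE_BUF)                 /* atomicity is only promised up to PIPE_BUF */
        return -1;
    fd = open(window_file, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;
    n = write(fd, record, len);         /* one write() per status update */
    close(fd);
    return n == (ssize_t) len ? 0 : -1;
}
```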
The current design is a single-writer implementation, buffering in RAM and writing to disk when a configurable number of datagrams (1,000) has arrived or, in times of low load, after a configurable interval (5 seconds) has passed. The short blocking time while writing data to disk is bridged by kernel buffering.
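The kernel-side buffering mentioned above could be given extra headroom along these lines; a minimal sketch assuming a datagram socket and an illustrative 4 MB buffer, not a value taken from the actual implementation:

```c
/* Sketch: enlarge the kernel receive buffer on the receiver's datagram
 * socket so that updates arriving during a short disk flush are queued
 * by the kernel instead of being dropped. The 4 MB figure is only an
 * illustration. */
#include <stdio.h>
#include <sys/socket.h>

static int set_receive_buffer(int sock)
{
    int want = 4 * 1024 * 1024;         /* illustrative size */
    int got = 0;
    socklen_t len = sizeof(got);

    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) < 0)
        return -1;
    /* the kernel may clamp the value (e.g. Linux net.core.rmem_max) */
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &got, &len) == 0)
        fprintf(stderr, "SO_RCVBUF is now %d bytes\n", got);
    return 0;
}
```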
However, the readout, especially of the domain list, is indeed challenging; it will be solved by switching the database each day.
Writing only happens for today's data, while reading only happens for yesterday's data. So at "UTC midnight" the reader will close the database, rename it to, say, "yesterday.sqlite", and create a fresh database for the new day. This incurs only a very short interruption that should cause no problem, similar to the regular commits.
That way reader and writer do not conflict at all.
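A minimal sketch of that midnight switch-over, assuming SQLite files named today.sqlite and yesterday.sqlite; the file names, schema, and error handling are illustrative only:

```c
/* Sketch of the daily switch-over: at UTC midnight the current database is
 * committed, closed, renamed for the report generator, and a fresh one is
 * opened for the new day. File names and schema are illustrative only. */
#include <sqlite3.h>
#include <stdio.h>

static sqlite3 *rotate_database(sqlite3 *db)
{
    sqlite3 *fresh = NULL;

    sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);   /* flush any open batch */
    sqlite3_close(db);                              /* short interruption starts */
    if (rename("today.sqlite", "yesterday.sqlite") != 0)
        perror("rename");
    if (sqlite3_open("today.sqlite", &fresh) != SQLITE_OK)
        return NULL;
    sqlite3_exec(fresh,
                 "CREATE TABLE IF NOT EXISTS counters"
                 "(domain TEXT PRIMARY KEY, n INTEGER); BEGIN",
                 NULL, NULL, NULL);                 /* reopen the commit batch */
    return fresh;                                   /* short interruption ends */
}
```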
The newest commit not only changed the library build to GNU Autotools but also added a program "bench" in tools/benchmark with several command-line parameters.
The "bench" tool first tries to measure the maximum rate of datagrams with a blocking socket. To avoid several warmup effects like caching, initial database popupation etc the --rampup prameter specifies the seconds to run before determining the maximum rate. After the ramp-up phase non-blocking sockets are used.
Then, in an endless loop, a number of background threads specified with the --threads option are run at varying rates, starting at 10% of the maximum rate and increasing in 10% increments up to 90% before restarting at 10%. The background rate is divided by the number of threads so that the total background load adds up to the intended 10% to 90%.
The --burstwait parameter specifies the seconds to wait between burst loads. The --maxburst parameter specifies the maximum number of datagrams in a burst. The --maxburstsec parameter specifies the maximum number of seconds the burst should take. If either the maximum number of datagrams or the maximum time has been reached, the burst load is stopped. In case of an error sending the datagram, the burst load is also stopped.
This gives an impression of what peak loads can be handled during what average loads.
The --newsock parameter can be used to switch to reusing the existing connection and sockets. When reusing sockets, errors such as EAGAIN sometimes occur when sending, and the error counters of the background threads go up. However, when using a new tlsrpt_connection_t for each datagram, like Postfix does, no such errors occur.
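For reference, a minimal sketch of the kind of non-blocking send path such errors come from: an EAGAIN from sendto() means the kernel buffer is full and the datagram is counted as lost instead of blocking the sender. The plain-socket helpers below are illustrative and are not the bench tool's or the library's code:

```c
/* Sketch: non-blocking datagram send with EAGAIN accounting.
 * Socket setup and counters are illustrative only. */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/un.h>

static long sent_ok, sent_dropped;

static int make_nonblocking(int sock)
{
    int flags = fcntl(sock, F_GETFL, 0);

    return flags < 0 ? -1 : fcntl(sock, F_SETFL, flags | O_NONBLOCK);
}

static void send_update(int sock, const struct sockaddr_un *dest,
                        const char *payload, size_t len)
{
    if (sendto(sock, payload, len, 0,
               (const struct sockaddr *) dest, sizeof(*dest)) == (ssize_t) len)
        sent_ok++;
    else if (errno == EAGAIN || errno == EWOULDBLOCK)
        sent_dropped++;                 /* receiver busy; never block the MTA */
}
```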
This note is based on "TLSRPT for MTAs" Version 0.01. I summarize my understanding of the global architecture, present ball-park performance numbers, and make suggestions for the internal storage.
Overall architecture
Client-side library. Each call reports the status of one TLS session (success, or one of the specified modes). The library is written in C and may be called from MTAs written in C or any language that can call into C (examples: C++, Go, Java, and many dynamically-compiled languages).
TLSRPT receiver. This receives one report from a client library over some IPC channel and updates an internal log. There may be one TLSRPT receiver per MTA instance, or a shared receiver for a group of MTA instances. But see performance/reliability considerations below.
Storage layer. This persists status information until it is needed to generate a TLSRPT report.
TLSRPT reporter. This generates a daily aggregate report on behalf of one or more MTA instances, and submits the result according to a policy published by a mail sending domain.
Performance and reliability considerations
A high-performance MTA such as Postfix manages multiple concurrent SMTP connections (up to 100 by default). Each SMTP protocol engine and associated TLS engine are managed by one SMTP client process. Updates through the TLSRPT client library will therefore be made concurrently.
Depending on destinations and configuration, one can expect that a typical Postfix MTA will max out at ~300 outbound connections/second. This was ~300 in 2012 when TLS was not as universal as it is now (STARTTLS adds ~three TCP round trip times), and when computers and networks were a bit slower (but not by a lot). See Viktor Dukhovni's post in https://groups.google.com/g/mailing.postfix.users/c/pPcRJFJmdeA
The C client library does not guarantee that a status update will reach a TLSRPT receiver. A status that cannot be sent will be dropped without blocking progress in an MTA. It is therefore OK if the persistence layer cannot accept every status update; however, it should not lose updates under foreseeable loads.
The design considers using SQLite for storage. By default the SQLite update latency is measured in hundreds of milliseconds, i.e. 10 updates/second where a single Postfix instance needs up to ~300 updates/second. Part of this latency is caused by SQLite invoking fsync() for every update. These fsync() calls would not just slow down SQLite, but they would also hurt MTA performance, especially when a message has multiple SMTP destinations. Postfix is careful to call fsync() only once during the entire lifetime of a message; I had to convince Linux distributions to NOT fsync() the maillog file after every record, because their syslogd daemon was consuming more resources than all Postfix processes combined.
The SQLite update latency can be reduced by 'batching' database updates in a write-ahead log (for example, PRAGMA journal_mode = WAL; PRAGMA wal_autocheckpoint = 0; PRAGMA synchronous = NORMAL;), but now you need to periodically flush the write-ahead log, or turn on wal_autocheckpoint. For examples, see https://stackoverflow.com/questions/21590824/sqlite-updating-one-record-is-very-relatively-slow
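A minimal sketch of those pragmas applied through the SQLite C API, with an application-driven checkpoint in place of wal_autocheckpoint; the settings are the ones quoted above, not a tested recommendation:

```c
/* Sketch of WAL-based batching: the pragmas reduce per-update fsync() cost,
 * and the application checkpoints the write-ahead log on its own schedule.
 * Error handling is omitted. */
#include <sqlite3.h>

static void setup_wal(sqlite3 *db)
{
    sqlite3_exec(db,
                 "PRAGMA journal_mode = WAL;"
                 "PRAGMA wal_autocheckpoint = 0;"   /* no automatic checkpoints */
                 "PRAGMA synchronous = NORMAL",
                 NULL, NULL, NULL);
}

static void periodic_checkpoint(sqlite3 *db)
{
    /* called from a timer; folds the write-ahead log back into the database */
    sqlite3_wal_checkpoint_v2(db, NULL, SQLITE_CHECKPOINT_PASSIVE, NULL, NULL);
}
```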
Observations and suggestions
I am not convinced that batching SQLite updates will be sufficient to handle foreseeable status update rates from even a single Postfix MTA instance.
To handle foreseeable update rates, perhaps TLSRPT internal storage can be implemented as a collection of sequential append-only files with names that correspond to the reporting time window.
As long as a write to a (local) file is smaller than PIPE_BUF bytes (see below), the POSIX spec guarantees that the write is atomic. Combined with O_APPEND (see below), this guarantees that write-append operations will be serialized. My expectation is that the size of a status update will be well under the minimum PIPE_BUF value.
Background
Semantics of O_APPEND and atomic writes <= PIPE_BUF. https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
PIPE_BUF (_POSIX_PIPE_BUF) is not smaller than 512 bytes. https://pubs.opengroup.org/onlinepubs/7908799/xsh/limits.h.html
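A tiny illustrative check of the two limits referenced here:

```c
/* Sketch: show the compile-time limits referenced above. _POSIX_PIPE_BUF is
 * the portable minimum (512); PIPE_BUF is the platform's actual value when
 * <limits.h> defines it. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("_POSIX_PIPE_BUF (portable minimum): %d\n", _POSIX_PIPE_BUF);
#ifdef PIPE_BUF
    printf("PIPE_BUF on this platform:          %d\n", PIPE_BUF);
#else
    /* when PIPE_BUF is not a compile-time constant, fpathconf(fd, _PC_PIPE_BUF)
     * reports the limit for a given open pipe or FIFO */
    printf("PIPE_BUF varies; use fpathconf(_PC_PIPE_BUF)\n");
#endif
    return 0;
}
```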