InternetAddress parsing is unacceptably slow

sigasigasiga commented 1 year ago

As subject says, InternetAddress parsing is very slow.

If I try to parse an email with like 30000 recipients, it takes about 800 seconds to finish on my machine (i7-3770).

I took a look at what Callgrind says:

 14,947,395  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:_int_free [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  7,700,026  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  6,476,923  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:_int_malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  4,720,635  ???:addrspec_parse [/opt/drweb.com/lib/x86_64-linux-gnu/libgmime-3.0.so.0.206.1]
  4,436,250  /build/glibc-SzIz7B/glibc-2.31/elf/dl-lookup.c:_dl_lookup_symbol_x [/usr/lib/x86_64-linux-gnu/ld-2.31.so]
  4,310,129  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:free [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  3,952,572  ???:g_slice_alloc [/opt/drweb.com/lib/x86_64-linux-gnu/libgmime-3.0.so.0.206.1]

And we can clearly see that all the slowness comes from addrspec_parse function, where GString is made for every recipient, and then it does reallocations many many times.

Currently I can think of only 2 ways to somehow fix this: 1) Write some sort of StringBuilder and use it to create resulting address in order to reduce reallocation time 2) Parse all addresses lazily, only when they are actually requested

sigasigasiga commented 1 year ago

Whenever a solution is chosen, I can try implement it in a library

jstedfast commented 1 year ago

GString is essentially a StringBuilder (if you are referring to the .NET StringBuilder class).

jstedfast commented 1 year ago

g_string_new ("") calls g_string_sized_new(2), but g_string_sized_new() uses MAX(dfl_size, 64), so it should allocate with an initial buffer of 64 bytes.

https://github.com/GNOME/glib/blob/main/glib/gstring.c#L108

That should be plenty for your addrspec (which is 24 characters), so it shouldn't need to resize.

jstedfast commented 1 year ago

It might be possible to refactor the code to re-use the same GString over and over for building the addrspecs (just use g_string_truncate(str, 0) at the start of addrspec_parse) and then just g_malloc()/g_strlcpy() the buffer into a new char* string.

sigasigasiga commented 1 year ago

It might be possible to refactor the code to re-use the same GString over and over for building the addrspecs (just use g_string_truncate(str, 0) at the start of addrspec_parse) and then just g_malloc()/g_strlcpy() the buffer into a new char* string.

That's an interesting idea! Unfortunately, it doesn't seem to give a lot of improvement. After applying this patch with quick-and-dirty implementation of your idea, I got following results:

bychin ~/d/master (master)
> time ./bin/test ~/debug/10k.eml # original implementation

________________________________________________________
Executed in   42,81 secs    fish           external
   usr time   42,80 secs  597,00 micros   42,80 secs
   sys time    0,00 secs   70,00 micros    0,00 secs

bychin ~/d/master (master)
> time ./bin/test ~/debug/10k.eml # static GString

________________________________________________________
Executed in   39,71 secs    fish           external
   usr time   39,69 secs    0,00 micros   39,69 secs
   sys time    0,01 secs  819,00 micros    0,01 secs

jstedfast commented 1 year ago

Here's an implementation of my idea, but it seems to somehow be slower on my macOS box (arm64):

diff --git a/gmime/internet-address.c b/gmime/internet-address.c
index ae07d13f..08209973 100644
--- a/gmime/internet-address.c
+++ b/gmime/internet-address.c
@@ -98,7 +98,7 @@ enum {
    INTERNET_ADDRESS_FOLD   = 1 << 1,
 };

-static gboolean addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at);
+static gboolean addrspec_parse (GString *buffer, const char **in, const char *sentinels, char **addrspec, int *at);

 static void internet_address_class_init (InternetAddressClass *klass);
 static void internet_address_init (InternetAddress *ia, InternetAddressClass *klass);
@@ -387,14 +387,18 @@ internet_address_mailbox_new (const char *name, const char *addr)
 {
    InternetAddressMailbox *mailbox;
    const char *inptr = addr;
+   GString *buffer;

    g_return_val_if_fail (addr != NULL, NULL);

    mailbox = g_object_new (INTERNET_ADDRESS_TYPE_MAILBOX, NULL);
+   buffer = g_string_sized_new (64);

-   if (!addrspec_parse (&inptr, "", &mailbox->addr, &mailbox->at))
+   if (!addrspec_parse (buffer, &inptr, "", &mailbox->addr, &mailbox->at))
        mailbox->addr = g_strdup (addr);
-   
+
+   g_string_free (buffer, TRUE);
+
    _internet_address_set_name ((InternetAddress *) mailbox, name);

    return (InternetAddress *) mailbox;
@@ -412,6 +416,7 @@ void
 internet_address_mailbox_set_addr (InternetAddressMailbox *mailbox, const char *addr)
 {
    const char *inptr = addr;
+   GString *buffer;

    g_return_if_fail (INTERNET_ADDRESS_IS_MAILBOX (mailbox));

@@ -423,9 +428,13 @@ internet_address_mailbox_set_addr (InternetAddressMailbox *mailbox, const char *

    g_free (mailbox->addr);

-   if (!addrspec_parse (&inptr, "", &mailbox->addr, &mailbox->at))
+   buffer = g_string_sized_new (64);
+
+   if (!addrspec_parse (buffer, &inptr, "", &mailbox->addr, &mailbox->at))
        mailbox->addr = g_strdup (addr);
-   
+
+   g_string_free (buffer, TRUE);
+
    g_mime_event_emit (((InternetAddress *) mailbox)->changed, NULL);
 }

@@ -1632,19 +1641,26 @@ domain_parse (GString *str, const char **in, const char *sentinels)
    return dotatom_parse (str, in, sentinels);
 }

+static char *
+buffer_strdup (GString *buffer)
+{
+   char *str = g_malloc (buffer->len + 1);
+   strcpy (str, buffer->str);
+   return str;
+}
+
 static gboolean
-addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at)
+addrspec_parse (GString *buffer, const char **in, const char *sentinels, char **addrspec, int *at)
 {
    const char *inptr = *in;
-   GString *str;
-   
-   str = g_string_new ("");
-   
-   if (!localpart_parse (str, &inptr))
+
+   g_string_truncate (buffer, 0);
+
+   if (!localpart_parse (buffer, &inptr))
        goto error;

    if (*inptr == '\0' || strchr (sentinels, *inptr)) {
-       *addrspec = g_string_free (str, FALSE);
+       *addrspec = buffer_strdup (buffer);
        *in = inptr;
        *at = -1;
        return TRUE;
@@ -1653,8 +1669,8 @@ addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at
    if (*inptr != '@')
        goto error;

-   *at = str->len;
-   g_string_append_c (str, *inptr++);
+   *at = buffer->len;
+   g_string_append_c (buffer, *inptr++);

    if (*inptr == '\0')
        goto error;
@@ -1665,16 +1681,15 @@ addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at
    if (*inptr == '\0')
        goto error;

-   if (!domain_parse (str, &inptr, sentinels))
+   if (!domain_parse (buffer, &inptr, sentinels))
        goto error;

-   *addrspec = g_string_free (str, FALSE);
+   *addrspec = buffer_strdup (buffer);
    *in = inptr;

    return TRUE;

  error:
-   g_string_free (str, TRUE);
    *addrspec = NULL;
    *in = inptr;
    *at = -1;
@@ -1684,7 +1699,7 @@ addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at

 // TODO: rename to angleaddr_parse??
 static gboolean
-mailbox_parse (GMimeParserOptions *options, const char **in, const char *name, InternetAddress **address)
+mailbox_parse (GMimeParserOptions *options, GString *buffer, const char **in, const char *name, InternetAddress **address)
 {
    GMimeRfcComplianceMode mode = g_mime_parser_options_get_address_compliance_mode (options);
    const char *inptr = *in;
@@ -1728,7 +1743,7 @@ mailbox_parse (GMimeParserOptions *options, const char **in, const char *name, I
    // ';' as well in case the mailbox is within a group address.
    //
    // Example: <third@example.net, fourth@example.net>
-   if (!addrspec_parse (&inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
+   if (!addrspec_parse (buffer, &inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
        goto error;

    if (!skip_cfws (&inptr))
@@ -1797,7 +1812,7 @@ group_parse (InternetAddressGroup *group, GMimeParserOptions *options, const cha
 }

 static gboolean
-address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char **in, const char **charset, InternetAddress **address, gint64 offset)
+address_parse (GMimeParserOptions *options, AddressParserFlags flags, GString *buffer, const char **in, const char **charset, InternetAddress **address, gint64 offset)
 {
    GMimeRfcComplianceMode mode = g_mime_parser_options_get_address_compliance_mode (options);
    int min_words = g_mime_parser_options_get_allow_addresses_without_domain (options) ? 1 : 0;
@@ -1885,7 +1900,7 @@ address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char
        if (!(flags & ALLOW_MAILBOX))
            goto error;

-       if (!addrspec_parse (&inptr, sentinels, &addrspec, &at))
+       if (!addrspec_parse (buffer, &inptr, sentinels, &addrspec, &at))
            goto error;

        skip_lwsp (&inptr);
@@ -1961,7 +1976,7 @@ address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char
        /* rewind back to the beginning of the local-part */
        inptr = start;

-       if (!addrspec_parse (&inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
+       if (!addrspec_parse (buffer, &inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
            goto error;

        skip_lwsp (&inptr);
@@ -2048,7 +2063,7 @@ address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char
            name = g_strdup ("");
        }

-       retval = mailbox_parse (options, &inptr, name, address);
+       retval = mailbox_parse (options, buffer, &inptr, name, address);
        g_free (name);
        *in = inptr;

@@ -2072,6 +2087,7 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
    InternetAddress *address;
    const char *charset;
    const char *inptr;
+   GString *buffer;

    if (!skip_cfws (in))
        return FALSE;
@@ -2080,7 +2096,9 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons

    if (*inptr == '\0')
        return FALSE;
-   
+
+   buffer = g_string_sized_new (64);
+
    while (*inptr) {
        gboolean separator_between_addrs = FALSE;

@@ -2089,7 +2107,7 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons

        charset = NULL;

-       if (!address_parse (options, ALLOW_ANY, &inptr, &charset, &address, offset)) {
+       if (!address_parse (options, ALLOW_ANY, buffer, &inptr, &charset, &address, offset)) {
            /* skip this address... */
            while (*inptr && *inptr != ',' && (!is_group || *inptr != ';'))
                inptr++;
@@ -2106,6 +2124,7 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
        /* Note: we loop here in case there are any null addresses between commas */
        do {
            if (!skip_cfws (&inptr)) {
+               g_string_free (buffer, TRUE);
                *in = inptr;

                return FALSE;
@@ -2121,7 +2140,9 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
        if (can_warn && !(separator_between_addrs || (*inptr == '\0') || (is_group && *inptr == ';')))
            _g_mime_parser_options_warn (options, offset, GMIME_WARN_INVALID_ADDRESS_LIST, *in);
    }
-   
+
+   g_string_free (buffer, TRUE);
+
    *in = inptr;

    return TRUE;

jstedfast commented 1 year ago

@sigasigasiga my patch should theoretically get better perf than yours since g_strdup() has to call strlen() on the string, then malloc(), then strcpy() whereas mine just uses the str->len (well, I renamed it to buffer, but same thing) to avoid the strlen() call.

That said, I also did not experience any improvement (in fact, it somehow got slower?).

sigasigasiga commented 1 year ago

@jstedfast I have another idea how to reduce allocations: store addresses in a GString, where each address is separated by \0:

+---+---+---+---+---+----+---+---+---+---+---+----+
| a | @ | b | . | c | \0 | d | @ | e | . | f | \0 |
+---+---+---+---+---+----+---+---+---+---+---+----+

This way GString would become big enough pretty quickly, and thus won't reallocate that often. Resulting GString probably could be stored inside InternetAddressList, but I'm not sure about that.

jstedfast commented 1 year ago

Most likely there's something more costly than the allocations

jstedfast commented 1 year ago

Tested a theory I had which is that it's not the parser that is slow at all, but rather the fact that there is a separate To: header for each set of 3 addresses, which means we parse hundreds of short lists of addresses and then concatenate them.

If I modify the message such that all of the addresses are in a single To: header, the performance is REALLY fast:

real    0m0.476s
user    0m0.067s
sys 0m0.051s

By comparison, this is what I get w/ your original submitted test message:

real    2m23.087s
user    2m22.083s
sys 0m0.944s

jstedfast commented 1 year ago

Here's the modified message that puts all of the addresses in a single To: header.

testMail2.txt

jstedfast commented 1 year ago

With the above patch applied, I now get:

real    1m54.082s
user    1m53.201s
sys 0m0.841s

jstedfast commented 1 year ago

I think I know how to fix the rest of the performance problem... it'll just take some time because I don't have a lot of free time to work on this.

jstedfast commented 1 year ago

Nevermind, wasn't that hard.

sigasigasiga commented 1 year ago

Wow, that's really impressive! Thank you!

jstedfast commented 1 year ago

New timings:

real    0m0.088s
user    0m0.047s
sys 0m0.024s

jstedfast commented 1 year ago

Oops, fixed the build (I had accidentally committed part of another local optimization idea that went unused)

sigasigasiga commented 1 year ago

@jstedfast would you mind creating a new release with this fix in it?

jstedfast commented 1 year ago

released 3.2.13

sigasigasiga commented 1 year ago

Thank you very much!

flokli commented 1 year ago

Did this change some behaviour? Running the notmuch testsuite with 3.2.13 exposes the following failure:

https://github.com/NixOS/nixpkgs/pull/187365#issuecomment-1220596066

jstedfast commented 1 year ago

It shouldn't have

dkg commented 1 year ago

@flokli is correct, this does break the notmuch test suite, see #129 for more detail

dkg commented 1 year ago

I recommend re-opening this issue, because i don't think the changes made to resolve matters addressed the underlying problem; rather, they just worked optimized for a specific sample message.

In particular, i think the problem is that the interaction in gmime-message.c beween message_update_addresses and process_header is a O(N²) operation.

I might be misunderstanding the code here, but it looks to me like as a parser passes through a message's header fields, it invokes message_header_added for each header (that is, each (name, value) pair it encounters in the message's header section), which in turn invokes process_header.

In the old code, which invoked only message_update_addresses when a specific address field was present, each time this is invoked, it clears the internal list of addresses, and then walks the entire list of headers in the message to find every matching element.

So if you have one Cc field, loading it causes a single scan of all headers, plucking out the contents of the relevant field. but if you have multiple Cc fields, each time the parser encounters a one of them, it does another full scan through all headers. So if you have N Cc fields, there will be N invocations of the message_header_added → process_header → message_update_addresses chain, and each of these inner functions will in turn do a full O(N) scan of the entire header list known by the message.

So it's not surprising that a message with 1000 Cc fields might take on the order of 1000000× the compute power to parse compared to a message with only one Cc field. I'm not sure how to fix this code, because it's not clear to me what the differences are between message->addrlists[type] and the contents of message->headers. If message->addrlists[type] is supposed to be some sort of consolidated, cached version describing all of the headers of field type, then what are the cache coherency expectations from this codebase? Do we really need to re-scan all headers each time another header comes in?

jstedfast commented 1 year ago

In particular, i think the problem is that the interaction in gmime-message.c beween message_update_addresses and process_header is a O(N²) operation.

With the old code, that was exactly the problem. The process_header() method called message_update_addresses() method for each and every Cc header that was added to the message headers.

The new logic checks to see if the header was appended (vs inserted), and if it was appended, it calls message_add_addresses() which only parses the new header and appends it to the existing InternetAddressList. The message_update_addresses() method is only called now if the header was inserted (because inserting new Cc addresses into the InternetAddressList in the appropriate place is impossible in that we currently do not hold enough state to know where the appropriate location to insert them would be).

Thus, the new implementation is O(N) instead of O(N²). This was done in commit 4a80ae527df9aa36fe50fac0878207a31d4d6b72.

dkg commented 1 year ago

I understand that the "new implementation" should be O(N), but the "old implementation" is still there, right?

What's stopping it from getting triggered is the logic in header_was_appended which only triggers if the header loaded happens to be the final item in object->headers -- so this is basically trying to detect whether the code is being run from within the current implementation of g_mime_parser_construct_message, without explicitly doing that detection.

This feels a bit like it's laying a small booby trap in the code -- a reader can only understand why it's doing it this way if they see how the different pieces fit together, and it's likely to be brittle if someone makes a change (to either piece of code).

If someone tries to build a message from scratch in a weird way, for example, by building a header block that ends in Subject: but then inserting an address header just before it, they'll still run into the same O(N²) behavior. (granted, since RFC 5322, they shouldn't be inserting multiple address headers of the same name anyway, so maybe this is a "don't do that then" situation)

I don't really have a good way forward to propose concretely, so i'll understand if you just decide to leave it as-is. I just wanted to make sure i'm understanding the situation clearly. Thanks for following up about this!

jstedfast commented 1 year ago

What's stopping it from getting triggered is the logic in header_was_appended which only triggers if the header loaded happens to be the final item in object->headers -- so this is basically trying to detect whether the code is being run from within the current implementation of g_mime_parser_construct_message, without explicitly doing that detection.

(Mostly) True, but that particular use-case is the hardest-hit just due to the fact that the parser adds a whole slew of headers to the message in a loop (and in particular, the address headers which must be combined).

The other case this handles (hence the "Mostly" above) is developers doing g_mime_header_list_append(message->headers, "Cc", "Charles <charles@example.com>", NULL) and/or g_mime_header_list_set(message->headers, "Cc", "Charles <charles@example.com>", NULL) where the header being set (in this case, Cc) doesn't exist.

As you correctly point out, if they do g_mime_header_list_prepend(message->headers, ...), this starts to hit O(N²) again (for address-type headers).

That said, there probably aren't many GMime developers who are doing that. Most likely just let GMimeMessage handle serialization of addresses.

The exception to this is most likely to be prepending "Received" headers (which I know GMime is used for outside of notmuch and other user-facing email clients). Luckily, GMime will not hit any sort of O(N²) situation for Received headers.

granted, since RFC 5322, they shouldn't be inserting multiple address headers of the same name anyway, so maybe this is a "don't do that then" situation)

I actually happen to like the older RFCs for allowing multiple To/Cc/Bcc/etc headers, I think having multiple headers helps to make the raw headers look nicer than 1 address header with a huge list of folded addresses. But that's just my quirkiness, I'm sure :-)

I don't really have a good way forward to propose concretely, so i'll understand if you just decide to leave it as-is. I just wanted to make sure i'm understanding the situation clearly.

There's 2 more things we can do that I've already done in MimeKit:

Instead of having the parser add headers 1 at a time to the GMimeMessage & GMimeObjects, MimeKit passes the equivalent of the full GMimeHeaderList to the new GMimeObject which can then initialize its cached header values.
"Lazy-load" (or, more accurately, lazy-parse/lazy-decode) the GMimeHeader(s) only when requested via APIs like g_mime_message_get_subject(), g_mime_message_get_to(), etc.

The first optimization is what helped make MimeKit about as fast as GMime back in the early days of MimeKit development.

The second optimization is more recent (part of MimeKit v3.0 development) and made a pretty massive performance improvement in MimeKit's MimeParser when I implemented that (not really a surprise seeing as how it defers a ton of expensive parsing until later when the user actually wants to get access to those parsed addresses/etc).

That work was done as part of this optimization effort: https://github.com/jstedfast/MimeKit/issues/695#issuecomment-901551896

I'm sure there's a ton more optimizations we could pull from that since MimeKit and GMime are so similarly designed.

But anyway, yea... current MimeKit v3.x versions have 3 parsers:

The old MimeParser.cs which was more-or-less based on GMimeParser
MimeReader.cs which is more SAX-like and restructured a few things to improve performance
ExperimentalMimeParser.cs which is the newer generation built on top of MimeReader.

The thread that I linked above has a bunch of benchmark results that show the gains made by the newer parsers.

Thanks for following up about this!

jstedfast / gmime

InternetAddress parsing is unacceptably slow #126