Question about parsing X-GM-LABELS

gaynetdinov commented 5 months ago

Hey @nevans!

I wanted to share one more imap hack we have in our codebase. This time it was introduced by me in order to deal with inconsistencies that are coming from X-GM-LABELS.

The issue is that people, in fact, can put whatever they want as a label name. Here is example response from my test Gmail account where I created some nonsense labels.

C: RUBY0004 UID FETCH 6 (UID ENVELOPE BODYSTRUCTURE FLAGS X-GM-LABELS)
S: * 3 FETCH (X-GM-LABELS ("()" "(foobar)" "[foobar]" "\\Draft" "\\Important" "\\Starred" fav) UID 6 FLAGS (\Flagged \Seen) ENVELOPE (...

Since our goal was only to determine whether an email is draft email or not, I came up with the following method for net-imap v0.3.7

FLAG_REGEXP = /\
(?# FLAG        )\\([^\x80-\xff(){ \x00-\x1f\x7f%"\\]+)|\
(?# ATOM        )([^\x80-\xff(){ \x00-\x1f\x7f%*"\\]+)/n

  # This method is mimic of flags_data method, but with some serious regex
  # to match labels inside parenthesis that can contain parenthesis in their names.
  def x_gm_labels
    @flag_symbols = {}
    matched = @str.match(/\(((?:"(?:\\.|[^"\\])*"|[^()\s]*)(\s+(?:"(?:\\.|[^"\\])*"|[^()\s]+))*)\)/ni, @pos)

    if matched
      @pos = matched.end(0)

      # The existing FLAG_REGEXP is not prepared to extract labels/flags like
      # "a b c" correctly. It extracts it as 3 flags "a", "b" and "c".
      # But the good news is that the whole X-GM-LABELS story is about checking
      # if there is "\\Draft" FLAG among those so that we would know
      # that it should not be synced.
      # So inside this patched version of `flags_data` method I'd ignore completely
      # ATOMs and keep only FLAGs.
      matched[1].scan(FLAG_REGEXP).collect { |flag, _atom|
        next unless flag

        symbol = flag.capitalize.intern
        @flag_symbols[symbol] = true
        if @flag_symbols.length > 10_000
          raise FlagCountError, "number of flag symbols exceeded"
        end

        symbol
      }.compact
    else
      parse_error("invalid flag list")
    end
  end

Since v0.3.7 net-imap got rid of FLAG_REGEXP and parsing of X-GM-LABELS is done differently now, but I believe net-imap still would not be able to parse list of Gmail's labels if they contain parentheses or some other special characters.

I've seen list of label from X-GM-LABELS that would look like this (X-GM-LABELS ("Junkmail (filtered)" "[Gmail] (2)/Prod (test)" "\\Draft" "\\Important")

I'm not even sure if one can come up with a regexp that would match any potential label name in the list of X-GM-LABELS. While for our app we don't really need to extract correctly the whole list of X-GM-LABELS I'm curious to make it work. The thing confuses me right now is that in the tests for X-GM-LABEL in net-imap I see this

test_fetch_msg_att_X-GM-LABELS_1:
    :comments: |
      Example copied from https://developers.google.com/gmail/imap/imap-extensions
    :response: "* 1 FETCH (X-GM-LABELS (\\Inbox \\Sent Important \"Muy Importante\"))\r\n"
    :expected: !ruby/struct:Net::IMAP::UntaggedResponse
      name: FETCH
      data: !ruby/struct:Net::IMAP::FetchData
        seqno: 1
        attr:
          X-GM-LABELS:
          - :Inbox
          - :Sent
          - Important
          - Muy Importante
      raw_data: "* 1 FETCH (X-GM-LABELS (\\Inbox \\Sent Important \"Muy Importante\"))\r\n"

So the FLAG labels like \\Inbox or \\Sent are not wrapped into double quotes, but when I fetch them from Gmail they are actually wrapped (see my example above). In my experience only simple names like fav in the example above are not wrapped into ".

Perhaps I'm missing something? Anyways, your feedback or help would be highly appreciated! Thank you for your time!

nevans commented 5 months ago

Sorry for the slow response. Yeah, that's a difference between their documented and actual behavior. Their actual behavior is ...weird. :) They don't provide ABNF, but I infer from the docs that their grammar is (or was): x-gm-label = flag / astring. Net::IMAP consistently converts flag tokens (eg: \atomchars with no quotes) into symbols and atom, astring, quoted, or literal tokens into strings. And I'd prefer that Net::IMAP remain consistent on this, rather than smooth over provider weirdness. There's no semantic difference between word (atom), "word" (quoted) or {4}\r\nword (literal), but we do need to distinqish between \word (flag) and "\\word" (quoted).

See https://github.com/ruby/net-imap/blob/119f43afbef2d0158d0aea68d493c105e56a6c4b/lib/net/imap/response_parser.rb#L1948-L1959

So, I think the only thing you would need change in your code is to check for "\\Draft" in addition to (or instead of) :Draft. Alternatively, you could create a monkey patch like:

class Net::IMAP::ResponseParser
  module PatchXGMLabelsAsSymbols
    private
    def x_gm_labels = super.map { _1.start_with?("\\") ? _1[1..].capitalize.to_sym : _1 }
  end
  prepend PatchXGMLabelsAsSymbols
end

But that of course comes with all of the standard warnings and caveats about the issues with monkeypatching. 😉

nevans commented 5 months ago

What I'm guessing happened was that they originally wanted to send their localizable system labels (XLIST) as flags and user provided (not localizable) labels as astrings, but then (I'm guessing) they realized that the RFC has this to say about flag-extensions:

Server implementations MUST NOT generate flag-extension flags except as defined by a future Standard or Standards Track revisions of this specification.

...so they had to convert their flags to astrings. But they still need to keep the distinction between their localizable system (XLIST) labels and user-provided (not localizable) labels, so they retained the leading \ inside the astring. But then they never updated the docs! (The docs do claim that the labels are all astrings, so maybe they partially updated them?) And also, they long ago deprecated XLIST in favor of the standard SPECIAL-USE, so (in as much as that's what they are returning) they really should return any SPECIAL-USE flags as flag tokens (not astring).

However, changing the returned values would require incrementing UIDVALIDITY on all affected mailboxes (and clients that depend on the current behavior) so they probably just figured it wasn't worth changing. And anyway, they have a much bigger bug on their PERMANENTFLAGS response which they haven't fixed for two decades now, so... :)

ruby / net-imap

Question about parsing X-GM-LABELS #272