Closed: janik-cloudflare closed this issue 6 months ago.
I know where you are coming from and feel your pain. I tried doing something similar for CoreDNS's caching back in the day. Before I read your proposal in much more detail, note that `NewRR` is expensive in itself; you might also just want to create an RR without parsing some stringified rdata, i.e. without `NewRR`. I never found a good Go API for that, although now with generics there might be a clean way?

OK, read it.
I see. Let me be clear that the RFC 1035 zone format is a horrendous format. Even if you remove most of it, like `$TTL` and `$ORIGIN` and `$INCLUDE` and `$GENERATE`, you still have `(` and `)`, which could potentially show up, and newlines should be ignored there. Or do you also want to ban those from parsing?
If we go back to the original task, it's: I have some bytes and I want those bytes to be the rdata of this `dns.RR` I have. So the question then becomes: how do I parse those bytes, and how do I attach the parsed rdata to the `dns.RR` (here you prolly do need some support in this lib)? FWIW, as alluded to in my first comment, I never found an elegant way of doing the latter in Go. Do we need a `dns.RData` interface thingy? This was an attempt to work directly with rdata fields: https://pkg.go.dev/github.com/miekg/dns#Field (obviously that only allows reading them).

So dunno. Open to ideas, but I think a customized parser that works on the text representation isn't the best thing, nor the speediest.
Thank you @miekg! You raise some very good points.
Initially I would've said `(` and `)` should be allowed, but the more I think about it, the more they seem like zone file syntax rather than content syntax, if that makes sense. Really, maybe all we need is the ability to parse quotes and backslash escapes (and skip whitespace). That still leaves us with a lot of type-specific behavior, though: `net.ParseIP`, `toAbsoluteName`, the occasional Base64 decoding, and the nightmare that is `LOC` parsing.

It seems the only lexer function called by each type in `scan_rr.go` is `Next()`, so I wonder if we could somehow turn that into an interface and find an elegant way to inject a different "lexer" behind the scenes (whether an actual lexer or something simpler that just produces the right data at the right time). I'll definitely need to think about that some more though.
Thanks for all the feedback so far!
Coming back to this... still on the dunno. Had kept it in the back of my mind, and my question is: why is this data in text format in the first place? It makes more sense to have this in (uncompressed!) wire format, parse that back in, put the RR metadata on it, and voila -> RR.
In our case it's coming from the API or a database column that needs to be searchable text. In most other cases, I would agree that storing wire format data is better.
Maybe ideally we'd have something similar to #1507 that allows parsing to work without a hard dependency on the lexer, but that is obviously a larger change. I'm still hoping to get a PR up at some point for you to review but if you don't object, I'm going to close this issue in the meantime since it's not really a bug and more of a "request for your thoughts".
Thank you for your input, Miek!
The only take-away I currently have is that the next version of miekg/dns should have something for this.
What we can do (and want to do) with the current code, I still don't know.
/Miek
We often need to construct an RR from its name, type, TTL and content (for example, when reading records from the database or from user API requests).
We currently use something similar to `NewRR(fmt.Sprintf("%s %d IN %s %s", name, ttl, type, content))`, but that requires a fair amount of hacky extra work:

1. We have to prevent `1.2.3.4\n5.6.7.8` from parsing without errors (as `1.2.3.4`).
2. The content could trigger `$INCLUDE` somehow, so we need to disable that as a precaution.
3. `fe80::1\t` is considered valid AAAA record content.

It's also currently impossible to parse generic (RFC 3597) records into a `*dns.RFC3597` when the underlying type is known to this library, because `NewRR` always returns the known type's struct in that case, even when the content started with a `\#` token. `ToRFC3597` can be used to convert it back to a `*dns.RFC3597`, but that calls `pack()`, which changes behavior depending on the underlying type. (For example, calling `NewRR` with `. 1 IN TYPE9999 \# 3 abcdef` produces Rdata `\# 3 abcdef` as expected, whereas changing TYPE9999 to TYPE48 results in `\# 4 abcdef00` instead.)

This is far from ideal for our use case, so I've taken a stab at trying to build something more reasonable, which would require additional public interface. In short, I've been trying to build something that parses only the record's content, without any special zone file syntax.
My first thought was that this would be fairly simple to do with a single function (this one isn't super clean; consider it an example or first draft):
This gets us most of the way there, but I realized the lexer is also context-aware and also parses comments. The former is easily worked around by initializing the `zlexer` with `owner: false`, which makes it expect content instead of an owner directive. The latter is more problematic and seems to require an entirely new, purpose-built lexer. Since `RR.parse()` expects a `*zlexer`, that would either involve making `zlexer` an interface, or adding the new logic to the existing `zlexer` struct. In addition, I noticed all `RR.parse()` functions call `slurpRemainder()`, which would then possibly have to be moved out.

I'd be happy to work on this, but before I spend too much time I wanted to ask whether this is a change you'd consider merging at all (hopefully), and, if so, whether someone with a deeper understanding of the parser and lexer has any design thoughts for me. I know you all tend to be conservative when it comes to adding new public interface (which is good), but I think this could be a useful addition that would enable use cases that are otherwise very difficult to achieve correctly.