marcelog / ex_abnf

Parser for ABNF Grammars
Apache License 2.0

Issue parsing rfc5646 (language tags) #14

Closed by kipcole9 8 months ago

kipcole9 commented 6 years ago

Marcelo, thanks for this lib, it has been very helpful. I've written a parser for RFC 5646, which is going well, but there is one issue where the parser and the grammar don't seem to agree :-)

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extensions)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
                / 4ALPHA            ; or reserved for future use
                / 5*8ALPHA  !!!     # or registered language subtag
                  return_value(:language, state, string_values, values, rule)
                !!!

extlang       = 3ALPHA              ; selected ISO 639 codes
                *2("-" 3ALPHA)      ; permanently reserved

script        = 4ALPHA  !!!         # ISO 15924 code
                  return_value(:script, state, string_values, values, rule)
                !!!

The issue is that the optional ["-" extlang] part of language begins with 3ALPHA and hence will always be preferred over the script part of langtag, which is 4ALPHA. Any thoughts on how to make the match more "greedy"? Or a suggestion on how to rewrite this part of the grammar?

An example of the error is parsing the valid language tag below, where you can see that the parser takes the shortest match, VALE, which matches the script definition rather than the extlang definition.

iex> Cldr.LanguageTag.parse("es-VALENCIA")                        
{:error,
 {Cldr.InvalidLanguageTag,
  "Could not parse language tag.  Error was detected at 'ncia'"}}
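The error position above is consistent with a 4ALPHA rule consuming the first four letters of a longer subtag and leaving the rest unparsed. The following is a minimal sketch of one possible guard, written with plain Regex rather than the ABNF grammar and assuming nothing about how ex_abnf matches internally. The module SubtagSketch and both of its functions are hypothetical names used only for illustration; they are not part of ex_abnf or Cldr.

```elixir
defmodule SubtagSketch do
  # Hypothetical helper for illustration only (not part of ex_abnf or Cldr).

  # Matches 4ALPHA at the start of the input with no boundary check,
  # so a longer subtag can be split in the middle.
  def script_prefix(input) do
    case Regex.run(~r/^[A-Za-z]{4}/, input) do
      [match] -> {:ok, match, String.replace_prefix(input, match, "")}
      nil -> :nomatch
    end
  end

  # Same 4ALPHA rule, but the match only counts when the subtag ends there:
  # the next character must be "-" or the end of the input.
  def script_subtag(input) do
    case Regex.run(~r/^[A-Za-z]{4}(?=-|$)/, input) do
      [match] -> {:ok, match, String.replace_prefix(input, match, "")}
      nil -> :nomatch
    end
  end
end

IO.inspect(SubtagSketch.script_prefix("VALENCIA"))  # {:ok, "VALE", "NCIA"}
IO.inspect(SubtagSketch.script_subtag("VALENCIA"))  # :nomatch
IO.inspect(SubtagSketch.script_subtag("Latn-ES"))   # {:ok, "Latn", "-ES"}
```

With the boundary check, VALENCIA is no longer accepted as a script, so a later alternative in the grammar gets a chance to match the whole subtag. Whether an equivalent check can be expressed inside the grammar itself, or has to live in the embedded Elixir code, is left open here.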

The full parser is on GitHub.