Marcelo, thanks for this lib, it's very helpful. I've written a parser for RFC 5646, which is going well, but there is one issue where the parser and the grammar don't seem to agree :-)
langtag  = language
           ["-" script]
           ["-" region]
           *("-" variant)
           *("-" extensions)
           ["-" privateuse]

language = 2*3ALPHA          ; shortest ISO 639 code
           ["-" extlang]     ; sometimes followed by
                             ; extended language subtags
         / 4ALPHA            ; or reserved for future use
         / 5*8ALPHA !!!      # or registered language subtag
             return_value(:language, state, string_values, values, rule)
           !!!

extlang  = 3ALPHA            ; selected ISO 639 codes
           *2("-" 3ALPHA)    ; permanently reserved

script   = 4ALPHA !!!        # ISO 15924 code
             return_value(:script, state, string_values, values, rule)
           !!!
The issue is that the optional ["-" extlang] part of language is 3ALPHA and hence will always be preferred over the langtag script, which is 4ALPHA. Any thoughts on how to make the match more "greedy"? Or a suggestion on how to rewrite this part of the grammar?
An example of the error is parsing the valid language tag below, where you can see that the parser takes the shortest match, VALE, which matches the script definition rather than the extlang definition.
iex> Cldr.LanguageTag.parse("es-VALENCIA")
{:error,
{Cldr.InvalidLanguageTag,
"Could not parse language tag. Error was detected at 'ncia'"}}
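To make the failure mode concrete, here is a minimal sketch in plain Elixir, independent of the ABNF library. It assumes that the generated parser accepts a rule as soon as it matches a prefix of the remaining input, without checking that the subtag ends at a "-" or at the end of the input; that assumption is a guess based on the error above, not something confirmed from the library source. `SubtagSketch`, `naive_script/1` and `bounded_script/1` are hypothetical names used only for illustration.

```elixir
defmodule SubtagSketch do
  # Hypothetical helpers for illustration only; they are not part of
  # the ABNF library or of Cldr.LanguageTag.

  # Naive rule: any four leading letters are accepted as a script,
  # which mirrors a bare 4ALPHA match with no boundary check.
  def naive_script(rest) do
    case Regex.run(~r/\A[A-Za-z]{4}/, rest) do
      [script] -> {:ok, script, String.replace_prefix(rest, script, "")}
      nil -> :nomatch
    end
  end

  # Bounded rule: the four letters must be followed by "-" or by the
  # end of the input, so a longer subtag is not cut short.
  def bounded_script(rest) do
    case Regex.run(~r/\A[A-Za-z]{4}(?=-|\z)/, rest) do
      [script] -> {:ok, script, String.replace_prefix(rest, script, "")}
      nil -> :nomatch
    end
  end
end

# The input left over after "es-" has been consumed:
IO.inspect(SubtagSketch.naive_script("VALENCIA"))
# {:ok, "VALE", "NCIA"}  (the leftover "NCIA" then cannot be parsed)
IO.inspect(SubtagSketch.bounded_script("VALENCIA"))
# :nomatch  (so a later rule is free to try the whole subtag)
IO.inspect(SubtagSketch.bounded_script("Latn-ES"))
# {:ok, "Latn", "-ES"}
```

If that assumption holds, the leftover "NCIA" lines up with the 'ncia' reported in the error message, and some form of subtag boundary check in the grammar or in the rule callbacks might be one way to avoid the premature match.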
The full parser is on GitHub.