Closed TeamDman closed 1 month ago
Hello @TeamDman.
First of all, thank you very much for the very detailed bug report and reproducers, that's very helpful.
This is a serious bug in the parser and the current implementation is even unsound because the bug is triggered in an unsafe
block.
After digging around a bit I was able to find the root cause for this.
The parser is accepting a &str
but internally works on a &[u8]
because this yields better performance. The input stream is guaranteed to be valid UTF-8, but since the token type is u8
(one bytes) instead of char
(multiple bytes), individual tokens are not UTF-8. You probably see where this is going 😉 .
These two function are the culpit here: https://github.com/martinohmann/hcl-rs/blob/main/crates/hcl-edit/src/parser/string.rs#L228-L236
They work on single bytes, checking if these are ID start/continue. This obviously excludes UTF-8 chars from the mix and the handling for these is broken. I completely overlooked this in the implementation phase.
There are now multiple alternatives for fixing this:
&str
instead of &[u8]
. This will avoid this issue, but also degrades parser performance.&[u8]
but do something similar to what toml_edit
does, which also works on bytes: since the input bytes originated from a &str
we know they are always valid UTF-8, so we can parse out runs of bytes, convert them into a &str
and defer the validity checks (is_id_start
/is_id_continue
) for the individual char
s until then.Alternative 1 is the easiest to implement, but i'd like to avoid harming parser performance too much if it can be avoided, so i'll also try to first explore variant 2, which is a bit more complex to get right, but should have better performance.
Here's a minimal reproducer comparing the current &[u8]
-based ident parser code with a variant that works on &str
:
mod issue_350_minimal {
use winnow::{
stream::AsChar,
token::{one_of, take_while},
PResult, Parser,
};
fn ident_bytes<'a>(input: &mut &'a [u8]) -> PResult<&'a str> {
(
one_of(|b: u8| hcl_primitives::ident::is_id_start(b.as_char())),
take_while(0.., |b: u8| {
hcl_primitives::ident::is_id_continue(b.as_char())
}),
)
.recognize()
.map(|s: &[u8]| {
print_bytes("parsed bytes with `u8` stream", s);
std::str::from_utf8(s)
.expect("`is_id_start` and `is_id_continue` filter out non-utf8")
})
.parse_next(input)
}
fn ident_str<'a>(input: &mut &'a str) -> PResult<&'a str> {
(
one_of(hcl_primitives::ident::is_id_start),
take_while(0.., hcl_primitives::ident::is_id_continue),
)
.recognize()
.map(|s: &str| {
print_bytes("parsed bytes with `char` stream", s.as_bytes());
s
})
.parse_next(input)
}
fn print_bytes(desc: &str, buf: &[u8]) {
print!("{desc}: ");
for b in buf {
print!("0x{b:02x} ");
}
println!();
}
#[test]
fn parse_ident() {
let mut input = "ééé";
let mut input_bytes = input.as_bytes();
print_bytes("input bytes", input_bytes);
let parsed = ident_str.parse(&mut input).unwrap();
assert_eq!(parsed, "ééé");
let parsed = ident_bytes.parse(&mut input_bytes).unwrap();
assert_eq!(parsed, "ééé");
}
}
Output:
running 1 test
test issue_350_minimal::parse_ident ... FAILED
failures:
---- issue_350_minimal::parse_ident stdout ----
input bytes: 0xc3 0xa9 0xc3 0xa9 0xc3 0xa9
parsed bytes with `char` stream: 0xc3 0xa9 0xc3 0xa9 0xc3 0xa9
parsed bytes with `u8` stream: 0xc3
thread 'issue_350_minimal::parse_ident' panicked at crates/hcl-edit/tests/regressions.rs:221:22:
`is_id_start` and `is_id_continue` filter out non-utf8: Utf8Error { valid_up_to: 0, error_len: None }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
issue_350_minimal::parse_ident
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 9 filtered out; finished in 0.00s
error: test failed, to rerun pass `-p hcl-edit --test regressions`
Btw: the reason why the following does not have the same issue is that it works on a &str
and not &[u8]
to parse the identifier, so the UTF-8 handling is correct here:
let ident = "ééé";
ident.parse::<hcl::edit::Ident>().unwrap();
@TeamDman thanks again for the bug report!
I just released the fix for this via hcl-edit
0.8.1 / hcl-rs
0.18.0.
Sadly, the fix decreased the parser performance by 20-30%, but I'm planning to iterate on that to bring performance levels back closer to where they were before. For now, a correct parser is more important than performance.
Funny enough, I also found another bug related to unicode in the error reporting and fixed that along the way.
Let me know if you spot other issues around unicode handling.
Woohoo! Props for the quick fix 🎉🐇
I'm using import blocks to generate hundreds of thousands of lines of HCL, so I definitely appreciate the care being placed on the performance
Thank you!
tl;dr at bottom section
backtrace
``` running 1 test test reflow::tests::utf8_problem ... FAILED successes: successes: failures: ---- reflow::tests::utf8_problem stdout ---- thread 'reflow::tests::utf8_problem' panicked at C:\Users\teamy\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hcl-edit-0.8.0\src\parser\string.rs:243:36: `is_id_start` and `is_id_continue` filter out non-utf8: Utf8Error { valid_up_to: 0, error_len: None } stack backtrace: 0: std::panicking::begin_panic_handler at /rustc/79734f1db8dbe322192dea32c0f6b80ab14c4c1d/library\std\src\panicking.rs:652 1: core::panicking::panic_fmt at /rustc/79734f1db8dbe322192dea32c0f6b80ab14c4c1d/library\core\src\panicking.rs:72 2: core::result::unwrap_failed at /rustc/79734f1db8dbe322192dea32c0f6b80ab14c4c1d/library\core\src\result.rs:1654 3: enum2$Terraform and OpenTofu support labels beginning with utf-8
When I use
terraform plan -generate-config-out="generated.tf"
followed byterraform init
andterraform apply
it works with the special characters in the identifier, so Terraform supports it but hcl-rs doesn't D:also tried
crates:
relevant:
https://github.com/martinohmann/hcl-rs/blob/48de0e7b524cf8619fbf97a9214cc924ecca4697/crates/hcl-edit/src/parser/structure.rs#L63
https://github.com/martinohmann/hcl-rs/blob/48de0e7b524cf8619fbf97a9214cc924ecca4697/crates/hcl-edit/src/parser/string.rs#L220-L226
https://github.com/martinohmann/hcl-rs/blob/48de0e7b524cf8619fbf97a9214cc924ecca4697/crates/hcl-edit/src/parser/string.rs#L193-L200
https://github.com/martinohmann/hcl-rs/blob/48de0e7b524cf8619fbf97a9214cc924ecca4697/crates/hcl-primitives/src/ident.rs#L277-L309
tangent
The Hashicorp [spec](https://github.com/hashicorp/hcl/blob/212a40e528766634a1aa6dd1e820d7936762196e/hclsyntax/spec.md#identifiers) says > ### Identifiers > Identifiers name entities such as blocks, attributes and expression variables. Identifiers are interpreted as per [UAX #31][uax31] Section 2. Specifically, their syntax is defined in terms of the `ID_Start` and `ID_Continue` character properties as follows: > ```ebnf Identifier = ID_Start (ID_Continue | '-')*; ``` > The Unicode specification provides the normative requirements for identifier parsing. Non-normatively, the spirit of this specification is that `ID_Start` consists of Unicode letter and certain unambiguous punctuation tokens, while `ID_Continue` augments that set with Unicode digits, combining marks, etc. > The dash character `-` is additionally allowed in identifiers, even though that is not part of the unicode `ID_Continue` definition. This is to allow attribute names and block type names to contain dashes, although underscores as word separators are considered the idiomatic usage. [uax31]: http://unicode.org/reports/tr31/ "Unicode Identifier and Pattern Syntax" We can search the repo for ID_START https://github.com/search?q=repo%3Ahashicorp%2Fhcl%20ID_START&type=code ![image](https://github.com/martinohmann/hcl-rs/assets/9356891/26333e7f-e495-4657-addc-8b75899d3039) which reveals the truth for what is supported https://github.com/hashicorp/hcl/blob/212a40e528766634a1aa6dd1e820d7936762196e/hclsyntax/unicode_derived.rl preview: ```rs ID_Start = 0x41..0x5A #L& [26] LATIN CAPITAL LETTER A..LATIN CAPI... | 0x61..0x7A #L& [26] LATIN SMALL LETTER A..LATIN SMALL ... | 0xC2 0xAA #Lo FEMININE ORDINAL INDICATOR | 0xC2 0xB5 #L& MICRO SIGN | 0xC2 0xBA #Lo MASCULINE ORDINAL INDICATOR | 0xC3 0x80..0x96 #L& [23] LATIN CAPITAL LETTER A WITH GRAVE.... ... a whole bunch more ``` Looks like the [`unicode_ident`](https://rtic.rs/stable/api/unicode_ident/index.html) crate used by `hcl-rs` is also targetting the [UAX #31][uax31] spec. > Implementation of [Unicode Standard Annex #31](https://www.unicode.org/reports/tr31/) for determining which char values are valid in programming language identifiers. I tried asking ChatGPT if "é" is a valid ID_START or ID_CONTINUE character and it said both yes and no when I restarted the chat, so I'm not sure where the spec deviates, it's either Terraform+OpenTofu or the `unicode_ident` crate. ... I asked again and it gave the following ```python import regex char = 'é' # Check if it's a valid ID_Start character is_id_start = bool(regex.match(r'\p{ID_Start}', char)) # Check if it's a valid ID_Continue character is_id_continue = bool(regex.match(r'\p{ID_Continue}', char)) is_id_start, is_id_continue ``` > True, True https://chatgpt.com/share/e910f089-fee1-47ad-bdbe-e988cde57aeeTL;DR - parsing fails for some reason and I have no idea why
backtrace
``` Finished `test` profile [unoptimized + debuginfo] target(s) in 0.08s Running unittests src/main.rs (target\debug\deps\unicode_ident_uax_31-ae869406db3b55a7.exe) running 4 tests test test::hcl_edit_ident_test ... ok test test::hcl_primitives_test ... ok test test::unicode_ident_test ... ok test test::hcl_edit_body_test ... FAILED failures: ---- test::hcl_edit_body_test stdout ---- thread 'test::hcl_edit_body_test' panicked at C:\Users\teamy\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hcl-edit-0.8.0\src\parser\string.rs:243:36: `is_id_start` and `is_id_continue` filter out non-utf8: Utf8Error { valid_up_to: 0, error_len: None } stack backtrace: 0: std::panicking::begin_panic_handler at /rustc/79734f1db8dbe322192dea32c0f6b80ab14c4c1d/library\std\src\panicking.rs:652 1: core::panicking::panic_fmt at /rustc/79734f1db8dbe322192dea32c0f6b80ab14c4c1d/library\core\src\panicking.rs:72 2: core::result::unwrap_failed at /rustc/79734f1db8dbe322192dea32c0f6b80ab14c4c1d/library\core\src\result.rs:1654 3: enum2$