Closed dckc closed 8 years ago
I tried this (in my client code), but I got error: trait `InputBuffer` is private.
/// Like `run_scanner` but without throwing away final state.
fn scan_aux<I: Copy, S: Copy, F>(i: Input<I>, s: S, mut f: F) -> SimpleResult<I, (&[I], S)>
    where F: FnMut(S, I) -> (S, bool) {
    use chomp::input::InputBuffer;
    let b = i.buffer();
    let mut state = s;
    match b.iter().position(|&c| { let (v, cont) = f(state, c);
                                   if cont { state = v; false }
                                   else { true } }) {
        Some(n) => i.replace(&b[n..]).ret((&b[0..n], state)),
        // TODO: Should this following 1 be something else, seeing as take_while1 is potentially
        // infinite?
        None => i.incomplete(1),
    }
}
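For illustration, the intent of `scan_aux` can be sketched over plain slices, without any of Chomp's types. The name `scan_keep_state` and the `Option` return value are assumptions made for this sketch (here `None` stands in for "incomplete"); this is not part of Chomp's API.

```rust
/// Feeds items to `f` until it returns `false` as its second value.
/// Returns the consumed prefix, the remaining input, and the state
/// accumulated so far. Returns `None` if the input runs out before
/// `f` asks to stop (a stand-in for an "incomplete" result).
pub fn scan_keep_state<I: Copy, S: Copy, F>(
    input: &[I],
    init: S,
    mut f: F,
) -> Option<(&[I], &[I], S)>
where
    F: FnMut(S, I) -> (S, bool),
{
    let mut state = init;
    for (n, &item) in input.iter().enumerate() {
        let (next, cont) = f(state, item);
        if !cont {
            // The item that stopped the scan is not consumed, and the
            // state returned alongside `false` is discarded.
            return Some((&input[..n], &input[n..], state));
        }
        state = next;
    }
    None
}
```

As in the snippet above, the item that ends the scan stays in the remaining input, so the caller gets both the matched prefix and the final state.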
`InputBuffer` is not private, it is just exposed through the `primitives` module instead of the `input` module directly. This is to avoid using anything private from the `input` module in Chomp itself (i.e. Chomp should be implemented with the same limitations as any third-party combinators/parsers).
As for parsing a single UTF-8 character: is there any specific reason why you would want to use `scan` for this? If you are parsing a string you should probably just treat it as a slice of bytes, and then use `std::str::from_utf8` to convert it to a string (combined with your own error type to wrap both `Utf8Error` and `chomp::Error`, this will make it pretty flexible while still keeping it zero-copy).
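A minimal sketch of that suggestion, using only the standard library. `ParseError` is a hypothetical stand-in for the parser library's error type, and `length_prefixed_str` (one length byte followed by that many bytes) is an invented format for illustration; neither comes from Chomp.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical stand-in for the parser library's own error type.
#[derive(Debug, PartialEq)]
pub struct ParseError;

/// Wraps both failure modes so callers deal with a single error type.
#[derive(Debug, PartialEq)]
pub enum Error {
    Parse(ParseError),
    Utf8(Utf8Error),
}

impl From<ParseError> for Error {
    fn from(e: ParseError) -> Self {
        Error::Parse(e)
    }
}

impl From<Utf8Error> for Error {
    fn from(e: Utf8Error) -> Self {
        Error::Utf8(e)
    }
}

/// Reads one length byte, then that many bytes, and validates them as
/// UTF-8. The returned `&str` borrows from `input`, so nothing is copied.
pub fn length_prefixed_str(input: &[u8]) -> Result<(&str, &[u8]), Error> {
    let (&len, rest) = input.split_first().ok_or(ParseError)?;
    let len = len as usize;
    if rest.len() < len {
        return Err(ParseError.into());
    }
    let (bytes, rest) = rest.split_at(len);
    let s = str::from_utf8(bytes)?; // Utf8Error converts via `From`
    Ok((s, rest))
}
```

The two `From` impls let `?` convert either failure into the combined `Error`, and the result's lifetime is tied to the input slice.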
If you are looking for parsing a single character into a `char`, then making a specific parser for that would be suitable. Such a parser would be useful as a part of the Chomp library itself even.
I am parsing a single character into a `char`.
This is what I managed to get working (before I saw your clue about the `primitives` module), though the repeated parsing is far from ideal. I'm not sure 4 is the longest representation of a char in utf8, either:
fn utf8_char(i: Input<u8>) -> U8Result<char> {
    fn validate_char<'a>(i: Input<'a, u8>, bs: &'a [u8]) -> U8Result<'a, char> {
        let ss = if bs.len() > 0 { str::from_utf8(bs).ok() } else { None };
        if let Some(s) = ss {
            let ch = s.chars().next().unwrap();
            i.ret(ch)
        } else {
            i.err(chomp::parsers::Error::new())
        }
    }
    or(i, |i| take(i, 1).bind(validate_char),
       |i| or(i, |i| take(i, 2).bind(validate_char),
              |i| or(i, |i| take(i, 3).bind(validate_char),
                     |i| take(i, 4).bind(validate_char))))
}
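On the length question: a Rust `char` encodes to at most 4 bytes in UTF-8, and the leading byte determines the sequence length, so the repeated attempts can be avoided. The following is a std-only sketch over plain slices; `utf8_len` and `decode_char` are hypothetical names, and `None` here covers both invalid and incomplete input.

```rust
use std::str;

/// Expected sequence length for a given leading byte, or `None` if the
/// byte cannot start a valid UTF-8 sequence.
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation bytes and bytes never used in UTF-8
    }
}

/// Decodes one `char` from the front of `input`, returning it together
/// with the remaining bytes.
pub fn decode_char(input: &[u8]) -> Option<(char, &[u8])> {
    let &first = input.first()?;
    let len = utf8_len(first)?;
    if input.len() < len {
        return None; // not enough bytes for the announced length
    }
    let (head, rest) = input.split_at(len);
    // `from_utf8` still validates the continuation bytes.
    let ch = str::from_utf8(head).ok()?.chars().next()?;
    Some((ch, rest))
}
```

With this approach each character is validated once, on exactly the number of bytes its leading byte announces.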
FWIW, the string parser is more straightforward, though if there's a more concise/idiomatic way to do this sort of validation, I'd be interested to know:
fn utf8_str(i: Input<u8>) -> U8Result<String> {
    fn check<'a>(i: Input<'a, u8>, bs: &'a [u8]) -> U8Result<'a, String> {
        match String::from_utf8(bs.to_owned()) {
            Ok(s) => i.ret(s),
            Err(_) => i.err(chomp::parsers::Error::new()) // TODO: nicer error?
        }
    };
    parse!{i;
        let len = var_int(); // TODO: check for int overflow
        let s = i -> take(i, len as usize).bind(check);
        ret s
    }
}
It looks pretty close to what I would do currently. At a later point there will probably be more tools for dealing with UTF-8 in Chomp.
One thing though, since you do not have anything in particular to parse from the string itself (e.g. escape sequences) you can use `std::str::from_utf8` instead of `String::from_utf8` to prevent an allocation. Of course this will tie the lifetime to the input slice, but that might not be an issue depending on usage.
@dckc Is this solved?
Yes, I expect the clue about the `primitives` module solves this. I haven't tested it, though.
I'm trying to parse one utf8 character. I tried `run_scanner` and `std::char::from_u32`, but it doesn't work because when I get a whole character, the way to signal it is to return `None`, which throws away the state.