Currently, errors are marked by byte offsets into the file, which is not very accessible for humans. It should report line numbers instead (and probably filenames as well, when processing a split-up database). Line-number ranges are probably not needed; the line of the left end of the range should be sufficient.
I have two suggestions for how to add line annotations:

1. Extend `parser::Span` to include line info (line and character, as well as byte index). There are potentially quite a few `Span`s generated, so this could be a lot of extra info. Alternatively, track line/char only for the spans used in diagnostics, rather than for every token. This still needs a way to get the info out of the parser as it arrives.
2. Add a `LineData` structure which maps lines to byte indexes. This can be used to look up a line number on demand, given the byte indexes that are stored already.

I originally tried the first method, but I am now leaning toward the second after realizing how prevalent `Span` is for tokens everywhere.

The `LineData` structure needs about 4 bytes per line. If it is stored as an array of line-start indexes, though, it will perform poorly with respect to adding data in the middle of the file in incremental mode, because all the indexes after the insertion point must be shifted. It may be possible to store line lengths instead, in a binary tree whose internal nodes are annotated with the sum of their children, in order to quickly calculate the total number of characters in any range of lines (see the sketch below).
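A minimal sketch of that tree idea, here in Fenwick (binary indexed) form, which is the compact array layout of a sum-annotated binary tree; the `LineLengths` name and API are made up for illustration:

```rust
/// Illustrative only: per-line byte lengths in a Fenwick tree, so that
/// line-start offsets are O(log n) prefix sums and a length change from
/// an edit is an O(log n) point update.
struct LineLengths {
    tree: Vec<i64>, // 1-based Fenwick array over line lengths
}

impl LineLengths {
    fn new(num_lines: usize) -> Self {
        LineLengths { tree: vec![0; num_lines + 1] }
    }

    /// Adjust the length of line `i` (0-based) by `delta` bytes.
    fn add(&mut self, mut i: usize, delta: i64) {
        i += 1;
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg(); // step to the next covering node
        }
    }

    /// Total bytes in lines `0..i`, i.e. the byte offset where line `i` starts.
    fn start_of_line(&self, mut i: usize) -> i64 {
        let mut sum = 0;
        while i > 0 {
            sum += self.tree[i];
            i -= i & i.wrapping_neg(); // strip the lowest set bit
        }
        sum
    }
}
```

Mapping a byte offset back to a line number is then a binary search over `start_of_line`. Note this handles edits that change a line's length in O(log n), but inserting or deleting whole lines would still need a rebuild (or a genuinely node-based balanced tree, as suggested above).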
Any thoughts?
Last time I measured this, smetamath allocates 17 MB of `Span` structs, representing 21% of the memory consumed over the baseline of just loading the file. So I've been looking at ways to make them smaller.
`LineData` is the SMM2 approach which I'm planning to bring forward. It can be done with the same level of incrementality as the parser itself, I'm fairly sure.
You might not actually want an array of line starts, because building such an array will be rather hard on branch prediction. Rather, in the post-MacOS-EOL world, the line number at offset `x` in `buf` is just `buf[..x].iter().filter(|&&c| c == 10).count()`. That loop still generates a conditional, but slight variations on it don't.
Actually I've gotten line-counting loops like the one below to run at 12–13 GB/s, so a line cache is not entirely necessary, although it would improve responsiveness if you make an edit that results in 100+ notes.
```rust
fn lines(mut buf: &[u8]) -> usize {
    let mut count = 0usize;
    while buf.len() >= 128 {
        // Sum 128 bytes' worth of 0/-1 flags into an i8: even 128 newlines
        // yields -128, which still fits, so the chunk sum cannot overflow.
        let ps: i8 = buf[0..128].iter().map(|&x| if x == b'\n' { -1i8 } else { 0i8 }).sum();
        count += ps.wrapping_neg() as u8 as usize;
        buf = &buf[128..];
    }
    // Handle the final partial chunk the same way.
    let ps: i8 = buf.iter().map(|&x| if x == b'\n' { -1i8 } else { 0i8 }).sum();
    count += ps.wrapping_neg() as u8 as usize;
    count
}
```
Wow, that's definitely doable then. It may help to encapsulate that function in a `LineData` server, though, which stores the last queried offset and bases the next query on it if the new offset is larger, resetting to the beginning otherwise. That way, in the common case where you need two offsets within a line or two of each other, or many errors throughout the file in increasing order, you don't have to make a thousand passes through the file to compute each line number.
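A minimal sketch of such a server (the `LineCache` name and API are invented for illustration):

```rust
/// Illustrative only: a stateful line counter that remembers the last
/// queried offset, so a run of queries at increasing offsets scans the
/// buffer once in total rather than once per query.
struct LineCache<'a> {
    buf: &'a [u8],
    last_offset: usize,
    last_line: usize,
}

impl<'a> LineCache<'a> {
    fn new(buf: &'a [u8]) -> Self {
        LineCache { buf, last_offset: 0, last_line: 0 }
    }

    /// 0-based line number containing byte `offset`.
    fn line_at(&mut self, offset: usize) -> usize {
        if offset < self.last_offset {
            // Backward query: reset and rescan from the start of the file.
            self.last_offset = 0;
            self.last_line = 0;
        }
        // Count only the newlines between the cached position and `offset`.
        self.last_line += self.buf[self.last_offset..offset]
            .iter()
            .filter(|&&c| c == b'\n')
            .count();
        self.last_offset = offset;
        self.last_line
    }
}
```

The inner `filter().count()` could of course be replaced by the chunked sum above for speed.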
EDIT: Why `-1i8`? I'm guessing it has something to do with the two's complement representation. Is there a way to write that without a conditional? (There's also `(x == b'\n') as u8`, which maps to `0` or `1` instead, but I don't know how these all get assembled.)
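As a guess at the `-1i8` question (not settled in this thread): vector byte-compare instructions such as x86's `pcmpeqb` produce an all-ones byte, i.e. -1, for each match, so the 0/-1 formulation is exactly the shape the autovectorizer wants. The `(x == b'\n') as u8` cast also compiles branch-free; a minimal sketch of that variant, with an invented name:

```rust
/// Illustrative only: the 0/1 cast formulation, chunked so the u8
/// accumulator cannot overflow (at most 255 newlines per chunk).
fn lines_cast(buf: &[u8]) -> usize {
    let mut count = 0usize;
    for chunk in buf.chunks(255) {
        let ps: u8 = chunk.iter().map(|&x| (x == b'\n') as u8).sum();
        count += ps as usize;
    }
    count
}
```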