hxtk / Rust-Scanner

Advanced IO handler for Rust following the usage patterns of java.util.Scanner
MIT License

Fails to recognize delimiters that fall on the boundary between two buffers #4

Open hxtk opened 6 years ago

hxtk commented 6 years ago

This is related to issue #3: both solutions require the ability to read multiple buffers of input without consuming them.

Terminating Delimiters

Currently we search for terminating delimiters using the following algorithm:

buffer = first_buffer
while (buffer does not contain delimiter):
    append buffer to result string
    consume buffer.size()
    buffer = next_buffer
append buffer[0..match.start()] to result string
consume match.start()

Note that in the first test case below, the delimiter is "  " (two spaces). The first buffer ends with one space and the second buffer begins with the other. Together they form a delimiter, but that delimiter appears in neither buffer alone, so the entirety of both buffers is read as token content.
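A minimal, std-only illustration of the failure (independent of `Scanner` itself): with a 4-byte `BufReader` over `"foo  bar"`, neither buffer contains the two-space delimiter even though the stream does.

```rust
use std::io::{BufRead, BufReader};

/// Return the first two buffers a capacity-`cap` BufReader yields for `input`.
fn first_two_buffers(input: &[u8], cap: usize) -> (Vec<u8>, Vec<u8>) {
    let mut br = BufReader::with_capacity(cap, input);
    let first = br.fill_buf().unwrap().to_vec();
    br.consume(first.len());
    let second = br.fill_buf().unwrap().to_vec();
    (first, second)
}

fn main() {
    let (first, second) = first_two_buffers(b"foo  bar", 4);
    let has_delim = |b: &[u8]| b.windows(2).any(|w| w == b"  ");
    // The delimiter straddles the buffer boundary: invisible to either buffer alone,
    // but present in the stream as a whole.
    assert!(!has_delim(&first) && !has_delim(&second));
    assert!(has_delim(b"foo  bar"));
    println!(
        "first = {:?}, second = {:?}",
        String::from_utf8_lossy(&first),
        String::from_utf8_lossy(&second)
    );
}
```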

The best algorithm I can come up with, which follows, depends on rust-lang/regex#425; otherwise we must write our own regex engine and implement a hits_end() -> bool that returns true if the DFA is not in a dead state after consuming the last input character. A true result would mean the buffer ends with a prefix of some word in the regex's language, so consuming additional input may or may not complete a delimiter.

buffer = first_buffer
loop:
    while (buffer does not contain delimiter
            AND buffer does not end with delimiter prefix):
        append buffer to result string
        consume buffer.size()
        buffer = next_buffer
    while (buffer ends in delimiter prefix):
        buffer += next_buffer
    if (buffer contains delimiter):
        append buffer[0..match.start()] to result string
        consume match.start()
        break
# note that if the combined buffer didn't end up containing a delim,
# it will be appended to the result string on the next iteration of the loop at the top
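For a *literal* delimiter, the "ends with delimiter prefix" test in the loop above needs no regex machinery at all. A sketch, assuming a byte-string delimiter (the general regex case still needs the partial-match information from rust-lang/regex#425):

```rust
/// True iff some nonempty suffix of `buf` is a proper prefix of `delim`,
/// i.e. reading more input *might* complete a delimiter match at the end
/// of `buf`. Literal delimiters only; the regex case needs "DFA not dead"
/// information from the regex engine.
fn ends_with_delim_prefix(buf: &[u8], delim: &[u8]) -> bool {
    (1..delim.len())
        .rev()
        .any(|k| buf.len() >= k && buf[buf.len() - k..] == delim[..k])
}

fn main() {
    // "foo " could still grow into the delimiter "  "; "foob" cannot.
    assert!(ends_with_delim_prefix(b"foo ", b"  "));
    assert!(!ends_with_delim_prefix(b"foob", b"  "));
}
```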

Such functionality would also help us parse arbitrarily long starting delimiters. However, since there is nothing (simple) we can do to remedy that, we must consider an alternative. I propose the following:

while (result string does not contain delimiter):
    append buffer to result string
    buffer = next_buffer
consume match.start()
return result_string[0..match.start()]

Note that this requires reading multiple buffers without consuming them. By default, BufRead.fill_buf() will happily return the same buffer over and over again if you do not consume() between calls.
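That behavior of fill_buf() is easy to check with std alone:

```rust
use std::io::{BufRead, BufReader};

/// Call fill_buf() twice with no consume() in between and return both results.
fn fill_twice(input: &[u8], cap: usize) -> (Vec<u8>, Vec<u8>) {
    let mut br = BufReader::with_capacity(cap, input);
    let a = br.fill_buf().unwrap().to_vec();
    let b = br.fill_buf().unwrap().to_vec();
    (a, b)
}

fn main() {
    // Without an intervening consume(), fill_buf() returns the same bytes:
    // only consume() advances the reader.
    let (a, b) = fill_twice(b"foo  bar", 4);
    assert_eq!(a, b);
    assert_eq!(a, b"foo ".to_vec());
}
```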

Precedent Delimiters

This has all of the same problems as the above, with one added caveat: without the ability to check for delimiter prefixes, it is impossible to tell the difference between a string that has no precedent delimiter and a string that has a very long precedent delimiter, except by reading to EOF.

This issue technically still exists in the case of trailing delimiters, but in that case there is no efficiency hit, because reading to EOF when there is no trailing delimiter is actually the desired behavior.

Test Cases

/// This test will fail if we do not solve the above problem in a way that
/// preserves the tail of the original buffer, because in this test case the
/// terminating delimiter begins within the first buffer size and
/// ends within the second.
#[test]
fn buffer_ends_within_end_delim() {
    let string: &[u8] = b"foo  bar";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim_str("  ");

    assert_eq!(test.next(), Some(String::from("foo")));
}

/// This test will fail if we cannot detect partial matches of the delimiter
/// when skipping over prefixed delimiters. Because the buffer size is 4, it
/// will read "aaaa", which is not in the language of /a+b/; however, the
/// automaton is not in a dead state either: reading a "b" would put us in
/// an accepting state, thus we must read more input to know if the regex is
/// satisfied. Reading an additional character will result in "aaaab", which
/// is a valid delimiter in this language and should therefore be skipped.
#[test]
fn buffer_ends_within_start_delim() {
    let string: &[u8] = b"aaaabfoo";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim(Regex::new(r"a+b").unwrap());

    assert_eq!(test.next(), Some(String::from("foo")));
}
hxtk commented 6 years ago

Note: the better algorithm above also requires the ability to read multiple buffers at a time, so it is not an improvement in that sense.

It is only better because the complexity of the regex matching is Ω(buffer size) while the second choice is Ω(max{(result string size), (buffer size)}). Combined with the fact that even the pseudocode is twice as long, I think this is premature optimization in the case of trailing delimiters.

In the case of precedent delimiters, it has a difference of Ω(buffer size) vs Ω(file size) in the case that there is no precedent delimiter, so it is still necessary for that case, especially in use cases where one scans an infinite stream (e.g., network programming) or files too large to fit in memory.

However, since they are precedent delimiters, we change the algorithm slightly in that case:

buffer = first_buffer
loop:
    while (buffer starts with delimiter):
        consume buffer.size()
        buffer = next_buffer
    while (whole buffer is delimiter prefix):
        buffer += next_buffer
    if (buffer does not start with delimiter):
        break
# note that if the combined buffer was a delim, it will be consumed
# back at the top
hxtk commented 6 years ago

That ignores greedy operators. Test case:

- buffer size: 4
- delim: /a[ab]*b/
- input string: aaabbc
- reference (expected): c
- result (actual): bc

You can fix this with a little reordering:

buffer = first_buffer
loop:
    while (whole buffer is delimiter prefix):
        buffer += next_buffer
    if (buffer starts with delimiter):
        consume buffer.size()
        buffer = next_buffer
    else:
        break
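A sketch of this reordered loop for a *literal* delimiter, simulating `cap`-byte buffered reads from an in-memory slice. The function name and shape are illustrative, not the Scanner API, and the regex case additionally needs the partial-match information discussed above.

```rust
/// Skip leading occurrences of the literal `delim`, reading `cap` bytes
/// at a time from `input`. Returns how many bytes were consumed as
/// delimiters; the token would begin at that offset.
fn skip_leading_delims(input: &[u8], cap: usize, delim: &[u8]) -> usize {
    let mut pos = 0; // bytes consumed so far
    let mut look: Vec<u8> = Vec::new(); // unconsumed lookahead
    loop {
        // Refill the lookahead with one "buffer" of input if it is empty.
        if look.is_empty() {
            let end = (pos + cap).min(input.len());
            look.extend_from_slice(&input[pos..end]);
        }
        // While the whole lookahead is still a proper prefix of the
        // delimiter, extend it rather than consuming.
        while !look.is_empty() && look.len() < delim.len() && delim.starts_with(&look) {
            let start = pos + look.len();
            if start == input.len() {
                break; // EOF: the prefix can never complete
            }
            let end = (start + cap).min(input.len());
            look.extend_from_slice(&input[start..end]);
        }
        // If the lookahead starts with a full delimiter, consume it; else
        // token content begins here.
        if look.starts_with(delim) {
            pos += delim.len();
            look.drain(..delim.len());
        } else {
            break;
        }
    }
    pos
}

fn main() {
    // Two delimiters' worth of spaces split across 3-byte buffers.
    assert_eq!(skip_leading_delims(b"    foo", 3, b"  "), 4);
    // A lone space is a prefix, not a delimiter: nothing is skipped.
    assert_eq!(skip_leading_delims(b" foo", 4, b"  "), 0);
    println!("ok");
}
```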
hxtk commented 6 years ago

I think that in order to read multiple buffer widths, we will have to implement BufRead on our Scanner and write our own buffer logic.

The buffer should function identically to BufReader in the general case, but it should use a vector instead of a fixed slice. Further, we will provide a method extend_buf() or something similar that will read buffer size bytes into the buffer, even if the buffer already has some contents, extending it if necessary.

As a consequence, any call to consume() will invalidate the buffer, but that is okay: Rust already disallows that operation by requiring that the borrow taken for extend_buf() has expired before consume() can be called.

After extending the buffer, it should be allowed to shrink back to its normal size over time as calls to consume() are made: we will not (nor can we, without making some assumptions about the underlying Read) go out of our way to shrink it back.
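A minimal sketch of such a buffer. The names (`GrowBuf`, `extend_buf`) are hypothetical, and this is not the eventual design: it only shows the idea of a BufReader-like type that buffers into a Vec so extend_buf() can append another buffer's worth of input without discarding what is already there.

```rust
use std::io::{Read, Result};

/// BufReader-like wrapper whose buffer can grow past one capacity's worth.
struct GrowBuf<R: Read> {
    inner: R,
    buf: Vec<u8>,
    cap: usize,
}

impl<R: Read> GrowBuf<R> {
    fn new(inner: R, cap: usize) -> Self {
        GrowBuf { inner, buf: Vec::new(), cap }
    }

    /// Return the buffered bytes, reading one buffer's worth if empty.
    fn fill_buf(&mut self) -> Result<&[u8]> {
        if self.buf.is_empty() {
            self.extend_buf()?;
        }
        Ok(&self.buf)
    }

    /// Read up to `cap` more bytes, appending to the existing contents.
    /// Returns how many bytes were added (0 at EOF).
    fn extend_buf(&mut self) -> Result<usize> {
        let old = self.buf.len();
        self.buf.resize(old + self.cap, 0);
        let n = self.inner.read(&mut self.buf[old..])?;
        self.buf.truncate(old + n);
        Ok(n)
    }

    /// Drop `amt` bytes from the front; the Vec shrinks naturally over time.
    fn consume(&mut self, amt: usize) {
        self.buf.drain(..amt);
    }
}

fn main() {
    let mut gb = GrowBuf::new(&b"foo  bar"[..], 4);
    assert_eq!(gb.fill_buf().unwrap(), b"foo ");
    // The delimiter "  " straddles the boundary; extend instead of consuming.
    gb.extend_buf().unwrap();
    assert_eq!(gb.fill_buf().unwrap(), b"foo  bar");
    gb.consume(5); // token "foo" plus the two-space delimiter
    assert_eq!(gb.fill_buf().unwrap(), b"bar");
}
```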

hxtk commented 6 years ago

That buffer data structure is starting to sound kinda complicated to handle ad hoc. I've created a separate issue #5 dealing with the data structure. This issue is on hold pending that issue's resolution.

hxtk commented 6 years ago

See abonander/buf_redux, which provides the necessary lookahead features. We will not have to implement our own BufRead after all.

hxtk commented 6 years ago

The last commit I pushed (36c64a3) contains breaking changes. I don't think they were necessary, but I'll have to reevaluate that in the morning.

hxtk commented 6 years ago

We still fail to detect precedent delims longer than the input buffer, but all trailing delims work.