hxtk opened this issue 6 years ago
Note: the better algorithm above also requires the ability to read multiple buffers at a time, so it is not an improvement in that sense.
It is only better because the complexity of the regex matching is Ω(buffer size), while the second choice is Ω(max{result string size, buffer size}). Combined with the fact that even the pseudocode is twice as long, I think this is premature optimization in the case of trailing delimiters.

In the case of precedent delimiters, the difference is Ω(buffer size) vs. Ω(file size) when there is no precedent delimiter, so the better algorithm is still necessary for that case, especially in use cases where one scans an infinite stream (e.g., network programming) or files too large to fit in memory.
However, since they are precedent delimiters, we change the algorithm slightly in that case:
```
buffer = first_buffer
loop:
    while (buffer starts with delimiter):
        consume the delimiter match   # only the match, not the whole buffer
        buffer = next_buffer
    while (whole buffer is delimiter prefix):
        buffer += next_buffer
    if (buffer does not start with delimiter):
        break
    # note that if the combined buffer was a delim, it will be consumed
    # back at the top
```
That ignores greedy operators. Test case:

```
buffer size:  4
delim:        /a[ab]*b/
input string: aaabbc
reference:    c
result:       bc
```

The first buffer, `aaab`, already starts with a complete match, so the algorithm consumes it immediately and never discovers that the greedy `[ab]*` could have extended the delimiter to `aaabb`.
You can fix this with a little reordering:
```
buffer = first_buffer
loop:
    while (whole buffer is delimiter prefix):
        buffer += next_buffer
    if (buffer starts with delimiter):
        consume the delimiter match   # only the match, not the whole buffer
        buffer = next_buffer
    else:
        break
```
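As a sanity check, here is a self-contained simulation of the reordered loop over in-memory chunks. The helpers `is_delim_prefix` and `longest_delim_match` are hand-rolled stand-ins for the example delimiter `/a[ab]*b/`; a real implementation would need this functionality from the regex engine (see the discussion of rust-lang/regex#425 below).

```rust
/// Whole-buffer delimiter-prefix test for /a[ab]*b/: every proper prefix
/// of a match has the form a[ab]*.
fn is_delim_prefix(buf: &[u8]) -> bool {
    !buf.is_empty()
        && buf[0] == b'a'
        && buf[1..].iter().all(|&b| b == b'a' || b == b'b')
}

/// Length of the longest match of /a[ab]*b/ at the start of `buf`, if any:
/// the last 'b' inside the leading [ab]* run ends the greedy match.
fn longest_delim_match(buf: &[u8]) -> Option<usize> {
    if buf.first() != Some(&b'a') {
        return None;
    }
    let run = 1 + buf[1..].iter().take_while(|&&b| b == b'a' || b == b'b').count();
    (1..run).rev().find(|&i| buf[i] == b'b').map(|i| i + 1)
}

/// The reordered loop from the pseudocode above, over an iterator of chunks.
/// Returns the unconsumed remainder after all precedent delimiters.
fn skip_precedent_delims(chunks: &mut impl Iterator<Item = Vec<u8>>) -> Vec<u8> {
    let mut buffer = chunks.next().unwrap_or_default();
    loop {
        // Grow while the *whole* buffer could still become (part of) a delimiter.
        while is_delim_prefix(&buffer) {
            match chunks.next() {
                Some(next) => buffer.extend_from_slice(&next),
                None => break, // EOF: nothing left to complete the delimiter
            }
        }
        match longest_delim_match(&buffer) {
            Some(len) => {
                buffer.drain(..len); // consume only the match, then re-check
                if buffer.is_empty() {
                    buffer = chunks.next().unwrap_or_default();
                }
            }
            None => return buffer, // buffer no longer starts with a delimiter
        }
    }
}

fn main() {
    // The test case above: buffer size 4, input "aaabbc".
    let mut chunks = vec![b"aaab".to_vec(), b"bc".to_vec()].into_iter();
    assert_eq!(skip_precedent_delims(&mut chunks), b"c".to_vec());
}
```

Swapping the consume step back in front of the grow step, as in the first version above, makes this same test return `bc` instead of `c`.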
I think that in order to read multiple buffer widths, we will have to implement `BufRead` on our `Scanner` and write our own buffer logic.

The buffer should function identically to `BufReader` in the general case, but it should use a vector instead of a fixed slice. Further, we will provide a method `extend_buf()` or something similar that will read buffer-size bytes into the buffer even if the buffer already has contents, extending it if necessary. As a consequence, any call to `consume` will invalidate the buffer, but that is okay because Rust already disallows that operation by requiring that the borrow taken by `extend_buf()` has expired before `consume()` is called.

After extending the buffer, it should be allowed to shrink back to normal size over time as calls to `consume` are made: we will not (nor can we, without making some assumptions about the underlying `Read`) go out of our way to shrink it back to normal size.
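As a rough sketch of the idea (everything here is hypothetical, including the `GrowBuf` name and its internals; `extend_buf()` is the method proposed above):

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical growable buffered reader: behaves like `BufReader` until
/// `extend_buf()` is called, after which it holds more than one buffer's
/// worth of data until it is consumed.
struct GrowBuf<R: Read> {
    inner: R,
    buf: Vec<u8>, // may grow past `chunk` via extend_buf()
    pos: usize,   // bytes of `buf` already consumed
    chunk: usize, // the "normal" buffer size
}

impl<R: Read> GrowBuf<R> {
    fn new(inner: R, chunk: usize) -> Self {
        GrowBuf { inner, buf: Vec::new(), pos: 0, chunk }
    }

    /// Read up to one more buffer's worth of bytes *without* discarding
    /// unconsumed data, growing the buffer if necessary, and return the
    /// whole unconsumed slice.
    fn extend_buf(&mut self) -> io::Result<&[u8]> {
        let old_len = self.buf.len();
        self.buf.resize(old_len + self.chunk, 0);
        let n = self.inner.read(&mut self.buf[old_len..])?;
        self.buf.truncate(old_len + n);
        Ok(&self.buf[self.pos..])
    }
}

impl<R: Read> BufRead for GrowBuf<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        if self.pos >= self.buf.len() {
            // Everything was consumed: shrink back to a normal-sized
            // buffer and refill it.
            self.buf.resize(self.chunk, 0);
            let n = self.inner.read(&mut self.buf)?;
            self.buf.truncate(n);
            self.pos = 0;
        }
        Ok(&self.buf[self.pos..])
    }

    fn consume(&mut self, amt: usize) {
        self.pos = (self.pos + amt).min(self.buf.len());
    }
}

impl<R: Read> Read for GrowBuf<R> {
    // BufRead requires Read; delegate through the buffer.
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let available = self.fill_buf()?;
        let n = available.len().min(out.len());
        out[..n].copy_from_slice(&available[..n]);
        self.consume(n);
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut r = GrowBuf::new(&b"aaabbc"[..], 4);
    assert_eq!(r.fill_buf()?.to_vec(), b"aaab".to_vec());
    assert_eq!(r.extend_buf()?.to_vec(), b"aaabbc".to_vec()); // nothing discarded
    r.consume(5); // consume the delimiter "aaabb"
    assert_eq!(r.fill_buf()?, b"c");
    Ok(())
}
```

Note that the shrink-back described above falls out naturally: once everything is consumed, `fill_buf()` starts over with a normal-sized read.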
That buffer data structure is starting to sound kinda complicated to handle ad-hoc. I've created a separate issue #5 dealing with the data structure. This issue is on hold pending that issue's resolution.
See abonander/buf_redux, which provides the necessary lookahead features. We will not have to implement our own `BufRead` after all.
The last commit I pushed (36c64a3) contains breaking changes. I don't think they were necessary, but I'll have to reevaluate that in the morning.
We still fail to detect precedent delims longer than the input buffer, but all trailing delims work.
This is related to Issue #3 in that both solutions require that we are able to read multiple buffers of information without consuming them.
### Terminating Delimiters
Currently we search for terminating delimiters using the following algorithm:
Note that in the first test case below, the delimiter is two consecutive spaces. The first buffer ends with one space and the second buffer begins with one space. Together they form a delimiter, but this delimiter is not contained in either buffer, so the entirety of both buffers is read.
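As a quick illustration with a made-up input (not one of the test cases below), sketched with the regex crate:

```rust
use regex::Regex;

fn main() {
    let delim = Regex::new(r"  ").unwrap(); // two consecutive spaces

    // Input "abc  d" read through a 4-byte buffer splits as "abc " + " d".
    let (first, second) = ("abc ", " d");

    // Neither buffer contains the delimiter on its own...
    assert!(!delim.is_match(first));
    assert!(!delim.is_match(second));

    // ...but the delimiter sits exactly on the buffer boundary.
    assert!(delim.is_match(&format!("{first}{second}")));
}
```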
The best algorithm I can come up with, which follows, depends on rust-lang/regex#425; otherwise we must write our own regex engine and implement

```
hits_end() -> bool
```

which returns true if the DFA is not in a dead state after we consume the last input character. This would imply that the buffer ends with a prefix of a word in that regex's language, and if we consume additional input it may or may not become a delimiter.

Such functionality would also help us with the issue of parsing arbitrarily long starting delims. However, since there is nothing (simple) we can do to remedy that, we must consider an alternative. I propose the following:
Note that this requires reading multiple buffers without consuming them. By default, `BufRead::fill_buf()` will happily return the same buffer over and over again if you do not `consume()` between calls.
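That default is easy to observe with the standard library alone:

```rust
use std::io::{BufRead, BufReader, Cursor};

fn main() {
    let mut r = BufReader::with_capacity(4, Cursor::new(b"abcdef".to_vec()));

    // Two fill_buf() calls with no consume() in between yield the same bytes.
    let first = r.fill_buf().unwrap().to_vec();
    let second = r.fill_buf().unwrap().to_vec();
    assert_eq!(first, b"abcd");
    assert_eq!(first, second);

    // Only consume() advances the reader; "ef" stays out of reach until then.
    r.consume(2);
    assert_eq!(r.fill_buf().unwrap(), b"cd");
}
```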
### Precedent Delimiters
This has all of the same problems as the above, with one added caveat: without the ability to check for delimiter prefixes, it is impossible to tell the difference between a string that has no precedent delimiter and a string that has a very long precedent delimiter, except by reading to EOF.

This issue technically still exists in the case of trailing delimiters, but there it carries no efficiency hit, because reading to EOF when there is no trailing delimiter is exactly the desired behavior.
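To illustrate the dead-state check that `hits_end()` describes, here is a hand-rolled, anchored sketch for the example delimiter `/a[ab]*b/` from earlier in this thread; it implements the "whole buffer is delimiter prefix" test used in the pseudocode above. This is an illustration only: with rust-lang/regex#425 (or our own engine), the DFA would be derived from the compiled delimiter pattern instead of being written by hand.

```rust
/// States of a tiny hand-rolled anchored DFA for the delimiter /a[ab]*b/.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Start, // nothing consumed yet
    Inner, // matched "a[ab]*" but not yet a final 'b'
    Match, // matched "a[ab]*b" (the greedy match could still grow)
    Dead,  // no continuation can ever match
}

fn step(s: State, b: u8) -> State {
    match (s, b) {
        (State::Start, b'a') => State::Inner,
        (State::Inner, b'a') | (State::Match, b'a') => State::Inner,
        (State::Inner, b'b') | (State::Match, b'b') => State::Match,
        _ => State::Dead,
    }
}

/// `hits_end()`: true if the DFA is still in a live state after consuming
/// the whole buffer, i.e. the buffer is a prefix of some delimiter (or a
/// delimiter that more input could still extend).
fn hits_end(buf: &[u8]) -> bool {
    let mut s = State::Start;
    for &b in buf {
        s = step(s, b);
        if s == State::Dead {
            return false;
        }
    }
    true
}

fn main() {
    assert!(hits_end(b"a"));    // prefix of a delimiter
    assert!(hits_end(b"aaab")); // a full match that could still grow to "aaabb"
    assert!(!hits_end(b"c"));   // dead: no delimiter starts with 'c'
}
```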
### Test Cases