rickardp / splitstream

Continuous object splitter for C and Python
Apache License 2.0

Only discard document buffer exceeding length when entire buffer is scanned #6

Closed tajmorton closed 6 years ago

tajmorton commented 6 years ago

@rickardp

Fixes a bug where the in-progress document buffer (s->doc) could be discarded whenever it exceeded the max document size, even if part of it had not been scanned yet. If scan() found a document (i.e., end > 0), we should not discard what remains in s->doc: that remainder has not been scanned, so we don't know whether it contains a document of valid length. This can happen when bufsize is much larger than maxdocsize, because a single read pulls a large amount of data out of the file, and the data left in the buffer after a scan can then exceed the max document size.
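
To illustrate the scenario, here is a simplified Python model of the behaviour described above. It is not the library's code: `split_model`, its parameters, and the newline-delimited "documents" are assumptions made purely for illustration, and the real scanner works on a C buffer and tracks nesting. The sketch only contrasts discarding whenever the buffer is too long with discarding only when a scan finds no document (end == 0).

```python
def split_model(chunks, max_doc_size, discard_only_when_fully_scanned=True):
    """Toy model of the in-progress buffer. Hypothetical, not splitstream's API.

    Assumes each document is terminated by a newline byte.
    """
    doc = b""          # stands in for s->doc
    found = []
    for chunk in chunks:            # each chunk models one read of bufsize bytes
        doc += chunk
        while True:
            # end > 0 means a complete document was found; 0 means none
            end = doc.find(b"\n") + 1
            if end > 0:
                found.append(doc[:end - 1])
                doc = doc[end:]     # the remainder has not been scanned yet
            too_long = len(doc) > max_doc_size
            if too_long and (end == 0 or not discard_only_when_fully_scanned):
                # Old behaviour discards here even when end > 0;
                # the fixed behaviour only discards after a fruitless scan.
                doc = b""
            if end == 0:
                break
    return found


# One large read (bufsize much larger than max_doc_size) holding four small documents.
data = [b"a\nb\nc\nd\n"]
old = split_model(data, max_doc_size=4, discard_only_when_fully_scanned=False)
new = split_model(data, max_doc_size=4, discard_only_when_fully_scanned=True)
print(old)  # only the first document survives
print(new)  # all four documents are found
```

In this model the old policy loses three valid documents because the unscanned remainder (6 bytes) exceeds the limit of 4 right after the first document is extracted. For simplicity the model does not check the length of emitted documents.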

rickardp commented 6 years ago

Thanks for the contribution! The idea from the beginning was that documents larger than max should always be discarded, but this did not hold before either, so I believe your change makes things better!