GregoryKimball opened 1 week ago
Byte range reading of size `chunk_size` in the JSON reader is implemented by reading at most `total_bytes_read = chunk_size + search_subchunks_size` bytes, where `search_subchunks_size` is the additional bytes read to search for the end of the last incomplete line in the `chunk_size` range. The maximum number of subchunks in `search_subchunks_size` is fixed, though the size of each subchunk is still a function of `chunk_size`. We are not catching the error thrown when the newline character is not found within `search_subchunks_size`.
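For concreteness, the search described above can be sketched in Python. This is an illustrative model of the logic only, not the actual libcudf implementation; the function name, the subchunk count, and the subchunk sizing are all assumptions for illustration.

```python
def read_byte_range(data: bytes, offset: int, chunk_size: int,
                    max_subchunks: int = 10) -> bytes:
    # Illustrative model (not the libcudf API): read `chunk_size` bytes,
    # then scan a fixed number of subchunks past the boundary for the
    # newline that completes the last incomplete row.
    subchunk_size = max(chunk_size // max_subchunks, 1)
    end = min(offset + chunk_size, len(data))
    if end == len(data) or data[end - 1:end] == b"\n":
        return data[offset:end]  # chunk already ends on a row boundary
    pos = end
    for _ in range(max_subchunks):
        window_end = min(pos + subchunk_size, len(data))
        newline = data.find(b"\n", pos, window_end)
        if newline != -1:
            return data[offset:newline + 1]
        pos = window_end
        if pos == len(data):
            return data[offset:pos]
    # total_bytes_read is capped at chunk_size + max_subchunks * subchunk_size;
    # this is the uncaught error described in the issue.
    raise RuntimeError("newline not found within the search subchunks")
```

A row longer than the fixed search budget triggers the error even though more data is available, which is the failure mode reported here.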
In the repro above, the error occurs in the first `byte_size` pass itself (having a C++ test repro helped narrow this down). The size of the first line is ~2 MB, but `total_bytes_read` is capped at ~1.09 MB.
Proposed solution: reallocate to 2× `total_bytes_read` when the newline character is not found within the first set of subchunks. Also consider doubling the size of each subchunk on each realloc pass. @vuule do you think this is a reasonable approach?

Update: consider adding a sanity threshold, say 1 GB, so that the reallocation size does not blow up over multiple passes.
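The proposal can be sketched in Python as an illustrative model (hypothetical names, not the actual libcudf implementation): when a full set of subchunks comes up empty, double the subchunk size and keep scanning, bailing out at the sanity threshold.

```python
def read_byte_range_with_growth(data: bytes, offset: int, chunk_size: int,
                                max_subchunks: int = 10,
                                sanity_limit: int = 1 << 30) -> bytes:
    # Illustrative model of the proposed fix: scan subchunks past the
    # chunk boundary for the newline that completes the last row; when a
    # full pass of subchunks finds nothing, double the subchunk size
    # (roughly doubling total_bytes_read per pass), up to `sanity_limit`.
    subchunk_size = max(chunk_size // max_subchunks, 1)
    end = min(offset + chunk_size, len(data))
    if end == len(data) or data[end - 1:end] == b"\n":
        return data[offset:end]
    pos = end
    total_bytes_read = chunk_size
    while total_bytes_read < sanity_limit:
        for _ in range(max_subchunks):
            window_end = min(pos + subchunk_size, len(data))
            newline = data.find(b"\n", pos, window_end)
            if newline != -1:
                return data[offset:newline + 1]
            total_bytes_read += window_end - pos
            pos = window_end
            if pos == len(data):
                return data[offset:pos]
        subchunk_size *= 2  # double on each realloc pass
    raise RuntimeError("row exceeds the sanity threshold")
```

With geometric growth the search reaches a ~2 MB row in a handful of passes, while the threshold keeps a pathological input (no newline at all) from growing the allocation unboundedly.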
Sounds good. A separate potential problem here is that we might get duplicate rows in the output if rows are more than twice the size of byte ranges. Not sure if this is something we'll have to account for. @GregoryKimball is this 1MB byte range an actual use case, or perhaps something like a stress test?
Oh you're right, I did not consider the duplicate row problem. If reading the entire source list over byte ranges is a common use case (low memory footprint constraints?), then a possible solution is to add a reader option for this behavior and have `read_json` estimate the `byte_range_size` such that the row size is less than twice `byte_range_size`.
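One way such an estimate could work is sketched below. This is purely illustrative: `estimate_byte_range_size` is a hypothetical helper, not a cudf API, and a real implementation would sample row lengths rather than scan the whole source.

```python
def estimate_byte_range_size(data: bytes, requested_size: int) -> int:
    # Hypothetical helper, not a cudf API: pick a byte-range size no
    # smaller than requested such that the longest row is shorter than
    # twice the range, avoiding the duplicate-row hazard discussed above.
    # A real implementation would sample row lengths instead of scanning
    # the entire source up front.
    longest_row = max((len(row) + 1 for row in data.split(b"\n") if row),
                      default=1)
    size = requested_size
    while longest_row >= 2 * size:
        size *= 2
    return size
```

For example, with a 100-byte longest row and a requested 16-byte range, the helper would settle on 64 bytes, the first power-of-two multiple for which the row fits in under two ranges.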
When reading JSONL data of complete documents for LLM training, we encounter an unrecoverable error. The error does not occur when reading the whole file without using byte ranges.
(please excuse me for posting an NVIDIA internal path)