rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] JSON parsing bug with byte-range reading #16123

Open · GregoryKimball opened this issue 1 week ago

GregoryKimball commented 1 week ago

When reading JSONL data of complete documents for LLM training, we encounter an unrecoverable error. The error does not occur when reading the whole file without using byte ranges.

(please excuse me for posting an NVIDIA internal path)

import cudf
src = '/datasets/prospector-lm/Books3_shuf/resharded/books3_00000.jsonl'

byte_size = 1_000_000
for n in range(10):
    try:
        print(f'processing chunk {n}')
        df = cudf.read_json(src, lines=True, byte_range=(byte_size*n, byte_size))  # byte_range is (offset, size)
        print(f'received rows count={len(df)}')
    except Exception as err:
        print(err)
        break
(base) rapids@293a80b63032:/nfs/20240618_no_oom_prefetch$ python byte_range_bug.py 
processing chunk 0
received rows count=2
processing chunk 1
received rows count=4
processing chunk 2
received rows count=3
processing chunk 3
CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/json/json_tree.cu:272: JSON Parser encountered an invalid format at location 669943

shrshi commented 6 days ago

Byte range reading of size chunk_size in the JSON reader is implemented by reading at most total_bytes_read = chunk_size + search_subchunks_size, where search_subchunks_size is the additional bytes read to search for the end of the last incomplete line in the chunk_size range. The maximum number of subchunks in search_subchunks_size is fixed, though the size of each subchunk is still a function of chunk_size.

We are not catching the error thrown if the newline character is not found within search_subchunks_size. In the repro above, the error occurs in the first byte_size pass itself (having a cpp test repro helped narrow this down). The size of the first line is ~2MB, but total_bytes_read is capped at ~1.09MB.

Proposed solution: Reallocate to 2× total_bytes_read when the newline character is not found within the first set of subchunks. Also consider doubling the size of each subchunk for each realloc pass. @vuule do you think this is a reasonable approach?

Update: Consider adding a sanity threshold, say 1GB, so that the reallocation size does not blow up over multiple passes.
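
For illustration, here is a minimal Python sketch of the proposed doubling behavior. It is not the cuDF C++ reader; read_bytes, the initial search-window size, and the handling of corner cases (such as a chunk ending exactly on a newline) are assumptions made for the example.

SANITY_CAP = 1 << 30  # proposed 1GB sanity threshold on total bytes searched

def read_chunk_with_line_end(read_bytes, offset, chunk_size):
    # read_bytes(offset, size) stands in for reading raw bytes from the source.
    # Returns the chunk plus the bytes needed to complete its last line,
    # doubling the extra search window whenever no newline is found.
    extra = max(1, chunk_size // 10)  # initial search window (assumed fraction)
    while True:
        buf = read_bytes(offset, chunk_size + extra)
        if len(buf) < chunk_size + extra:
            return buf  # reached end of file; nothing left to search
        newline = buf.find(b"\n", chunk_size)  # search past the requested range
        if newline != -1:
            return buf[: newline + 1]  # chunk plus the completed last line
        if chunk_size + extra >= SANITY_CAP:
            raise RuntimeError("newline not found within the sanity threshold")
        extra *= 2  # proposed fix: double the search window and re-read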

vuule commented 6 days ago

> Proposed solution: Reallocate to 2× total_bytes_read when the newline character is not found within the first set of subchunks. Also consider doubling the size of each subchunk for each realloc pass. @vuule do you think this is a reasonable approach?

Sounds good. A separate potential problem here is that we might get duplicate rows in the output if a row is more than twice the size of the byte range. Not sure if this is something we'll have to account for. @GregoryKimball is this 1MB byte range an actual use case, or perhaps something like a stress test?
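
To make the duplicate-row concern concrete, a hedged repro sketch follows; the file name, row sizes, and loop are illustrative assumptions. On current builds it may simply hit the parser error from this issue, but with the proposed reallocation in place it would show whether a row spanning more than two byte ranges gets emitted more than once.

import json
import os
import cudf

path = "long_row.jsonl"   # hypothetical local file for illustration
byte_size = 1_000         # byte range smaller than half of the long row below

# One ~5KB row followed by a short row, so the first row spans several ranges.
with open(path, "w") as f:
    f.write(json.dumps({"text": "x" * 5_000}) + "\n")
    f.write(json.dumps({"text": "short"}) + "\n")

total_rows = 0
for offset in range(0, os.path.getsize(path), byte_size):
    df = cudf.read_json(path, lines=True, byte_range=(offset, byte_size))
    total_rows += len(df)

print(total_rows)  # 2 expected; a larger count would indicate duplicated rows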

shrshi commented 6 days ago

Oh, you're right, I did not consider the duplicate row problem. If reading an entire source list over byte ranges is a common use case (low memory footprint constraints?), then a possible solution is to have a reader option for this behavior and have read_json estimate byte_range_size such that the maximum row size is less than twice byte_range_size.
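
A rough sketch of that estimation, with hypothetical names (choose_byte_range_size and max_row_bytes are not existing cuDF APIs) and assuming the maximum row size is known from a pre-scan or a user hint:

def choose_byte_range_size(requested_size: int, max_row_bytes: int) -> int:
    # Pick a byte range large enough that the largest row is less than twice
    # the range size, avoiding both the failed newline search and the
    # duplicate-row concern discussed above.
    min_size = max_row_bytes // 2 + 1  # ensures max_row_bytes < 2 * range size
    return max(requested_size, min_size)

# e.g. a requested 1MB range with a known 2MB maximum row:
# choose_byte_range_size(1_000_000, 2_000_000) -> 1_000_001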