TAYmit opened 3 weeks ago
I tried to ensure that my data is in UTF-8 format.
import tqdm
from streaming import MDSWriter

# Local or remote directory path to store the output compressed files.
out_root = '/data/shards/'
columns = {
    'number': 'int',
    'texts': 'str',
}
# Compression algorithm to use for the dataset.
compression = 'zstd:12'
# Hashing algorithms to use for the dataset.
hashes = ['sha1', 'xxh3_64']
# Shard size limit, in bytes.
size_limit = 7 << 30  # 7 GiB per shard

print(f'Saving dataset (to {out_root})...')
infile = open('data.txt', encoding='utf-8')
with MDSWriter(out=out_root, columns=columns, compression=compression,
               hashes=hashes, size_limit=size_limit) as out:
    for i, text in enumerate(tqdm.tqdm(infile)):
        sample = {'number': i, 'texts': text.encode('utf-8').strip()}
        out.write(sample)
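As a sanity check on the raw input (independent of MDSWriter), the file can be scanned in binary mode so that any line failing to decode as UTF-8 is reported rather than aborting the scan; a minimal sketch:

bad_lines = []
with open('data.txt', 'rb') as f:
    # Read raw bytes so a decode failure can be caught per line.
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            bad_lines.append((lineno, err))
print(f'{len(bad_lines)} lines failed to decode as UTF-8')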
But still, after around 6,000 batches, I get the following error:
'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
Not sure how to proceed from here.
===============================================================================
Update
I spent a lot of time inspecting my dataset and confirmed that it is valid UTF-8. I then loaded the exact line where the error occurred and found that it is indeed valid UTF-8, yet StreamingDataset still threw the error.

I suspected that size_limit might be causing the issue (either directly or indirectly), so I reduced it when writing with MDSWriter (i.e., size_limit = 4 << 30 rather than 7 << 30). After making this change, StreamingDataset worked fine.

For anyone encountering a similar issue, try reducing size_limit when creating a sharded dataset with MDSWriter.

Summary: unexpected Unicode error while reading with StreamingDataset; no issue found in the raw dataset. Solved the problem by re-creating the sharded dataset with size_limit = 4 << 30 rather than 7 << 30.
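To double-check the fix, the re-created shards can be read back end to end; a minimal sketch, assuming the streaming library's StreamingDataset pointed at the same local out_root:

import tqdm
from streaming import StreamingDataset

# Iterate over every sample once; a bad shard would surface the
# UnicodeDecodeError again while the samples are decoded.
dataset = StreamingDataset(local='/data/shards/', shuffle=False)
for _ in tqdm.tqdm(dataset, total=len(dataset)):
    pass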
Closing this.
Hey @TAYmit, that seems indicative of a bug on our side. Would it be possible for you to share a shard file or a small repro of this behavior with us?
Hello,
The dataset is quite large (around 200GB in total), and I haven't yet tested it with a smaller dataset. I'll try working with a smaller subset (ideally under 10GB) over the next few days to see if I can reproduce the issue.
After training for approximately 30,000 batches with streaming, I encountered this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 2: invalid start byte
The dataset is around 200 GB. Is there an efficient way to debug the dataset, or any try-catch approach I could use within the streaming process to handle this error?
Thanks!
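One way to localize the offending records without stopping the whole run is to fall back to per-sample random access and catch the decode error; a minimal sketch, assuming StreamingDataset supports len() and integer indexing and that the shards are available locally at /data/shards/:

from streaming import StreamingDataset

dataset = StreamingDataset(local='/data/shards/', shuffle=False)

# Probe each sample individually and record the indices that fail,
# instead of letting one bad record stop the whole pass.
bad_indices = []
for i in range(len(dataset)):
    try:
        dataset[i]  # decoding the stored fields happens here
    except UnicodeDecodeError as err:
        bad_indices.append(i)
        print(f'sample {i}: {err}')
print(f'{len(bad_indices)} undecodable samples: {bad_indices[:20]}')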