mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

UnicodeDecodeError: ... Efficient way to debug the dataset with streaming? #820

Open TAYmit opened 3 weeks ago

TAYmit commented 3 weeks ago

After training with approximately 30,000 batches in streaming, I encountered this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 2: invalid start byte

The dataset is around 200 GB. Is there an efficient way to debug the dataset, or any try-catch approach I could use within the streaming process to handle this error?
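
Something like this is what I have in mind (a rough, untested sketch; the local and remote paths are placeholders, and I'm assuming StreamingDataset allows random access by sample index):

  from streaming import StreamingDataset

  # Walk the dataset by index, logging every sample that fails to decode
  # instead of crashing mid-training.
  dataset = StreamingDataset(local='/tmp/cache', remote='/data/shards/', shuffle=False)

  bad = []
  for i in range(dataset.num_samples):
      try:
          dataset[i]
      except UnicodeDecodeError as e:
          bad.append(i)
          print(f'sample {i}: {e}')
  print(f'{len(bad)} undecodable samples out of {dataset.num_samples}')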

Thanks!

TAYmit commented 2 weeks ago

I tried to ensure that my data is in UTF-8 format:


  import tqdm
  from streaming import MDSWriter

  # Local or remote directory path to store the output compressed files.
  out_root = "/data/shards/"

  columns = {
      'number': 'int',
      'texts': 'str',
  }

  # Compression algorithm to use for the dataset.
  compression = 'zstd:12'

  # Hashing algorithms to use for the dataset.
  hashes = ['sha1', 'xxh3_64']

  # Shard size limit, in bytes.
  size_limit = 7 << 30  # 7 GB shards

  print(f'Saving dataset (to {out_root})...')
  infile = open('data.txt', encoding='utf-8')

  with MDSWriter(out=out_root, columns=columns, compression=compression,
                 hashes=hashes, size_limit=size_limit) as out:
      # The 'str' column expects a Python str, so write the stripped line
      # directly rather than encoding it to bytes first.
      for i, text in enumerate(tqdm.tqdm(infile)):
          out.write({'number': i, 'texts': text.strip()})
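
For reference, a check along these lines can verify that the raw file itself decodes cleanly (a quick sketch, with 'data.txt' as above):

  # Read the raw file as bytes and report any line that is not valid UTF-8.
  with open('data.txt', 'rb') as f:
      for lineno, raw in enumerate(f, 1):
          try:
              raw.decode('utf-8')
          except UnicodeDecodeError as e:
              print(f'line {lineno}: {e}')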

But still, after about 6,000 batches, I get the following error:

'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte

Not sure how to proceed from here.
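
One idea for narrowing it down: map the failing sample index back to the shard file that holds it via index.json (a sketch, assuming the standard MDS index layout in which each entry under 'shards' records its 'samples' count and a 'raw_data' basename):

  import json

  # Walk the MDS index and locate the shard containing a given global sample index.
  with open('/data/shards/index.json') as f:
      index = json.load(f)

  def shard_of(sample_idx):
      offset = 0
      for shard in index['shards']:
          if sample_idx < offset + shard['samples']:
              return shard['raw_data']['basename'], sample_idx - offset
          offset += shard['samples']
      raise IndexError(sample_idx)

  print(shard_of(12345))  # hypothetical failing index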

Update:

I spent a lot of time inspecting my dataset and confirmed that it is valid UTF-8. I then tried loading the exact line where the error occurred and found that it is indeed valid UTF-8, but for some reason StreamingDataset still threw an error.

I suspected that the size_limit might be causing an issue (either directly or indirectly), so I reduced it in the MDSWriter call (i.e., size_limit = 4 << 30 rather than 7 << 30). After making this change, StreamingDataset worked fine.

For anyone encountering a similar issue, try reducing the size_limit when creating a sharded dataset with MDSWriter.
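
Concretely, the only change relative to the writer code above (a sketch; out_root, columns, compression, and hashes as before):

  size_limit = 4 << 30  # 4 GB shard limit instead of 7 GB

  with MDSWriter(out=out_root, columns=columns, compression=compression,
                 hashes=hashes, size_limit=size_limit) as out:
      for i, text in enumerate(tqdm.tqdm(open('data.txt', encoding='utf-8'))):
          out.write({'number': i, 'texts': text.strip()})

(Possibly relevant: 4 << 30 is exactly 2^32 bytes, so shards larger than that may be tripping over some 32-bit limit, but that is speculation on my part.)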

Summary:

Unexpected Unicode decode error while reading with StreamingDataset. No issue found in the raw dataset. Solved the problem by re-creating the sharded dataset with size_limit = 4 << 30 rather than 7 << 30.

Closing this.

snarayan21 commented 2 weeks ago

Hey @TAYmit, that seems indicative of a bug on our side. Would it be possible for you to share a shard file or a small repro of this behavior with us?

TAYmit commented 2 weeks ago

Hello,

The dataset is quite large (around 200 GB in total), and I haven't yet tested with a smaller dataset. I'll try working with a smaller subset (ideally under 10 GB) over the next few days to see if I can reproduce the issue.
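
The plan for that test, roughly (a sketch; same writer settings as the original run, with a hypothetical ~10 GB cutoff so at least one shard still exceeds 4 GB):

  from streaming import MDSWriter, StreamingDataset

  out_root = '/data/shards_small/'
  columns = {'number': 'int', 'texts': 'str'}

  # Write the first ~10 GB of lines with the original 7 GB shard limit...
  with MDSWriter(out=out_root, columns=columns, compression='zstd:12',
                 hashes=['sha1', 'xxh3_64'], size_limit=7 << 30) as out:
      written = 0
      for i, text in enumerate(open('data.txt', encoding='utf-8')):
          out.write({'number': i, 'texts': text.strip()})
          written += len(text)  # approximate: counts characters, not bytes
          if written > 10 << 30:
              break

  # ...then read everything back to see whether the decode error reproduces.
  dataset = StreamingDataset(local=out_root, shuffle=False)
  for i in range(dataset.num_samples):
      try:
          dataset[i]
      except UnicodeDecodeError as e:
          print(f'sample {i}: {e}')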