pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Changing decoding method in StreamReader #693

Open is-jlehrer opened 2 years ago

is-jlehrer commented 2 years ago

🐛 Describe the bug

Hi,

When decoding from a file stream in StreamReader, torchdata automatically assumes the incoming bytes are UTF-8. With any other encoding this raises an error (in my case, UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 3: invalid continuation byte). How do we change the decoding method to fit the particular data stream?

Versions

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.0
[pip3] pytorch-lightning==1.6.4
[pip3] torch==1.11.0
[pip3] torchdata==0.3.0
[pip3] torchmetrics==0.9.1
[pip3] torchvision==0.12.0
[conda] numpy                     1.23.0                   pypi_0    pypi
[conda] pytorch-lightning         1.6.4                    pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchdata                 0.3.0                    pypi_0    pypi
[conda] torchmetrics              0.9.1                    pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi

is-jlehrer commented 2 years ago

To be more specific, is there no way to read from StreamReader as bytes?

ejguan commented 2 years ago

It depends on how you open your file, rather than on StreamReader. If you use FileOpener (functional API: open_files), you can set the mode to b to open the file in binary mode, so the stream is read as bytes and nothing is decoded as UTF-8 for you.
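
To illustrate the idea, here is a minimal sketch that uses only the Python standard library rather than torchdata itself. It opens a file in binary mode and decodes the raw bytes with an explicitly chosen codec. The helper name read_decoded, the sample text, and the choice of latin-1 are assumptions made for the example; the actual encoding of the data in the original report is not known. With torchdata, the analogous step would presumably be FileOpener(..., mode="b") followed by StreamReader, after which each item's bytes can be decoded the same way; that part is not exercised here.

```python
import os
import tempfile


def read_decoded(path, encoding="latin-1", errors="strict"):
    """Open `path` in binary mode and decode the raw bytes explicitly.

    Hypothetical helper for illustration; not part of torchdata.
    """
    with open(path, "rb") as stream:  # binary mode: no implicit text decoding
        raw = stream.read()           # `raw` is a bytes object
    return raw.decode(encoding, errors=errors)


# Sample data: byte 0xec followed by a space is not valid UTF-8,
# but it is a valid latin-1 sequence.
payload = "caf\u00ec au lait".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(payload)

    # Decoding as UTF-8 fails on this payload.
    try:
        read_decoded(path, encoding="utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Decoding with the codec the data was written in succeeds.
    text = read_decoded(path, encoding="latin-1")
```

The point of reading bytes first is that the choice of codec (and of an error policy such as errors="replace") stays in your own code instead of being fixed by whatever opened the file.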