turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0
Automatically replace NUL (\x00) in CSV #273

Open turicas opened 6 years ago

turicas commented 6 years ago

Some CSV files come with NUL chars (\x00) inside, and the Python csv module doesn't know how to deal with them. So I think it's a great idea to have automatic NUL removal in the CSV plugin. An io.TextIOWrapper subclass will do the job, like this one:

import io


class NotNullTextWrapper(io.TextIOWrapper):
    """Text wrapper that strips NUL characters from the data it reads."""

    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        return data.replace('\x00', '')

    def readline(self, *args, **kwargs):
        data = super().readline(*args, **kwargs)
        return data.replace('\x00', '')

Sample file with this problem: http://arquivos.portaldatransparencia.gov.br/downloads.asp?a=2011&m=01&consulta=GastosDiretos

Exception raised: _csv.Error: line contains NULL byte
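As a rough illustration of how such a wrapper could be fed to the csv module, here is a minimal sketch. The sample bytes are made up for the example (the original file is not reproduced here), and passing `iter(wrapper.readline, "")` to `csv.reader` is just one way to make sure the overridden `readline` is the method being called; it is not necessarily how the rows plugin wires it up.

```python
import csv
import io


class NotNullTextWrapper(io.TextIOWrapper):
    """Text wrapper that strips NUL characters from the data it reads."""

    def read(self, *args, **kwargs):
        return super().read(*args, **kwargs).replace("\x00", "")

    def readline(self, *args, **kwargs):
        return super().readline(*args, **kwargs).replace("\x00", "")


# Hypothetical sample: ISO-8859-15 bytes with stray NUL bytes inside the data.
raw = "name,city\nJo\x00ão,S\x00ão Paulo\n".encode("iso-8859-15")

# newline="" leaves line endings untouched, as the csv module recommends.
wrapper = NotNullTextWrapper(io.BytesIO(raw), encoding="iso-8859-15", newline="")

# iter(callable, sentinel) keeps calling readline() until it returns "".
rows = list(csv.reader(iter(wrapper.readline, "")))
print(rows)
```

Because the cleaning happens line by line, the whole file never has to be held in memory.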

turicas commented 6 years ago

Fixed in d43be1dce2d4a64973fe4cae03a745fba7e6577e.

turicas commented 6 years ago

Reopening because of this error: AttributeError: 'file' object has no attribute 'readable' (I think it's related to Python 2). Maybe this thread helps.
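The error is consistent with Python 2's built-in open() returning a legacy `file` object, which lacks the readable() method that io.TextIOWrapper calls on its underlying buffer. A minimal sketch of one possible workaround, assuming the caller controls how the file is opened: io.open returns an io.BufferedReader in binary mode on both Python 2 and 3. The helper name `open_without_nul` and the temporary file are hypothetical, and this is not the fix that was attempted in the project branch.

```python
import io
import os
import tempfile


def open_without_nul(path, encoding):
    """Read a text file through io.open and return its content without NULs.

    io.open(path, "rb") yields an io.BufferedReader, which provides the
    readable() method that io.TextIOWrapper requires.
    """
    with io.open(path, "rb") as binary:
        text = io.TextIOWrapper(binary, encoding=encoding, newline="")
        return text.read().replace(u"\x00", u"")


# Hypothetical sample file containing one NUL byte inside a field.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"a,b\n1\x00,2\n")

try:
    content = open_without_nul(path, "iso-8859-15")
finally:
    os.remove(path)

print(repr(content))
```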

turicas commented 6 years ago

Reverted the merged change from #276 since it causes problems on Python 2. Trying to fix the problem in a new branch: feature/csv-remove-null-bytes.

mawkee commented 5 years ago

The file is no longer accessible, but it seems you're dealing with a UTF-16 encoded file. Try using:

b = open("file.csv", "rb").read().decode("utf-16")

turicas commented 5 years ago

@mawkee it was not a UTF-16-encoded file: it was encoded in ISO-8859-15 but had \x00 bytes inside the data, and it didn't even have a BOM.
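To make the distinction concrete, here is a small sketch that checks for a UTF-16 byte order mark before deciding how to treat the NUL bytes. The helper name `has_utf16_bom` and the sample data are hypothetical. A missing BOM does not prove a file is not UTF-16, so this is only a heuristic.

```python
import codecs


def has_utf16_bom(data):
    """Return True if the bytes start with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# Hypothetical samples: ISO-8859-15 data with a stray NUL, and real UTF-16.
latin = "preço\x00,10\n".encode("iso-8859-15")
utf16 = "preço,10\n".encode("utf-16")  # the "utf-16" codec prepends a BOM

for data in (latin, utf16):
    if has_utf16_bom(data):
        text = data.decode("utf-16")
    else:
        # No BOM: assume ISO-8859-15 and drop the NUL bytes before decoding.
        text = data.replace(b"\x00", b"").decode("iso-8859-15")
    print(repr(text))
```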

seocam commented 5 years ago

Ours didn't seem to have it either, but if you open it with "rb" and then decode it as utf-16, it magically works.

mawkee commented 5 years ago

@turicas got it; I tried opening the data using ftfy and it worked all right for my case.

fanden1337 commented 4 years ago

Thanks for posting the code. Was also useful outside of this project.