turicas / rows

A common, beautiful interface to tabular data, no matter the format

Set field_size_limit to max long int size on csv plugin #352

Open humrochagf opened 4 years ago

humrochagf commented 4 years ago

This pull request addresses issue #306 and fixes it for all platforms without falling back to a hardcoded value.

The overflow was happening because Python's CSV implementation uses a C long to store the maximum limit (https://github.com/python/cpython/blob/c88239f864a27f673c0f0a9e62d2488563f9d081/Modules/_csv.c#L21).

The previous implementation, based on sys.maxsize, has a drawback on 64-bit Windows: its LLP64 data model keeps long as a 32-bit integer even though pointers are 64-bit wide, to preserve type compatibility (see more at https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models).
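
A quick way to see the mismatch is to compare sys.maxsize with the actual size of a C long (illustrative values; the second line prints 4 on 64-bit Windows but 8 on most 64-bit Unix systems):

```python
import ctypes
import sys

# On LLP64 (64-bit Windows), pointers are 64-bit but a C long stays 32-bit,
# so sys.maxsize (2**63 - 1) no longer fits in a C long.
print(sys.maxsize)                   # 9223372036854775807 on any 64-bit build
print(ctypes.sizeof(ctypes.c_long))  # 4 on 64-bit Windows, 8 on most 64-bit Unixes
```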

So, to get a reliable maximum long value on the target system, you can use ctypes.c_ulong(-1), which wraps around to the maximum unsigned long value; dividing that by 2 gives the maximum signed long on the target platform.
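
A minimal sketch of that computation (standalone, not the exact patch; it sets the limit on the standard csv module for illustration):

```python
import csv
import ctypes

# c_ulong(-1) wraps around to ULONG_MAX on the target platform.
max_unsigned_long = ctypes.c_ulong(-1).value

# Halving ULONG_MAX gives LONG_MAX, the largest value the C long used by
# the csv module's field_size_limit can hold without overflowing.
max_long = max_unsigned_long // 2

csv.field_size_limit(max_long)
```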

Regarding memory impact, the long that stores the limit is already allocated either way and is only used to reject fields larger than the limit; the CSV data itself is allocated on demand.

I ran the memory profiler on a simple load of a big file from brasil.io (https://brasil.io/dataset/genero-nomes/nomes/) and got the same values after the CSV import operation:

Filename: test.py (with unicodecsv.field_size_limit commented to use default)

Line #    Mem usage    Increment   Line Contents
================================================
     4     38.1 MiB     38.1 MiB   @profile
     5                             def import_file():
     6     87.3 MiB     49.2 MiB       import rows
     7
     8    179.0 MiB     91.7 MiB       data = rows.import_from_csv('./nomes.csv')
     9
    10    179.0 MiB      0.0 MiB       for field in data:
    11    179.0 MiB      0.0 MiB           print(field)

versus

Filename: test.py (using max long size)

Line #    Mem usage    Increment   Line Contents
================================================
     4     38.1 MiB     38.1 MiB   @profile
     5                             def import_file():
     6     87.3 MiB     49.2 MiB       import rows
     7
     8    179.0 MiB     91.7 MiB       data = rows.import_from_csv('./nomes.csv')
     9
    10    179.0 MiB      0.0 MiB       for field in data:
    11    179.0 MiB      0.0 MiB           print(field)
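
For reference, here is a rough sketch of the profiled script, reconstructed from the output above (the commented lines show where the limit was raised in the second run; `nomes.csv` is the file downloaded from brasil.io):

```python
from memory_profiler import profile

# Second run only: raise the limit to the platform's max long before importing.
# import ctypes
# import unicodecsv
# unicodecsv.field_size_limit(ctypes.c_ulong(-1).value // 2)

@profile
def import_file():
    import rows

    data = rows.import_from_csv('./nomes.csv')

    for field in data:
        print(field)

if __name__ == '__main__':
    import_file()
```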