**Describe the bug**

Tokenised transformers dataset object `csv` files are truncated if the sequence is too long.

**To Reproduce**

Please provide a minimal reproducible example with all steps to reproduce the behaviour before submitting an issue:

Fields `input_tokens`, `token_type_ids`, and `attention_mask` are truncated if the feature is too long. This is true for the output `csv` file only.
```
# sample run on arbitrary file with very long item
create_dataset_bio <infile_path_1> <infile_path_2> <tokeniser>

# sample output csv file
some_seq,<very very long sequence>,1,"[10 ... 20]","[0 ... 0]","[1 ... 1]"
```
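The same truncation can be reproduced directly with `pandas` and `numpy` (a minimal sketch; the column name and output file name here are illustrative):

```python
import numpy as np
import pandas as pd

# A cell holding a long numpy array is stringified by to_csv via str(),
# and numpy elides arrays longer than its print threshold (default 1000).
df = pd.DataFrame({"attention_mask": [np.ones(5000, dtype=np.int64)]})
df.to_csv("out.csv", index=False)

print(open("out.csv").read())  # the array cell is written as "[1 1 1 ... 1 1 1]"
```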
Please make sure to include environment info, including Python and dependency versions. You can access this with `pip freeze` or `conda list` as needed.
**Expected behavior**

A clear and concise description of what you expected to happen.
`csv` files should not have truncated array values; a quick check is sketched below.
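As a hypothetical acceptance check (assuming the `out.csv` written in the sketch above), numpy's `...` elision marker should never appear in a written field:

```python
import csv

# Hypothetical check: every array field should be written out in full,
# so numpy's "..." elision marker must not appear in any CSV field.
with open("out.csv", newline="") as f:
    for row in csv.reader(f):
        assert not any("..." in field for field in row), "truncated array value"
```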
**Suggested fix**

If known.

Temporary fix: use `parquet` and `json` files as input for training since these are unaffected.

Long-term fix: increase the array size limit for printing in `pandas` and/or `numpy`. A sketch of both workarounds follows.
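Both workarounds, sketched against the same illustrative DataFrame as above (`sys.maxsize` effectively disables numpy's elision):

```python
import sys
import numpy as np
import pandas as pd

df = pd.DataFrame({"attention_mask": [np.ones(5000, dtype=np.int64)]})

# Temporary fix: parquet/json serialise the full array, no string elision involved.
df.to_parquet("out.parquet")  # requires pyarrow or fastparquet
df.to_json("out.json", orient="records")

# Long-term fix (sketch): raise numpy's print threshold so str(array)
# renders every element before to_csv stringifies the cell.
np.set_printoptions(threshold=sys.maxsize)
df.to_csv("out_full.csv", index=False)
```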
**Screenshots**

If applicable, add screenshots to help explain your problem.
Not applicable.