tyronechen / genomenlp

https://genomenlp.readthedocs.io/en/latest/
MIT License

[BUG] Tokenised data written is truncated for too long sequences (affects csv only) #1

Closed tyronechen closed 1 year ago

tyronechen commented 1 year ago

Describe the bug

When a tokenised transformers dataset object is written to csv, array fields are truncated if the sequence is too long.

To Reproduce

The fields input_tokens, token_type_ids and attention_mask are truncated if the feature is too long: the middle of each array is replaced by an ellipsis. This affects the output csv file only.

# sample run on arbitrary file with very long item
create_dataset_bio <infile_path_1> <infile_path_2> <tokeniser>
# sample output csv file
some_seq,<very very long sequence>,1,"[10 ...   20]","[0 ... 0]","[1 ... 1]"
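The abbreviated arrays in the sample output are consistent with how numpy renders a long array as text. The sketch below is only an illustration of that behaviour, not the genomenlp code path: `input_ids` is a hypothetical stand-in for a tokenised feature, and the print options are set explicitly to numpy's documented defaults.

```python
import numpy as np

# Hypothetical stand-in for a long tokenised feature (not real genomenlp output).
input_ids = np.arange(5000)

# threshold=1000 and edgeitems=3 are numpy's documented defaults; they are set
# explicitly here so the result does not depend on global state.
with np.printoptions(threshold=1000, edgeitems=3):
    as_text = str(input_ids)

print(as_text)

# Only the edge items survive the conversion to text.
recovered = as_text.strip("[]").replace("...", " ").split()
print(len(recovered), "of", input_ids.size, "values recoverable")
```

Any writer that stores the text form of such an array, as a csv writer must, would persist the ellipsis instead of the full data.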

Environment info (conda list):

# this was installed with conda install -c tyronechen ziran
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
_py-xgboost-mutex         2.0                       cpu_0    conda-forge
abseil-cpp                20210324.2           h9c3ff4c_0    conda-forge
aiohttp                   3.8.4            py39h72bdee0_0    conda-forge
aiohttp-cors              0.7.0                      py_0    conda-forge
aioredis                  1.3.1                      py_0    conda-forge
aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
alsa-lib                  1.2.8                h166bdaf_0    conda-forge
arrow-cpp                 8.0.0           py39heccc63a_1_cpu    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
attr                      2.5.1                h166bdaf_1    conda-forge
attrs                     22.2.0             pyh71513ae_0    conda-forge
aws-c-cal                 0.5.11               h95a6274_0    conda-forge
aws-c-common              0.6.2                h7f98852_0    conda-forge
aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
aws-c-io                  0.10.5               hfb6a706_0    conda-forge
aws-checksums             0.1.11               ha31a3da_7    conda-forge
aws-sdk-cpp               1.8.186              hecaee15_4    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                pyhd8ed1ab_3    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
blessed                   1.19.1             pyhe4f9e05_2    conda-forge
brotli                    1.0.9                h166bdaf_8    conda-forge
brotli-bin                1.0.9                h166bdaf_8    conda-forge
brotlipy                  0.7.0           py39hb9d737c_1005    conda-forge
bz2file                   0.98                       py_0    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
cairo                     1.16.0            ha61ee94_1014    conda-forge
captum                    0.6.0              pyhd8ed1ab_0    conda-forge
certifi                   2022.12.7          pyhd8ed1ab_0    conda-forge
cffi                      1.15.1           py39he91dace_3    conda-forge
charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
click                     8.0.4            py39hf3d152e_0    conda-forge
cloudpickle               2.2.1              pyhd8ed1ab_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
colorful                  0.5.4              pyhd8ed1ab_0    conda-forge
cryptography              39.0.0           py39hd598818_0    conda-forge
cudatoolkit               11.8.0              h37601d7_11    conda-forge
cudnn                     8.4.1.50             hed8a83a_0    conda-forge
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
dataclasses               0.8                pyhc8e2a94_3    conda-forge
datasets                  2.10.1             pyhd8ed1ab_0    conda-forge
dbus                      1.13.6               h5008d03_3    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
dill                      0.3.6              pyhd8ed1ab_1    conda-forge
distlib                   0.3.6              pyhd8ed1ab_0    conda-forge
docker-pycreds            0.4.0                      py_0    conda-forge
expat                     2.5.0                h27087fc_0    conda-forge
fftw                      3.3.10          nompi_hf0379b8_106    conda-forge
filelock                  3.10.0             pyhd8ed1ab_0    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.14.2               h14ed4e7_0    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
fonttools                 4.39.2           py39h72bdee0_0    conda-forge
freetype                  2.12.1               hca18f0e_1    conda-forge
frozenlist                1.3.3            py39hb9d737c_0    conda-forge
fsspec                    2023.3.0           pyhd8ed1ab_1    conda-forge
future                    0.18.3             pyhd8ed1ab_0    conda-forge
gensim                    4.2.0            py39h1832856_0    conda-forge
gettext                   0.21.1               h27087fc_0    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
gitdb                     4.0.10             pyhd8ed1ab_0    conda-forge
gitpython                 3.1.31             pyhd8ed1ab_0    conda-forge
glib                      2.74.1               h6239696_1    conda-forge
glib-tools                2.74.1               h6239696_1    conda-forge
glog                      0.6.0                h6f12383_0    conda-forge
google-api-core           2.10.0             pyhd8ed1ab_0    conda-forge
google-auth               2.16.2             pyh1a96a4e_0    conda-forge
googleapis-common-protos  1.57.0           py39hf3d152e_0    conda-forge
gpustat                   1.0.0              pyhd8ed1ab_0    conda-forge
graphite2                 1.3.13            h58526e2_1001    conda-forge
grpc-cpp                  1.43.2               h9e046d8_3    conda-forge
grpcio                    1.43.0           py39hff7568b_0    conda-forge
gst-plugins-base          1.21.3               h4243ec0_1    conda-forge
gstreamer                 1.21.3               h25f0c4b_1    conda-forge
gstreamer-orc             0.4.33               h166bdaf_0    conda-forge
harfbuzz                  6.0.0                h8e241bc_0    conda-forge
hiredis                   2.0.0            py39hb9d737c_3    conda-forge
huggingface_hub           0.13.2             pyhd8ed1ab_0    conda-forge
hyperopt                  0.2.7              pyhd8ed1ab_0    conda-forge
icu                       70.1                 h27087fc_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
importlib-metadata        6.0.0              pyha770c72_0    conda-forge
importlib_metadata        6.0.0                hd8ed1ab_0    conda-forge
importlib_resources       5.12.0             pyhd8ed1ab_0    conda-forge
ipython                   7.33.0           py39hf3d152e_0    conda-forge
jack                      1.9.22               h11f4161_0    conda-forge
jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
jpeg                      9e                   h0b41bf4_3    conda-forge
jsonschema                4.17.3             pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.4            py39hf939315_1    conda-forge
krb5                      1.20.1               hf9c8cef_0    conda-forge
lame                      3.100             h166bdaf_1003    conda-forge
lcms2                     2.15                 hfd0df8a_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libblas                   3.9.0            12_linux64_mkl    conda-forge
libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
libbrotlidec              1.0.9                h166bdaf_8    conda-forge
libbrotlienc              1.0.9                h166bdaf_8    conda-forge
libcap                    2.66                 ha37c62d_0    conda-forge
libcblas                  3.9.0            12_linux64_mkl    conda-forge
libclang                  15.0.7          default_had23c3d_1    conda-forge
libclang13                15.0.7          default_h3e3d535_1    conda-forge
libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
libcups                   2.3.3                h36d4200_3    conda-forge
libcurl                   7.87.0               h6312ad2_0    conda-forge
libdb                     6.2.32               h9c3ff4c_0    conda-forge
libdeflate                1.17                 h0b41bf4_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               h9b69904_4    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libflac                   1.4.2                h27087fc_0    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgcrypt                 1.10.1               h166bdaf_0    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libglib                   2.74.1               h606061b_1    conda-forge
libgoogle-cloud           1.36.0               h6945097_0    conda-forge
libgpg-error              1.46                 h620e276_0    conda-forge
libhwloc                  2.9.0                hd6dc26d_0    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
liblapack                 3.9.0            12_linux64_mkl    conda-forge
libllvm15                 15.0.7               hadd5161_1    conda-forge
libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libogg                    1.3.4                h7f98852_1    conda-forge
libopus                   1.3.1                h7f98852_1    conda-forge
libpng                    1.6.39               h753d276_0    conda-forge
libpq                     15.1                 h2baec63_3    conda-forge
libprotobuf               3.19.4               h780b84a_0    conda-forge
libsndfile                1.2.0                hb75c966_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libsystemd0               252                  h2a991cd_0    conda-forge
libthrift                 0.16.0               h491838f_2    conda-forge
libtiff                   4.5.0                h6adf6a1_2    conda-forge
libtool                   2.4.7                h27087fc_0    conda-forge
libudev1                  253                  h0b41bf4_0    conda-forge
libunwind                 1.6.2                h9c3ff4c_0    conda-forge
libutf8proc               2.8.0                h166bdaf_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libvorbis                 1.3.7                h9c3ff4c_0    conda-forge
libwebp-base              1.3.0                h0b41bf4_0    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libxgboost                1.7.1            cpu_ha3b9936_0    conda-forge
libxkbcommon              1.5.0                h79f4944_1    conda-forge
libxml2                   2.10.3               hca2bb57_3    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvm-openmp               15.0.7               h0cdce71_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
magma                     2.5.4                hc72dce7_4    conda-forge
matplotlib                3.5.2            py39hf3d152e_1    conda-forge
matplotlib-base           3.5.2            py39h700656a_1    conda-forge
matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
mkl                       2021.4.0           h8d4b97c_729    conda-forge
mpg123                    1.31.2               hcb278e6_0    conda-forge
mpmath                    1.3.0              pyhd8ed1ab_0    conda-forge
msgpack-python            1.0.5            py39h4b4f3f3_0    conda-forge
multidict                 6.0.4            py39h72bdee0_0    conda-forge
multiprocess              0.70.14          py39hb9d737c_3    conda-forge
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
mysql-common              8.0.32               h14678bc_0    conda-forge
mysql-libs                8.0.32               h54cf53e_0    conda-forge
nccl                      2.14.3.1             h0800d71_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
networkx                  3.0                pyhd8ed1ab_0    conda-forge
ninja                     1.11.1               h924138e_0    conda-forge
nipals                    0.5.5                    pypi_0    pypi
nspr                      4.35                 h27087fc_0    conda-forge
nss                       3.89                 he45b914_0    conda-forge
numpy                     1.24.2           py39h7360e5f_0    conda-forge
nvidia-ml-py              11.495.46          pyhd8ed1ab_0    conda-forge
opencensus                0.11.2             pyhd8ed1ab_0    conda-forge
opencensus-context        0.1.3            py39hf3d152e_1    conda-forge
openjpeg                  2.5.0                hfec8fc6_2    conda-forge
openssl                   1.1.1t               h0b41bf4_0    conda-forge
orc                       1.7.3                h1be678f_0    conda-forge
packaging                 23.0               pyhd8ed1ab_0    conda-forge
pandas                    1.4.2            py39h1832856_2    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
pathtools                 0.1.2                      py_1    conda-forge
patsy                     0.5.3              pyhd8ed1ab_0    conda-forge
pcre2                     10.40                hc3806b6_0    conda-forge
pexpect                   4.8.0              pyh1a96a4e_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    9.4.0            py39h2320bf1_1    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
pkgutil-resolve-name      1.3.10             pyhd8ed1ab_0    conda-forge
platformdirs              3.1.1              pyhd8ed1ab_0    conda-forge
ply                       3.11                       py_1    conda-forge
pooch                     1.7.0              pyhd8ed1ab_0    conda-forge
powerlaw                  1.4.6              pyh9f0ad1d_1    conda-forge
prometheus_client         0.13.1             pyhd8ed1ab_0    conda-forge
promise                   2.3              py39hf3d152e_7    conda-forge
prompt-toolkit            3.0.38             pyha770c72_0    conda-forge
protobuf                  3.19.4           py39he80948d_0    conda-forge
psutil                    5.9.4            py39hb9d737c_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pulseaudio                16.1                 h4ab2085_1    conda-forge
py-spy                    0.3.14               h87a5ac0_0    conda-forge
py-xgboost                1.7.1           cpu_py39h4655687_0    conda-forge
py4j                      0.10.9.7           pyhd8ed1ab_0    conda-forge
pyarrow                   8.0.0           py39h42d110c_1_cpu    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.7                      py_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pygments                  2.14.0             pyhd8ed1ab_0    conda-forge
pyopenssl                 23.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pyqt                      5.15.7           py39h5c7b992_3    conda-forge
pyqt5-sip                 12.11.0          py39h227be39_3    conda-forge
pyrsistent                0.19.3           py39h72bdee0_0    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.9.15          h47a2c10_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-xxhash             3.2.0            py39h72bdee0_0    conda-forge
python_abi                3.9                      3_cp39    conda-forge
pytorch                   1.10.0          cuda112py39h3ad47f5_1    conda-forge
pytz                      2022.7.1           pyhd8ed1ab_0    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
pyyaml                    6.0              py39hb9d737c_5    conda-forge
qt-main                   5.15.6               h18908ee_6    conda-forge
ray-core                  1.13.0           py39hecbb631_2    conda-forge
ray-default               1.13.0           py39hf3d152e_2    conda-forge
re2                       2022.02.01           h9c3ff4c_0    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
regex                     2022.10.31       py39hb9d737c_0    conda-forge
requests                  2.28.2             pyhd8ed1ab_0    conda-forge
responses                 0.18.0             pyhd8ed1ab_0    conda-forge
rsa                       4.9                pyhd8ed1ab_0    conda-forge
s2n                       1.0.10               h9b69904_0    conda-forge
sacremoses                0.0.53             pyhd8ed1ab_0    conda-forge
scikit-learn              1.1.1            py39h4037b75_0    conda-forge
scipy                     1.10.1           py39h7360e5f_0    conda-forge
screed                    1.0.5              pyhd8ed1ab_1    conda-forge
seaborn                   0.11.2               hd8ed1ab_0    conda-forge
seaborn-base              0.11.2             pyhd8ed1ab_0    conda-forge
sentencepiece             0.1.96           py39hf939315_1    conda-forge
sentry-sdk                1.17.0             pyhd8ed1ab_0    conda-forge
setproctitle              1.2.2            py39hb9d737c_2    conda-forge
setuptools                67.6.0             pyhd8ed1ab_0    conda-forge
shortuuid                 1.0.11             pyhd8ed1ab_0    conda-forge
sip                       6.7.7            py39h227be39_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sleef                     3.5.1                h9b69904_2    conda-forge
smart_open                6.3.0              pyhd8ed1ab_1    conda-forge
smmap                     3.0.5              pyh44b312d_0    conda-forge
snappy                    1.1.10               h9fff704_0    conda-forge
statsmodels               0.13.5           py39h2ae25f5_2    conda-forge
tabulate                  0.9.0              pyhd8ed1ab_1    conda-forge
tbb                       2021.8.0             hf52228f_0    conda-forge
threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tokenizers                0.12.1           py39h3045328_1    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tornado                   6.2              py39hb9d737c_1    conda-forge
tqdm                      4.64.0             pyhd8ed1ab_0    conda-forge
traitlets                 5.9.0              pyhd8ed1ab_0    conda-forge
transformers              4.23.1             pyhd8ed1ab_0    conda-forge
transformers-interpret    0.8.1              pyhd8ed1ab_0    conda-forge
typing-extensions         4.5.0                hd8ed1ab_0    conda-forge
typing_extensions         4.5.0              pyha770c72_0    conda-forge
tzdata                    2022g                h191b570_0    conda-forge
unicodedata2              15.0.0           py39hb9d737c_0    conda-forge
urllib3                   1.26.15            pyhd8ed1ab_0    conda-forge
virtualenv                20.21.0            pyhd8ed1ab_0    conda-forge
wandb                     0.13.4             pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.6              pyhd8ed1ab_0    conda-forge
weightwatcher             0.6.4                      py_0    tyronechen
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
xcb-util                  0.4.0                h516909a_0    conda-forge
xcb-util-image            0.4.0                h166bdaf_0    conda-forge
xcb-util-keysyms          0.4.0                h516909a_0    conda-forge
xcb-util-renderutil       0.3.9                h166bdaf_0    conda-forge
xcb-util-wm               0.4.1                h516909a_0    conda-forge
xgboost                   1.7.1           cpu_py39h4655687_0    conda-forge
xkeyboard-config          2.38                 h0b41bf4_0    conda-forge
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.0.10               h7f98852_0    conda-forge
xorg-libsm                1.2.3             hd9c2040_1000    conda-forge
xorg-libx11               1.8.4                h0b41bf4_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h0b41bf4_2    conda-forge
xorg-libxrender           0.9.10            h7f98852_1003    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h0b41bf4_1003    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xxhash                    0.8.1                h0b41bf4_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.8.2            py39hb9d737c_0    conda-forge
yellowbrick               1.3.post1          pyhd8ed1ab_1    conda-forge
zipp                      3.15.0             pyhd8ed1ab_0    conda-forge
ziran                     1.0.9                         0    tyronechen
zlib                      1.2.13               h166bdaf_4    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge

Expected behavior

Array values in csv files should be written in full, not truncated.

Suggested fix

Temporary fix: Use parquet and json files as input for training since these are unaffected.

Long term fix: Increase the size limit above which pandas and/or numpy abbreviate arrays when printing them.
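As an alternative that does not depend on print settings at all, each array can be serialised explicitly before it is written. This is a minimal sketch using only the standard library; it is not the approach taken in genomenlp, and `write_rows`, `read_rows` and the three-column layout are hypothetical names chosen for illustration.

```python
import csv
import io
import json


def write_rows(rows, handle):
    """Write (name, label, ids) rows, storing ids as a JSON list in one field."""
    writer = csv.writer(handle)
    for name, label, input_ids in rows:
        # numpy arrays expose tolist(); plain sequences are copied with list().
        ids = input_ids.tolist() if hasattr(input_ids, "tolist") else list(input_ids)
        writer.writerow([name, label, json.dumps(ids)])


def read_rows(handle):
    """Read the rows back, parsing the JSON field into a Python list."""
    return [(name, int(label), json.loads(ids))
            for name, label, ids in csv.reader(handle)]


# Hypothetical long feature: 5000 token ids.
rows = [("some_seq", 1, list(range(5000)))]

buffer = io.StringIO()
write_rows(rows, buffer)
buffer.seek(0)
restored = read_rows(buffer)

print(len(restored[0][2]))  # number of ids recovered from the csv text
```

The JSON field round-trips exactly, whatever the numpy print options are. For much longer sequences, the reader may need a larger `csv.field_size_limit`, since Python's csv module caps the field length by default.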

Screenshots

Not applicable

tyronechen commented 1 year ago

@hannaglad https://github.com/tyronechen/genomenlp/issues/3#issuecomment-1661833067

Added np.set_printoptions(threshold=np.inf) to utils.py, so that numpy no longer abbreviates long arrays when they are converted to text.
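The effect of that setting can be checked in isolation. The sketch below is a standalone illustration, not the contents of utils.py: `input_ids` is a hypothetical feature, and the threshold is reset to numpy's default of 1000 afterwards only to keep the example free of side effects.

```python
import numpy as np

# Hypothetical stand-in for a long tokenised feature.
input_ids = np.arange(5000)

# Same call as the fix: never summarise arrays, whatever their size.
np.set_printoptions(threshold=np.inf)
try:
    full_text = str(input_ids)
finally:
    # Restore numpy's default so the example leaves global state untouched.
    np.set_printoptions(threshold=1000)

values = full_text.strip("[]").split()
print(len(values), "values present,", "ellipsis" if "..." in full_text else "no ellipsis")
```

Note that the full text form still wraps at numpy's line width, so each csv field spans several lines inside its quotes. Code that parses the field back should split on any whitespace, not only on spaces.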