**Describe the bug**

Tokenised transformers dataset object `csv` files are truncated if the sequence is too long.

**To Reproduce**

Please provide a minimal reproducible example with all steps to reproduce the behaviour before submitting an issue:

Fields `input_tokens`, `token_type_ids`, and `attention_mask` are truncated if the feature is too long. This is true for the output `csv` file only.
```
# sample run on arbitrary file with very long item
create_dataset_bio <infile_path_1> <infile_path_2> <tokeniser>

# sample output csv file
some_seq,<very very long sequence>,1,"[10 ... 20]","[0 ... 0]","[1 ... 1]"
```
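The same truncation can be reproduced directly with `pandas` and `numpy` (a minimal sketch; the column name and output file name here are illustrative):

```python
import numpy as np
import pandas as pd

# A cell holding a long numpy array is stringified by to_csv via str(),
# and numpy elides arrays longer than its print threshold (default 1000).
df = pd.DataFrame({"attention_mask": [np.ones(5000, dtype=np.int64)]})
df.to_csv("out.csv", index=False)

print(open("out.csv").read())  # the array cell is written as "[1 1 1 ... 1 1 1]"
```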
Please make sure to include environment info, including Python and dependency versions. You can access this with `pip freeze` or `conda list` as needed.
**Expected behavior**

A clear and concise description of what you expected to happen.
`csv` files should not have truncated array values; a quick check is sketched below.
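As a hypothetical acceptance check (assuming the `out.csv` written in the sketch above), numpy's `...` elision marker should never appear in a written field:

```python
import csv

# Hypothetical check: every array field should be written out in full,
# so numpy's "..." elision marker must not appear in any CSV field.
with open("out.csv", newline="") as f:
    for row in csv.reader(f):
        assert not any("..." in field for field in row), "truncated array value"
```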
**Suggested fix**

If known.

Temporary fix: use `parquet` and `json` files as input for training since these are unaffected.

Long-term fix: increase the array size limit for printing in `pandas` and/or `numpy`. A sketch of both workarounds follows.
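Both workarounds, sketched against the same illustrative DataFrame as above (`sys.maxsize` effectively disables numpy's elision):

```python
import sys
import numpy as np
import pandas as pd

df = pd.DataFrame({"attention_mask": [np.ones(5000, dtype=np.int64)]})

# Temporary fix: parquet/json serialise the full array, no string elision involved.
df.to_parquet("out.parquet")  # requires pyarrow or fastparquet
df.to_json("out.json", orient="records")

# Long-term fix (sketch): raise numpy's print threshold so str(array)
# renders every element before to_csv stringifies the cell.
np.set_printoptions(threshold=sys.maxsize)
df.to_csv("out_full.csv", index=False)
```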
**Screenshots**

If applicable, add screenshots to help explain your problem.
Not applicable.