microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Any colon in a CSV or TSV file fools the parser #4180

Open citron opened 3 years ago

citron commented 3 years ago

Description

Using the CLI version of LightGBM, if the input data contains colons (for instance 2015-05-19T19:16:02UTC), the file format is interpreted as LibSVM instead of CSV or TSV.

LightGBM version or commit hash: v3.2.1

Command(s) you used to install LightGBM

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j4

Additional Comments

The source of the problem is in the implementation of the function DataType GetDataType(const char* filename, bool header, const std::vector<std::string>& lines, int* num_col) in the file parser.cpp.

jameslamb commented 3 years ago

Thanks for using LightGBM! Could you please provide a small reproducible example? A small dataset (or code to create it) + code that we could run to replicate the issue?

citron commented 3 years ago

Hello, please find the requested small dataset here:

AAA BBB CCC DDD EEE
0.27 a 2015-01-01T05:25:00UTC [1;2;3] 0
0.07 b 2015-01-03T05:25:00UTC [4;2;3] 0
0.25 c 2015-01-01T05:25:00UTC [2;3] 1
0.04 a 2015-01-02T05:25:00UTC [5;2;3] 1
0.43 b 2015-01-01T05:25:00UTC [1;2] 1
0.73 a 2015-01-02T05:25:00UTC [1;2;3] 0
0.55 c 2015-01-02T05:25:00UTC [6;2;3] 1
0.04 b 2015-01-01T05:25:00UTC [1;2;3] 1
0.26 a 2015-01-01T05:25:00UTC [1;9;3] 0
0.66 b 2015-01-01T05:25:00UTC [1;2;3] 0
0.26 a 2015-01-03T05:25:00UTC [1;3] 1
0.26 c 2015-01-01T05:25:00UTC [1;2;3] 1
0.09 c 2015-01-03T05:25:00UTC [1;2;3] 1
0.86 a 2015-01-03T05:25:00UTC [1;2] 0
0.32 b 2015-01-03T05:25:00UTC [1;2;3] 0
0.09 a 2015-01-01T05:25:00UTC [1;8;3] 1

Tabs do not survive GitHub Markdown, so please save the sample and convert the spaces to tabs with: tr ' ' '\t'
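For readers without tr at hand, the same conversion can be sketched in Python. This is only an illustration of the step described above; the helper name spaces_to_tabs is hypothetical and the sample is shortened to the header and the first row.

```python
# Hypothetical helper: the Python counterpart of `tr ' ' '\t'`.
def spaces_to_tabs(text: str) -> str:
    # Every single space becomes one tab; nothing else is touched.
    return text.replace(" ", "\t")


# Header and first row of the sample dataset, as pasted with spaces.
sample = (
    "AAA BBB CCC DDD EEE\n"
    "0.27 a 2015-01-01T05:25:00UTC [1;2;3] 0\n"
)

converted = spaces_to_tabs(sample)

# Each line should now split into five tab-separated fields.
for line in converted.splitlines():
    print(line.split("\t"))
```

Writing the converted text to train.tsv would match the data path used in the config that follows.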

And here is the LightGBM config:

task = train
boosting_type = gbdt
objective = binary
metric = binary_logloss,auc
metric_freq = 1
is_training_metric = true
max_bin = 255

data = train.tsv
valid_data = test.tsv

num_trees = 10
learning_rate = 0.1
num_leaves = 5

tree_learner = serial
feature_fraction = 0.8

bagging_freq = 5

bagging_fraction = 0.8

min_sum_hessian_in_leaf = 5.0

is_enable_sparse = true
use_two_round_loading = false

is_save_binary_file = false

output_model = LightGBM_model.txt

header = true
label_column = 5
ignore_column = 3,4
categorical_feature = 2

StrikerRUS commented 3 years ago

cc @shiyu1994

shiyu1994 commented 3 years ago

The logic to determine whether to use CSV, TSV or LibSVM is simple, but currently we prioritize LibSVM. That is, as long as a colon is found in the line, LibSVM will be used. https://github.com/microsoft/LightGBM/blob/08d1ce4bbf9391f5cc945f5f015840f4a57774a6/src/io/parser.cpp#L200-L206

Also, so far we only support integer and floating-point feature values, so with valid input this case only happens when header=true in a CSV or TSV file and a feature name contains colons. In that case, exchanging the order of the if-else branches may solve the problem, since it is more natural for a feature name to contain colons than to contain tabs or commas.

Also, I think it would be a good feature request to support any format for categorical feature import. We can just treat them as strings and remap them into distinct integers.
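The effect of the branch order can be illustrated with a small Python sketch. It is not the code in parser.cpp; the function names are hypothetical, and the only assumption taken from the comment above is that a line is classified by which of colon, tab or comma it contains, with the colon checked first.

```python
# Illustrative sketch only; not LightGBM's actual implementation.

def count_delimiters(line: str) -> dict:
    # Count the characters that hint at each file format.
    return {
        "colon": line.count(":"),
        "tab": line.count("\t"),
        "comma": line.count(","),
    }


def detect_colon_first(line: str) -> str:
    # Order described in the comment: any colon means LibSVM.
    c = count_delimiters(line)
    if c["colon"] > 0:
        return "libsvm"
    if c["tab"] > 0:
        return "tsv"
    if c["comma"] > 0:
        return "csv"
    return "unknown"


def detect_delimiter_first(line: str) -> str:
    # Proposed order: tabs and commas are checked before colons.
    c = count_delimiters(line)
    if c["tab"] > 0:
        return "tsv"
    if c["comma"] > 0:
        return "csv"
    if c["colon"] > 0:
        return "libsvm"
    return "unknown"


# First data row of the sample dataset, tab-separated.
row = "0.27\ta\t2015-01-01T05:25:00UTC\t[1;2;3]\t0"
print(detect_colon_first(row))      # libsvm
print(detect_delimiter_first(row))  # tsv
```

A space-separated LibSVM line such as "1 1:0.5 3:0.2" is still detected as LibSVM by both orderings in this sketch, because it contains neither tabs nor commas.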

StrikerRUS commented 3 years ago

@shiyu1994

Also, I think it would be a good feature request to support any format for categorical feature import. We can just treat them as strings and remap them into distinct integers.

Do you mean this?

#789

We have this as a feature request.

shiyu1994 commented 3 years ago

@StrikerRUS, exactly.

shiyu1994 commented 3 years ago

In this case, the example contains colons in its feature values. But since we currently allow only integer and floating-point feature values, this example dataset won't behave as expected anyway, because the date strings will be treated as NaNs. However, we can exchange the order of the if-else branches to support colons in feature names in CSV or TSV files.
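To make the NaN point concrete, here is a rough Python sketch of what "only numeric feature values" means for the sample row. It assumes a parser that falls back to NaN for any token it cannot read as a number; parse_feature is a hypothetical helper, and LightGBM's own number parsing is not reproduced here.

```python
import math


def parse_feature(token: str) -> float:
    # Assumed behavior: a token that is not a number becomes NaN.
    try:
        return float(token)
    except ValueError:
        return math.nan


# First data row of the sample dataset, split on tabs.
row = "0.27\ta\t2015-01-01T05:25:00UTC\t[1;2;3]\t0".split("\t")
values = [parse_feature(token) for token in row]
print(values)  # [0.27, nan, nan, nan, 0.0]
```

Under that assumption only the first and last columns carry usable values, so fixing the format detection alone would not make the date column informative.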