huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] TypeError: can only concatenate list (not "str") to list #688

Closed Milkyroad closed 1 week ago

Milkyroad commented 1 week ago

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

No response

Error Logs

Downloading data: 100%|█████████▉| 1.93G/1.93G [00:04<00:00, 392MB/s]

Downloading data: 100%|█████████▉| 482M/482M [00:01<00:00, 290MB/s]

Generating train split: 26400 examples [00:21, 1227.53 examples/s]

Generating test split: 6601 examples [00:05, 1227.73 examples/s]

ERROR | 2024-06-26 11:25:49 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/tabular/main.py", line 189, in train
    [config.id_column] + config.target_columns if config.id_column is not None else config.target_columns
TypeError: can only concatenate list (not "str") to list

ERROR | 2024-06-26 11:25:49 | autotrain.trainers.common:wrapper:121 - can only concatenate list (not "str") to list
INFO | 2024-06-26 11:25:49 | autotrain.trainers.common:pause_space:77 - Pausing space...

Additional Information

I am trying to use autotrain for tabular classification on my huggingface dataset. However, it always shows

TypeError: can only concatenate list (not "str") to list

I have added id and target to the dataset as per the documentation, but the problem persists. Any idea what is happening?
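The traceback points at a list-concatenation expression, which suggests the target column setting reached the trainer as a plain string rather than a list. A minimal sketch of the failure mode (the variable names below are illustrative, not AutoTrain's actual config object):

```python
# Reproduce the TypeError from the logs: concatenating a list with a string fails.
id_column = "id"

# If the target columns arrive as a plain string...
target_columns = "target"
try:
    cols = [id_column] + target_columns if id_column is not None else target_columns
except TypeError as exc:
    print(exc)  # can only concatenate list (not "str") to list

# ...wrapping the target(s) in a list makes the same expression work:
target_columns = ["target"]
cols = [id_column] + target_columns if id_column is not None else target_columns
print(cols)  # ['id', 'target']
```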

abhishekkrthakur commented 1 week ago

could you please share a screenshot of the UI so i can take a look at the parameters and column mapping that were used?

Milkyroad commented 1 week ago

[screenshot]

I didn't touch anything other than the train and valid fields. I also tried editing the numerical columns by entering the features separated by commas, but it told me it is not a list. Thanks for your help 🙏

abhishekkrthakur commented 1 week ago

which dataset are you using?

Milkyroad commented 1 week ago

It is a private dataset with features extracted from VGGFace, along with the id and target columns added to it. [screenshot]

abhishekkrthakur commented 1 week ago

thanks. i believe it is a csv/jsonl dataset?

Milkyroad commented 1 week ago

Yep, it is a csv dataset. I also tried to upload it using push_to_hub, which converted it to .parquet, but the problem still persists.

abhishekkrthakur commented 1 week ago

thanks for all the information. im taking a look and will come back to you as soon as possible.

abhishekkrthakur commented 1 week ago

with all the parameters as shown in your screenshot, i was able to successfully train a model. here's what my CSV looked like:

id,category1,category2,feature1,target
1,A,X,0.3373961604172684,1
2,B,Z,0.6481718720511972,0
3,A,Y,0.36824153984054797,1
4,B,Z,0.9571551589530464,1
5,B,Z,0.14035078041264515,1
6,C,X,0.8700872583584364,1
7,A,Y,0.4736080452737105,0
8,C,Y,0.8009107519796442,1
9,A,Y,0.5204774795512048,0
10,A,Y,0.6788795301189603,0

But i believe i might know what went wrong. if you use a hub dataset, you should have both training and test/valid splits. but from your logs, it seems like there are no test/validation samples. there is no auto-splitting for hub datasets yet. ill take a look into it but it might take a while.

Could you try factory rebuilding the autotrain space and uploading your training data as csv file and train the model? (validation data/split is optional when uploading the data)

Milkyroad commented 1 week ago

Okay, so I tried again:

'feature_4092', 'feature_4093', 'feature_4094', 'feature_4095', 'feature_4096']
INFO     | 2024-06-26 14:02:48 | __main__:train:266 - Preprocessor: ColumnTransformer(n_jobs=-1,
                  transformers=[('numeric',
                                 Pipeline(steps=[('num_imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('num_scaler',
                                                  RobustScaler())]),
                                 ['feature_1', 'feature_2', 'feature_3',
                                  'feature_4', 'feature_5', 'feature_6',
                                  'feature_7', 'feature_8', 'feature_9',
                                  'feature_10', 'feature_11', 'feature_12',
                                  'feature_13', 'feature_14', 'feature_15',
                                  'feature_16', 'feature_17', 'feature_18',
                                  'feature_19', 'feature_20', 'feature_21',
                                  'feature_22', 'feature_23', 'feature_24',
                                  'feature_25', 'feature_26', 'feature_27',
                                  'feature_28', 'feature_29', 'feature_30', ...]),
                                ('categorical',
                                 Pipeline(steps=[('cat_imputer',
                                                  SimpleImputer(strategy='most_frequent'))]),
                                 [])],
                  verbose=True)
INFO     | 2024-06-26 14:02:48 | __main__:train:291 - Sub task: binary_classification
[I 2024-06-26 14:02:48,870] A new study created in memory with name: AutoTrain
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1271: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 2.
  warnings.warn(
INFO     | 2024-06-26 14:02:50 | __main__:optimize:122 - Metrics: {'auc': 0.5, 'logloss': 0.6931471805599453, 'f1': 0.8, 'accuracy': 0.6666666666666666, 'precision': 0.6666666666666666, 'recall': 1.0, 'loss': 0.6931471805599453}
[I 2024-06-26 14:02:50,932] Trial 0 finished with value: 0.6931471805599453 and parameters: {'C': 0.050249608515686287, 'fit_intercept': True, 'solver': 'liblinear', 'penalty': 'l1'}. Best is trial 0 with value: 0.6931471805599453.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.4s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_sag.py:349: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
INFO     | 2024-06-26 14:02:51 | __main__:optimize:122 - Metrics: {'auc': 0.875, 'logloss': 0.28233999916480795, 'f1': 0.75, 'accuracy': 0.6666666666666666, 'precision': 0.75, 'recall': 0.75, 'loss': 0.28233999916480795}
[I 2024-06-26 14:02:51,579] Trial 1 finished with value: 0.28233999916480795 and parameters: {'C': 54.0012023479519, 'fit_intercept': True, 'solver': 'saga', 'penalty': 'l1'}. Best is trial 1 with value: 0.28233999916480795.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_sag.py:349: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
INFO     | 2024-06-26 14:02:52 | __main__:optimize:122 - Metrics: {'auc': 0.875, 'logloss': 0.28731842247097966, 'f1': 0.75, 'accuracy': 0.6666666666666666, 'precision': 0.75, 'recall': 0.75, 'loss': 0.28731842247097966}
[I 2024-06-26 14:02:52,079] Trial 2 finished with value: 0.28731842247097966 and parameters: {'C': 0.05883502425308892, 'fit_intercept': True, 'solver': 'saga', 'penalty': 'l2'}. Best is trial 1 with value: 0.28233999916480795.
INFO:     10.16.19.59:58753 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO:     10.16.3.126:49496 - "GET /ui/is_model_training HTTP/1.1" 200 OK
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
INFO     | 2024-06-26 14:02:52 | __main__:optimize:122 - Metrics: {'auc': 0.5, 'logloss': 0.6931471805599453, 'f1': 0.8, 'accuracy': 0.6666666666666666, 'precision': 0.6666666666666666, 'recall': 1.0, 'loss': 0.6931471805599453}
[I 2024-06-26 14:02:52,449] Trial 3 finished with value: 0.6931471805599453 and parameters: {'C': 0.0010280499524346308, 'fit_intercept': False, 'solver': 'saga', 'penalty': 'l1'}. Best is trial 1 with value: 0.28233999916480795.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1271: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 2.
  warnings.warn(
INFO     | 2024-06-26 14:02:52 | __main__:optimize:122 - Metrics: {'auc': 1.0, 'logloss': 0.0739307783924105, 'f1': 1.0, 'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'loss': 0.0739307783924105}
[I 2024-06-26 14:02:52,801] Trial 4 finished with value: 0.0739307783924105 and parameters: {'C': 270.39233714986204, 'fit_intercept': False, 'solver': 'liblinear', 'penalty': 'l1'}. Best is trial 4 with value: 0.0739307783924105.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
INFO     | 2024-06-26 14:02:53 | __main__:optimize:122 - Metrics: {'auc': 0.5, 'logloss': 0.6931471805599453, 'f1': 0.8, 'accuracy': 0.6666666666666666, 'precision': 0.6666666666666666, 'recall': 1.0, 'loss': 0.6931471805599453}
[I 2024-06-26 14:02:53,163] Trial 5 finished with value: 0.6931471805599453 and parameters: {'C': 2.6688300488207432e-05, 'fit_intercept': False, 'solver': 'saga', 'penalty': 'l1'}. Best is trial 4 with value: 0.0739307783924105.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
INFO     | 2024-06-26 14:02:53 | __main__:optimize:122 - Metrics: {'auc': 0.5, 'logloss': 0.6929925713903798, 'f1': 0.8, 'accuracy': 0.6666666666666666, 'precision': 0.6666666666666666, 'recall': 1.0, 'loss': 0.6929925713903798}
[I 2024-06-26 14:02:53,520] Trial 6 finished with value: 0.6929925713903798 and parameters: {'C': 3.450892455858824e-05, 'fit_intercept': True, 'solver': 'saga', 'penalty': 'l1'}. Best is trial 4 with value: 0.0739307783924105.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1271: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 2.
  warnings.warn(
INFO     | 2024-06-26 14:02:53 | __main__:optimize:122 - Metrics: {'auc': 1.0, 'logloss': 0.14204618492002583, 'f1': 0.8571428571428571, 'accuracy': 0.8333333333333334, 'precision': 1.0, 'recall': 0.75, 'loss': 0.14204618492002583}
[I 2024-06-26 14:02:53,885] Trial 7 finished with value: 0.14204618492002583 and parameters: {'C': 493.35468416325705, 'fit_intercept': False, 'solver': 'liblinear', 'penalty': 'l1'}. Best is trial 4 with value: 0.0739307783924105.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1271: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 2.
  warnings.warn(
INFO     | 2024-06-26 14:02:54 | __main__:optimize:122 - Metrics: {'auc': 0.875, 'logloss': 0.6625963660861149, 'f1': 0.75, 'accuracy': 0.6666666666666666, 'precision': 0.75, 'recall': 0.75, 'loss': 0.6625963660861149}
[I 2024-06-26 14:02:54,238] Trial 8 finished with value: 0.6625963660861149 and parameters: {'C': 6.370084223478923, 'fit_intercept': False, 'solver': 'liblinear', 'penalty': 'l1'}. Best is trial 4 with value: 0.0739307783924105.
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
/app/env/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1271: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 2.
  warnings.warn(
INFO     | 2024-06-26 14:02:54 | __main__:optimize:122 - Metrics: {'auc': 1.0, 'logloss': 0.07601203379370798, 'f1': 1.0, 'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'loss': 0.07601203379370798}
[I 2024-06-26 14:02:54,605] Trial 9 finished with value: 0.07601203379370798 and parameters: {'C': 270.9366654807268, 'fit_intercept': True, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 4 with value: 0.0739307783924105.
INFO     | 2024-06-26 14:02:54 | __main__:train:309 - Best params: {'C': 270.39233714986204, 'fit_intercept': False, 'solver': 'liblinear', 'penalty': 'l1'}
[ColumnTransformer] ....... (1 of 1) Processing numeric, total=   0.3s
INFO     | 2024-06-26 14:02:54 | __main__:optimize:122 - Metrics: {'auc': 1.0, 'logloss': 0.1193098511039355, 'f1': 1.0, 'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'loss': 0.1193098511039355}
INFO     | 2024-06-26 14:02:54 | __main__:train:344 - Pushing model to hub...

  0%|          | 0/2 [00:00<?, ?it/s]

model.joblib:   0%|          | 0.00/384k [00:00<?, ?B/s]
model.joblib: 100%|██████████| 384k/384k [00:00<00:00, 1.70MB/s]

 50%|█████     | 1/2 [00:00<00:00,  3.68it/s]

target_encoders.joblib:   0%|          | 0.00/376 [00:00<?, ?B/s]
target_encoders.joblib: 100%|██████████| 376/376 [00:00<00:00, 12.8kB/s]

100%|██████████| 2/2 [00:00<00:00,  5.56it/s]
100%|██████████| 2/2 [00:00<00:00,  5.16it/s]
INFO     | 2024-06-26 14:02:55 | autotrain.trainers.common:pause_space:77 - Pausing space...

It seemed to run and it gave me this model: [screenshot]

After that, the space pauses on its own. This is the first time I've used autotrain, so I'm not sure if this is the expected behaviour.

This is how my dataset is structured on the hub: [screenshot]

I have also tried uploading my entire training set to autotrain directly, but it was taking too long (longer than uploading to a hub dataset), so I decided to test with 30 rows first.

abhishekkrthakur commented 1 week ago

AutoTrain pauses itself on failure or success to save resources (and, when using bigger instances, money) for the end user. it seems like your training succeeded.

regarding the hub dataset, the way you uploaded it doesnt seem to be correct. you need to upload separate splits, and the target needs to be a ClassLabel. take a look here: https://huggingface.co/docs/datasets/en/tabular_load

when pushing the dataset to the hub, you also need to provide splits and convert the target column to ClassLabel. some useful code can be found here: https://github.com/huggingface/autotrain-advanced/blob/main/src/autotrain/preprocessor/tabular.py

Milkyroad commented 1 week ago

Ah, thanks for your help! I'll close this issue then.

abhishekkrthakur commented 1 week ago

just wanted to mention: the upload might take time for large datasets, depending on your internet speed. it will succeed if you let it run for a while. however, once the dataset is on the hub, it can be re-used quite easily. ill see if i can make a space to convert datasets to hf hub datasets easily in the coming days