sberbank-ai-lab / LightAutoML

LAMA - automatic model creation framework
Apache License 2.0

Feature importances in TabularNLPAutoML #129

Open fingoldo opened 2 years ago

fingoldo commented 2 years ago

Hi, is it possible to get feature importances in TabularNLPAutoML for regular (non-text) features, the same as in TabularAutoML? Currently automl.get_feature_scores("fast") throws an error:


AttributeError                            Traceback (most recent call last)
<ipython-input-182-f726a358fe6b> in <module>
      1 # Fast feature importances calculation
----> 2 fast_fi = pipe.base_estimator.get_feature_scores("fast")
      3 fast_fi.set_index("Feature")["Importance"].plot.bar(figsize=(30, 10), grid=True)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\presets\tabular_presets.py in get_feature_scores(self, calc_method, data, features_names, silent)
    577     ):
    578         if calc_method == "fast":
--> 579             for level in self.levels:
    580                 for pipe in level:
    581                     fi = pipe.pre_selection.get_features_score()

AttributeError: 'TabularNLPAutoML' object has no attribute 'levels'

alexmryzhkov commented 2 years ago

Hi @fingoldo,

Thanks for the issue. Could you also share the code showing how you set up the task, roles, and TabularNLPAutoML, along with the full training log?

Alex

fingoldo commented 2 years ago

Thanks for the quick reply, Alex! Sure. Basically, it's this:

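import multiprocessing

import psutil
from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task

# X (the training DataFrame) and TARGET_COLUMN are defined elsewhere
N_THREADS = multiprocessing.cpu_count()
MEMORY_LIMIT = psutil.virtual_memory().total * 0.9 / 1024 ** 3  # 90% of total RAM, in GB
verbose = 1
task = Task("reg", loss="mse", metric="mae")
timeout = 60 * 60 * 3  # 3 hours, in seconds
automl = TabularNLPAutoML(task=task, timeout=timeout, cpu_limit=N_THREADS, gpu_ids="all", text_params={"lang": "en"})

automl.fit_predict(X, roles={"text": ["title"], "drop": [], "target": TARGET_COLUMN})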

the log:

[14:43:54] Stdout logging level is INFO.

2022-03-27 14:43:54,513 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - set_verbosity_level-line:267 - Stdout logging level is INFO.
2022-03-27 14:43:54,535 - INFO3 - MainProcess[19272]-MainThread[19072]-text_presets.py-lightautoml.automl.presets.text_presets - infer_auto_params-line:230 - Model language mode: en

[14:43:54] Task: reg

2022-03-27 14:43:54,556 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:196 - Task: reg

[14:43:54] Start automl preset with listed constraints:

2022-03-27 14:43:54,558 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:198 - Start automl preset with listed constraints:

[14:43:54] - time: 10800.00 seconds

2022-03-27 14:43:54,559 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:199 - - time: 10800.00 seconds

[14:43:54] - CPU: 32 cores

2022-03-27 14:43:54,561 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:200 - - CPU: 32 cores

[14:43:54] - memory: 16 GB

2022-03-27 14:43:54,563 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:201 - - memory: 16 GB

[14:43:54] Train data shape: (9000, 290)

2022-03-27 14:43:54,565 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.reader.base - fit_read-line:274 - Train data shape: (9000, 290)

2022-03-27 14:43:57,354 - INFO3 - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.reader.base - advanced_roles_guess-line:607 - Feats was rejected during automatic roles guess: []

[14:43:57] Layer 1 train process start. Time left 10797.12 secs

2022-03-27 14:43:57,443 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:213 - Layer 1 train process start. Time left 10797.12 secs

[14:44:02] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...

2022-03-27 14:44:02,316 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:245 - Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...

[14:44:05] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -940.749755859375

2022-03-27 14:44:05,244 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:293 - Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -940.749755859375

[14:44:05] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed

2022-03-27 14:44:05,246 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:296 - Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed

[14:44:05] Time left 10789.31 secs

2022-03-27 14:44:05,257 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:223 - Time left 10789.31 secs

2022-03-27 14:44:06,717 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'params': 'FastText(vocab=0, vector_size=64, alpha=0.025)', 'datetime': '2022-03-27T14:44:06.717633', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'created'}
2022-03-27 14:44:06,725 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - scan_vocab-line:578 - collecting all words and their counts
2022-03-27 14:44:06,726 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _scan_vocab-line:561 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-03-27 14:44:06,745 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - scan_vocab-line:584 - collected 10828 word types from a corpus of 46369 raw words and 9000 sentences
2022-03-27 14:44:06,746 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - prepare_vocab-line:633 - Creating a fresh vocabulary
2022-03-27 14:44:06,824 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'effective_min_count=1 retains 10828 unique words (100.0%% of original 10828, drops 0)', 'datetime': '2022-03-27T14:44:06.824618', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'prepare_vocab'}
2022-03-27 14:44:06,825 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'effective_min_count=1 leaves 46369 word corpus (100.0%% of original 46369, drops 0)', 'datetime': '2022-03-27T14:44:06.825618', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'prepare_vocab'}
2022-03-27 14:44:06,968 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - prepare_vocab-line:741 - deleting the raw counts dictionary of 10828 items
2022-03-27 14:44:06,969 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - prepare_vocab-line:744 - sample=0.001 downsamples 40 most-common words
2022-03-27 14:44:06,970 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'downsampling leaves estimated 40640.463918984155 word corpus (87.6%% of prior 46369)', 'datetime': '2022-03-27T14:44:06.970622', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'prepare_vocab'}
2022-03-27 14:44:07,295 - INFO - MainProcess[19272]-MainThread[19072]-fasttext.py-gensim.models.fasttext - estimate_memory-line:493 - estimated required memory for 10828 words, 2000000 buckets and 64 dimensions: 525048308 bytes
2022-03-27 14:44:07,296 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - init_weights-line:859 - resetting layer weights
2022-03-27 14:44:09,287 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2022-03-27T14:44:09.287742', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'build_vocab'}
2022-03-27 14:44:09,289 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'training model with 3 workers on 10828 vocabulary and 64 features, using sg=0 hs=0 sample=0.001 negative=5 window=3 shrink_windows=True', 'datetime': '2022-03-27T14:44:09.289723', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'train'}
2022-03-27 14:44:09,376 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 2 more threads
2022-03-27 14:44:09,409 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 1 more threads
2022-03-27 14:44:09,414 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 0 more threads
2022-03-27 14:44:09,414 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_end-line:1629 - EPOCH - 1 : training on 46369 raw words (40640 effective words) took 0.1s, 404546 effective words/s
2022-03-27 14:44:09,500 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 2 more threads
2022-03-27 14:44:09,531 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 1 more threads
2022-03-27 14:44:09,544 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 0 more threads
2022-03-27 14:44:09,545 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_end-line:1629 - EPOCH - 2 : training on 46369 raw words (40644 effective words) took 0.1s, 350692 effective words/s
2022-03-27 14:44:09,546 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'training on 92738 raw words (81284 effective words) took 0.3s, 317320 effective words/s', 'datetime': '2022-03-27T14:44:09.546730', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'train'}
100%|████████████████████████████████████████████████████████████████████████████| 9000/9000 [00:07<00:00, 1273.13it/s]
2022-03-27 14:44:18,279 - INFO3 - MainProcess[19272]-MainThread[19072]-text.py-lightautoml.transformers.text - fit-line:788 - Feature concated__title fitted
2022-03-27 14:44:24,936 - INFO3 - MainProcess[19272]-MainThread[19072]-text.py-lightautoml.transformers.text - transform-line:834 - Feature concated__title transformed

[14:44:24] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...

2022-03-27 14:44:24,992 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:245 - Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...

[14:44:36] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -924.1246948242188

2022-03-27 14:44:36,807 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:293 - Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -924.1246948242188

[14:44:36] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed

2022-03-27 14:44:36,809 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:296 - Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed

[14:44:36] Time left 10757.75 secs

2022-03-27 14:44:36,816 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:223 - Time left 10757.75 secs

[14:44:36] Layer 1 training completed.

2022-03-27 14:44:36,818 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:241 - Layer 1 training completed.

[14:44:36] Blending: optimization starts with equal weights and score -924.7379150390625

2022-03-27 14:44:36,827 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:370 - Blending: optimization starts with equal weights and score -924.7379150390625

[14:44:36] Blending: iteration 0: score = -922.67333984375, weights = [0.25724643 0.74275357]

2022-03-27 14:44:36,850 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:395 - Blending: iteration 0: score = -922.67333984375, weights = [0.25724643 0.74275357]

[14:44:36] Blending: iteration 1: score = -922.67333984375, weights = [0.25724643 0.74275357]

2022-03-27 14:44:36,873 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:395 - Blending: iteration 1: score = -922.67333984375, weights = [0.25724643 0.74275357]

[14:44:36] Blending: no score update. Terminated

2022-03-27 14:44:36,875 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:402 - Blending: no score update. Terminated

[14:44:36] Automl preset training completed in 42.32 seconds

2022-03-27 14:44:36,883 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:214 - Automl preset training completed in 42.32 seconds

[14:44:36] Model description:
Final prediction for new objects (level 0) = 
     0.25725 * (3 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
     0.74275 * (3 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) 

2022-03-27 14:44:36,885 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:215 - Model description:
Final prediction for new objects (level 0) = 
     0.25725 * (3 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
     0.74275 * (3 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) 

alexmryzhkov commented 2 years ago

Hi @fingoldo,

I have looked into it: the TabularNLPAutoML preset doesn't use a feature selector (it would be quite slow in this case), which is why the fast feature importances can't be shown. Could you please try the accurate method instead of fast?

Alex
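
For readers following along: judging by the signature visible in the traceback above (calc_method, data, features_names, silent), the accurate variant would presumably be called as automl.get_feature_scores("accurate", data), with data being a labelled DataFrame. This is an assumption and has not been verified against TabularNLPAutoML here. As far as I know, the accurate method is permutation-based, so the sketch below illustrates that general idea in plain Python. It is not LightAutoML's implementation; permutation_importance, toy_predict, and the toy data are hypothetical names made up for the illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[List[Row]], List[float]],
    rows: List[Row],
    target: Sequence[float],
    features: Sequence[str],
    metric: Callable[[Sequence[float], Sequence[float]], float] = mae,
    seed: int = 42,
) -> Dict[str, float]:
    """Importance of a feature = how much the error grows when that
    feature's column is shuffled and everything else is left intact."""
    rng = random.Random(seed)
    base_error = metric(target, predict(rows))
    scores: Dict[str, float] = {}
    for feat in features:
        # Shuffle one column to break its link with the target.
        shuffled = [row[feat] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, feat: value} for row, value in zip(rows, shuffled)]
        scores[feat] = metric(target, predict(permuted)) - base_error
    return scores


# Hypothetical stand-in for a fitted model: it only looks at "price".
def toy_predict(rows: List[Row]) -> List[float]:
    return [3.0 * row["price"] for row in rows]


rows = [{"price": float(i), "noise": float((i * 7) % 5)} for i in range(20)]
target = [3.0 * row["price"] for row in rows]

importances = permutation_importance(toy_predict, rows, target, ["price", "noise"])
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Because this approach only needs predictions on modified data, it does not depend on a feature selector having been fitted, which fits the explanation above. The trade-off is cost: the model has to predict once per feature, so it is slower than the fast method.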