sberbank-ai-lab / LightAutoML

LAMA - automatic model creation framework
Apache License 2.0

Problem with TabularNLPAutoML #76

Closed qvviko closed 2 years ago

qvviko commented 2 years ago

When trying to use the TabularNLPAutoML preset, a problem occurs with the DataLoader (`RuntimeError: DataLoader worker (pid 2645328) is killed by signal: Segmentation fault.`). The full log can be found here - full log.txt.

To reproduce, I run:

```python
from pathlib import Path  # needed for data_dir below

import pandas as pd

from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDecoNLP

data_dir = Path('./tmp')
TARGET_NAME = 'label'
THREAD_N = 32
FOLDS = 5
TIMEOUT = 3600
STATE = 42

df = pd.read_csv(data_dir / 'train.csv')

task = Task('binary')
roles = {'target': TARGET_NAME, 'text': ['tweet']}

RD = ReportDecoNLP()
automl = TabularNLPAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=THREAD_N,
    general_params={'use_algos': [['lgb', 'lgb_tuned']]},
    reader_params={'n_jobs': THREAD_N, 'cv': FOLDS, 'random_state': STATE},
    gbm_pipeline_params={'text_features': "embed"},
    text_params={'lang': 'multi'},
)
automl_rd = RD(automl)

oof_pred = automl_rd.fit_predict(df, roles=roles)
print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))
```

After running this script, I get the error shown in the previously mentioned [log.txt](https://github.com/sberbank-ai-lab/LightAutoML/files/7212704/full.log.txt).

I tried to inspect the processes with a debugger, but, unfortunately, my knowledge of the multiprocessing in your library and in torch isn't enough to provide a more thorough explanation. There also seems to be a connection to my particular setup.

My machine:
- CPU: Threadripper 2950x
- GPU: 2x1080TI

When trying to fix the problem, I found that setting `THREAD_N` to lower numbers (1-2 in my case) seems to fix it. I also noticed that the library detects both of my GPUs and, by default, sets the device for the `DLTransformer` to the first GPU, which then loads both the model and the texts onto that GPU. My guess is that during multi-worker fetching in the `DataLoader`, the workers try to load all of their batches onto the GPU (which has only 11GB of memory), since the problem doesn't occur with a lower number of threads. One of the fixes I've come up with is to disable the GPU for the pipeline altogether, e.g.:
```python
automl = TabularNLPAutoML(task=task,
                          ....
                          gpu_ids=None
                          )
```

Is this intended behavior? What if someone wanted to use the GPU to speed up their processing?
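
For completeness, the other workaround I mentioned above is simply lowering the parallelism passed to the preset. A minimal sketch, reusing the parameters from the reproduction script; the concrete value is illustrative only, not a verified fix:

```python
# Workaround sketch: keep the GPU enabled but reduce the number of
# threads/workers so the DataLoader workers don't exhaust GPU memory.
# Parameter names are the same ones used in the reproduction script above;
# LOW_THREADS = 2 is just an illustrative value, not a tuned setting.
LOW_THREADS = 2

automl = TabularNLPAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=LOW_THREADS,
    general_params={'use_algos': [['lgb', 'lgb_tuned']]},
    reader_params={'n_jobs': LOW_THREADS, 'cv': FOLDS, 'random_state': STATE},
    gbm_pipeline_params={'text_features': "embed"},
    text_params={'lang': 'multi'},
)
```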

alexmryzhkov commented 2 years ago

Hi @qvviko,

Thanks for such a detailed issue description - we are looking into it 👍

Alex

CrustaceanJ commented 2 years ago

Hi, @qvviko.

Can you please provide the output of the `ulimit -a` command in the terminal?

qvviko commented 2 years ago

@CrustaceanJ

```
~ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-u: processes                       256602
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  1024
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 256602
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited
```
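
For what it's worth, the file-descriptor limit above is only 1024. A commonly suggested mitigation when `DataLoader` workers die after many workers exhaust that limit is to switch torch's tensor sharing strategy; this is an assumption on my side, not a confirmed fix for this issue:

```python
import torch.multiprocessing as mp

# Assumption: the crashes may be related to the low `ulimit -n` (1024).
# torch's default 'file_descriptor' sharing strategy consumes one file
# descriptor per tensor shared between worker processes; 'file_system'
# avoids that at the cost of using files in shared memory. This is a
# commonly suggested mitigation, not a verified fix for this issue.
mp.set_sharing_strategy('file_system')
```
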
github-actions[bot] commented 2 years ago

Stale issue message