twitter-research / cwn

Message Passing Neural Networks for Simplicial and Cell Complexes

no space left on device while processing Ocean dataset #96

Closed cxw-droid closed 2 years ago

cxw-droid commented 2 years ago

Hi, thanks for the interesting paper and for sharing the code.

I tried to reproduce the Trajectory classification experiment. When I run sh ./exp/scripts/mpsn-flow.sh, it fails with a "No space left on device" error, shown below. I cannot figure out what is consuming the space, or which space the error refers to: I have plenty of free disk space, and top shows that memory usage is low.

$ sh ./exp/scripts/mpsn-flow.sh tanh
WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.2.

Using device cuda:0
Fold: None
Seed: 0
======================== Args ===========================
Namespace(batch_size=64, dataset='FLOW', device=0, drop_position='lin2', drop_rate=0.0, dump_curves=True, early_stop=False, emb_dim=64, epochs=100, eval_metric='accuracy', exp_name='flow_mpsn', final_readout='sum', flow_classes=3, flow_points=1000, fold=None, folds=None, fully_orient_invar=False, graph_norm='bn', indrop_rate=0.0, init_method='sum', iso_eps=0.01, jump_mode=None, lr=0.001, lr_scheduler='StepLR', lr_scheduler_decay_rate=0.5, lr_scheduler_decay_steps=20, lr_scheduler_min=1e-05, lr_scheduler_patience=10, max_dim=2, max_ring_size=None, minimize=False, model='edge_orient', nonlinearity='tanh', num_layers=4, num_workers=0, paraid=0, preproc_jobs=4, readout='sum', readout_dims=(0, 1, 2), result_folder='/home/xyz/code/cwn/exp/results', seed=0, simple_features=False, start_seed=0, stop_seed=4, task_type='classification', test_orient='random', train_eval_period=10, train_orient='default', tune=False, untrained=False, use_coboundaries='False', use_edge_features=False)

Processing...
  0%|          | 0/1000 [00:00<?, ?it/s]
WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.2.
WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.2.
WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.2.
WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.2.
 22%|█████████████████████████████▎ | 224/1000 [00:45<02:37, 4.92it/s]
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 480, in dump
    NumpyPickler(f, protocol=protocol).dump(value)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 279, in save
    wrapper.write_array(obj, self)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 103, in write_array
    pickler.file_handle.write(chunk.tobytes('C'))
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in feed
    obj = dumps(obj, reducers=reducers)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/_memmapping_reducer.py", line 442, in __call__
    for dumped_filename in dump(a, filename):
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 480, in dump
    NumpyPickler(f, protocol=protocol).dump(value)
OSError: [Errno 28] No space left on device
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xyz/code/cwn/exp/run_mol_exp.py", line 105, in <module>
    exp_main(passed_args)
  File "/home/xyz/code/cwn/exp/run_mol_exp.py", line 26, in exp_main
    curves = main(parsed_args)
  File "/home/xyz/code/cwn/exp/run_exp.py", line 77, in main
    dataset = load_dataset(args.dataset, max_dim=args.max_dim, fold=args.fold,
  File "/home/xyz/code/cwn/data/data_loading.py", line 150, in load_dataset
    dataset = FlowDataset(os.path.join(root, name), name, num_points=kwargs['flow_points'],
  File "/home/xyz/code/cwn/data/datasets/flow.py", line 24, in __init__
    super(FlowDataset, self).__init__(root, max_dim=1,
  File "/home/xyz/code/cwn/data/datasets/dataset.py", line 140, in __init__
    super(InMemoryComplexDataset, self).__init__(root, transform, pre_transform, pre_filter,
  File "/home/xyz/code/cwn/data/datasets/dataset.py", line 62, in __init__
    super(ComplexDataset, self).__init__(root, transform, pre_transform, pre_filter)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 92, in __init__
    self._process()
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 165, in _process
    self.process()
  File "/home/xyz/code/cwn/data/datasets/flow.py", line 50, in process
    train, val, G = load_flow_dataset(num_points=self._num_points,
  File "/home/xyz/code/cwn/data/datasets/flow_utils.py", line 321, in load_flow_dataset
    train_samples = parallel(delayed(generate_flow_cochain)(
  File "/home/xyz/code/cwn/data/parallel.py", line 14, in __call__
    return Parallel.__call__(self, *args, **kwargs)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/home/xyz/miniconda3/envs/cwn/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

crisbodnar commented 2 years ago

Hi @cxw-droid! This issue seems to be related to joblib, which appears to run out of space for storing the intermediate results of the parallel computation it orchestrates. I would suggest changing this line: https://github.com/twitter-research/cwn/blob/main/exp/scripts/mpsn-flow.sh#L13 to use a smaller number of preprocessing jobs, as sketched below. In particular, it should work with 1 job (even though processing might take slightly longer), since that pre-processes everything serially without joblib.
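A sketch of that change, assuming line 13 of the script passes the job count via a --preproc_jobs flag (the flag name is inferred from the preproc_jobs=4 entry in the Args dump above, not confirmed against the script itself):

# exp/scripts/mpsn-flow.sh, line 13 -- flag name assumed, see note above
--preproc_jobs 1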

cxw-droid commented 2 years ago

Thanks, it works if the number of jobs is changed to 1.

However, it still fails even if I change the number of jobs to 2. I am not sure whether that is expected; just FYI.
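For anyone who wants to keep multiple jobs, a possible workaround (a sketch, assuming the space that fills up is joblib's memmap spill folder, which defaults to the RAM-backed tmpfs /dev/shm when available rather than the main disk) is to point joblib at a disk-backed directory via the JOBLIB_TEMP_FOLDER environment variable before launching the script; the path below is only an example:

$ mkdir -p /home/xyz/joblib_tmp   # any directory on a disk with free space
$ JOBLIB_TEMP_FOLDER=/home/xyz/joblib_tmp sh ./exp/scripts/mpsn-flow.sh tanh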