Training on a system with no GPU

vshcherbyna commented 3 years ago

Hello,

Thank you for creating a nice project for nnue training in pytorch!

I am trying to use your project to create a network for Igel. I wanted to ask you if it is possible to have trainer in "CPU mode only" as I am renting some "bare metal hardware" which has powerful CPUs, but no GPU is present. When I run the trainer on Ubuntu 20.04 I get this:

_python train.py --smart-fen-skipping --random-fen-skipping 10 --batch-size 16384 --threads 8 /home/volodymyr/training/sharpen_data/total_759m_d12.bin /home/volodymyr/training/total_30m_d14.bin Feature set: HalfKP^ Num real features: 41024 Num virtual features: 704 Num features: 41728 Training with /home/volodymyr/training/sharpen_data/total_759m_d12.bin validating with /home/volodymyr/training/total_30m_d14.bin Global seed set to 42 Seed 42 Using batch size 16384 Smart fen skipping: True Random fen skipping: 10 limiting torch to 8 threads. Using log dir logs/ /home/volodymyr/nnue-pytorch/env/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None). warnings.warn(*args, **kwargs) GPU available: False, used: False TPU available: False, using: 0 TPU cores Traceback (most recent call last): File "train.py", line 105, in main() File "train.py", line 93, in main main_device = trainer.root_device if trainer.root_gpu is None else 'cuda:' + str(trainer.root_gpu) AttributeError: 'Trainer' object has no attribute 'rootdevice'

On some discussions on talkchess I saw some people were saying they managed to use the trainer in CPU mode, but no specifics on how.

I thought my request is common enough as other people may end up in the same situation as well, please let me know if it is possible.

Thank you very much, With best regards, Volodymyr

Sopel97 commented 3 years ago

You can try hardcoding main_device to 'cpu'. Locally I can run on cpu only by simply not including --gpus flag, but I do have a GPU so some checks might pass silently.

vshcherbyna commented 3 years ago

Thanks Sopel97,

I tried a various combinations to do this, they all seem to fail, please see below:

main_device = cpu

Throws exception: NameError: name 'cpu' is not defined

main_device = trainer.root_cpu

Throws exception: AttributeError: 'Trainer' object has no attribute 'root_cpu'

main_device = trainer.cpu

Throws exception: AttributeError: 'Trainer' object has no attribute 'cpu'

Sopel97 commented 3 years ago

try main_device = 'cpu'

vshcherbyna commented 3 years ago

Thanks Sopel97!

I tried it and the training failed with error "RuntimeError: Pinned memory requires CUDA". I went on and removed ".pin_memory()" from nnue_dataset.py and it starts but immediately crashes:

python train.py --smart-fen-skipping --random-fen-skipping 10 --batch-size 16384 --threads 32 --num-workers 32 /home/volodymyr/training/sharpen_data/total_759m_d12.bin /home/volodymyr/training/total_30m_d14.bin
Feature set: HalfKP^
Num real features: 41024
Num virtual features: 704
Num features: 41728
Training with /home/volodymyr/training/sharpen_data/total_759m_d12.bin validating with /home/volodymyr/training/total_30m_d14.bin
Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 10
limiting torch to 32 threads.
Using log dir logs/
/home/volodymyr/nnue-pytorch/env/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Using c++ data loader
Ranger optimizer loaded.
Gradient Centralization usage = True
GC applied to both conv and fc layers

  | Name   | Type   | Params
----------------------------------
0 | input  | Linear | 10.7 M
1 | l1     | Linear | 16.4 K
2 | l2     | Linear | 1.1 K
3 | output | Linear | 33
----------------------------------
10.7 M    Trainable params
0         Non-trainable params
10.7 M    Total params
42.801    Total estimated model params size (MB)
/home/volodymyr/nnue-pytorch/env/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|
| 0/2 [00:00<?, ?it/s]
Segmentation fault (core dumped)

Sopel97 commented 3 years ago

I'm not sure if async transfer works on cpu but if it does that would explain the crash (since the memory is not pinned and becomes invalidated).

Try:

def get_tensors(self, device):
        white_values = torch.from_numpy(np.ctypeslib.as_array(self.white_values, shape=(self.num_active_white_features,))).clone()
        black_values = torch.from_numpy(np.ctypeslib.as_array(self.black_values, shape=(self.num_active_black_features,))).clone()
        iw = torch.transpose(torch.from_numpy(np.ctypeslib.as_array(self.white, shape=(self.num_active_white_features, 2))), 0, 1).long()
        ib = torch.transpose(torch.from_numpy(np.ctypeslib.as_array(self.black, shape=(self.num_active_white_features, 2))), 0, 1).long()
        us = torch.from_numpy(np.ctypeslib.as_array(self.is_white, shape=(self.size, 1))).clone()
        them = 1.0 - us
        outcome = torch.from_numpy(np.ctypeslib.as_array(self.outcome, shape=(self.size, 1))).clone()
        score = torch.from_numpy(np.ctypeslib.as_array(self.score, shape=(self.size, 1))).clone()
        white = torch._sparse_coo_tensor_unsafe(iw, white_values, (self.size, self.num_inputs))
        black = torch._sparse_coo_tensor_unsafe(ib, black_values, (self.size, self.num_inputs))
        white._coalesced_(True)
        black._coalesced_(True)
        layer_stack_indices = torch.from_numpy(np.ctypeslib.as_array(self.layer_stack_indices, shape=(self.size,))).long()
        return us, them, white, black, outcome, score, layer_stack_indices

Also, try setting --num-workers to 1. It's enough for training on the GPU, so will be well more than enough for a CPU.

vshcherbyna commented 3 years ago

Thanks Sopel97! I copied pasted your original code but the trainer complained about layer_stack_indices, so once I removed layer_stack_indices it seems to be working :) Thank you!

Here is git diff for completeness for those who want to try the same:

diff --git a/nnue_dataset.py b/nnue_dataset.py
index 72b9975..9dadf72 100644
--- a/nnue_dataset.py
+++ b/nnue_dataset.py
@@ -29,14 +29,14 @@ class SparseBatch(ctypes.Structure):
     ]

     def get_tensors(self, device):
-        white_values = torch.from_numpy(np.ctypeslib.as_array(self.white_values, shape=(self.num_active_white_features,))).pin_memory().to(device=device, non_blocking=True)
-        black_values = torch.from_numpy(np.ctypeslib.as_array(self.black_values, shape=(self.num_active_black_features,))).pin_memory().to(device=device, non_blocking=True)
-        iw = torch.transpose(torch.from_numpy(np.ctypeslib.as_array(self.white, shape=(self.num_active_white_features, 2))).pin_memory().to(device=device, non_blocking=True), 0, 1).long()
-        ib = torch.transpose(torch.from_numpy(np.ctypeslib.as_array(self.black, shape=(self.num_active_white_features, 2))).pin_memory().to(device=device, non_blocking=True), 0, 1).long()
-        us = torch.from_numpy(np.ctypeslib.as_array(self.is_white, shape=(self.size, 1))).pin_memory().to(device=device, non_blocking=True)
+        white_values = torch.from_numpy(np.ctypeslib.as_array(self.white_values, shape=(self.num_active_white_features,))).clone()
+        black_values = torch.from_numpy(np.ctypeslib.as_array(self.black_values, shape=(self.num_active_black_features,))).clone()
+        iw = torch.transpose(torch.from_numpy(np.ctypeslib.as_array(self.white, shape=(self.num_active_white_features, 2))), 0, 1).long()
+        ib = torch.transpose(torch.from_numpy(np.ctypeslib.as_array(self.black, shape=(self.num_active_white_features, 2))), 0, 1).long()
+        us = torch.from_numpy(np.ctypeslib.as_array(self.is_white, shape=(self.size, 1))).clone()
         them = 1.0 - us
-        outcome = torch.from_numpy(np.ctypeslib.as_array(self.outcome, shape=(self.size, 1))).pin_memory().to(device=device, non_blocking=True)
-        score = torch.from_numpy(np.ctypeslib.as_array(self.score, shape=(self.size, 1))).pin_memory().to(device=device, non_blocking=True)
+        outcome = torch.from_numpy(np.ctypeslib.as_array(self.outcome, shape=(self.size, 1))).clone()
+        score = torch.from_numpy(np.ctypeslib.as_array(self.score, shape=(self.size, 1))).clone()
         white = torch._sparse_coo_tensor_unsafe(iw, white_values, (self.size, self.num_inputs))
         black = torch._sparse_coo_tensor_unsafe(ib, black_values, (self.size, self.num_inputs))
         white._coalesced_(True)
diff --git a/train.py b/train.py
index 56e6c12..d5f6b94 100644
--- a/train.py
+++ b/train.py
@@ -89,9 +89,7 @@ def main():
   tb_logger = pl_loggers.TensorBoardLogger(logdir)
   checkpoint_callback = pl.callbacks.ModelCheckpoint(save_last=True)
   trainer = pl.Trainer.from_argparse_args(args, callbacks=[checkpoint_callback], logger=tb_logger)
-
-  main_device = trainer.root_device if trainer.root_gpu is None else 'cuda:' + str(trainer.root_gpu)
-
+  main_device = 'cpu'
   if args.py_data:
     print('Using python data loader')
     train, val = data_loader_py(args.train, args.val, feature_set, batch_size, main_device)

official-stockfish / nnue-pytorch

Training on a system with no GPU #87