Remora Model Training with Apple Silicon (M3 Pro) -> TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

akhilp24 commented 4 months ago

Hello,

I would like to run the "remora model train" command on my M3 Pro MacBook Pro with 18GB of RAM, utilizing the device's GPU; however, I run into the error:

TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

after running the command

remora \
  model train \
  train_dataset.jsn \
  --model ConvLSTM_w_ref.py \
  --device mps \
  --chunk-context 50 50 \
  --output-path train_results

where --device mps specifies that I would like to use the GPU on my MacBook for training.

This is the full output following the above command:

[18:54:25.963] Seed selected is 960564580
[18:54:25.964] Loading dataset from Remora dataset config
[18:54:26.006] Dataset summary:
                     size : 84,755
     modified_base_labels : True
                mod_bases : ['o']
           mod_long_names : ['8oxoG']
       kmer_context_bases : (4, 4)
            chunk_context : (50, 50)
                   motifs : [('GGG', 2), ('TTAGGG', 4)]
           reverse_signal : False
 chunk_extract_base_start : False
     chunk_extract_offset : 0
          sig_map_refiner : Loaded 9-mer table with 7 central position. Rough re-scaling will be executed.

[18:54:26.006] Loading model
[18:54:26.015] Model structure:
network(
  (sig_conv1): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
  (sig_bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
  (sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(3,))
  (sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (seq_conv1): Conv1d(36, 16, kernel_size=(5,), stride=(1,))
  (seq_bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (seq_conv2): Conv1d(16, 64, kernel_size=(13,), stride=(3,))
  (seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (merge_conv1): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
  (merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (lstm1): LSTM(64, 64)
  (lstm2): LSTM(64, 64)
  (fc): Linear(in_features=64, out_features=2, bias=True)
  (dropout): Dropout(p=0.3, inplace=False)
)
[18:54:26.602] Params (k) 134.08 | MACs (M) 7327.45
[18:54:26.602] Preparing training settings
Traceback (most recent call last):
  File "/opt/anaconda3/bin/remora", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/remora/main.py", line 71, in run
    cmd_func(args)
  File "/opt/anaconda3/lib/python3.11/site-packages/remora/parsers.py", line 857, in run_model_train
    train_model(
  File "/opt/anaconda3/lib/python3.11/site-packages/remora/train_model.py", line 250, in train_model
    model = model.to(device)
            ^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 853, in _apply
    self._buffers[key] = fn(buf)
                         ^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
           ^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

Running the same command with "cpu" in place of "mps" works, but it is very slow.

Before this, I received an error that the ConvLSTM_w_ref.py file could not be found in the remora library, so I downloaded the file from the remora GitHub repository and placed it in my main directory.

The command listed in the documentation on PyPI and GitHub did not work:

remora \
  model train \
  train_dataset.jsn \
  --model remora/models/ConvLSTM_w_ref.py \
  --device 0 \
  --chunk-context 50 50 \
  --output-path train_results

I am using Remora version: 3.2.0.

Please advise. Thank you.

marcus1487 commented 3 months ago

Unfortunately, this would likely require a large change: Remora models are intended to be transferable to Dorado, and the model architecture (including the float precision) is fixed in the Dorado code. Thus I don't think we can robustly support this in the near term. We can look into using float32 in a future round of model training, but we would need to see little to no reduction in accuracy if we did make this change.

You could train a model that would only be usable within Remora (this is untested) by modifying the remora/models/ConvLSTM_w_ref.py file to specify the dtype as torch.float32 for each layer of the network. I hope this helps. I am going to close this as unplanned for the moment, but if you have specific issues training or running the model with this workaround within Remora, please re-open this thread.
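
For illustration only, here is a minimal sketch of what forcing float32 can look like in plain PyTorch. It does not use the real ConvLSTM_w_ref.py: `ToyNet` and its float64 `scale` buffer are hypothetical stand-ins, and the float64 buffer is an assumption made to mimic the reported error. `nn.Module.float()` casts all floating-point parameters and buffers to float32, which may be simpler than setting the dtype layer by layer. Whether this is enough for Remora's training loop is untested.

```python
import torch
from torch import nn


class ToyNet(nn.Module):
    """Hypothetical stand-in; NOT Remora's ConvLSTM_w_ref network."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 4, kernel_size=5)
        self.bn = nn.BatchNorm1d(4)
        # Assumed culprit: a buffer registered as float64.
        self.register_buffer("scale", torch.ones(1, dtype=torch.float64))

    def forward(self, x):
        return self.bn(self.conv(x)) * self.scale


model = ToyNet()
# nn.Module.float() casts all floating-point parameters and buffers to float32.
model = model.float()

# Fall back to the CPU when the MPS backend is not available.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)

out = model(torch.randn(2, 1, 20, device=device))
print(out.dtype, model.scale.dtype)
```

Integer buffers (for example BatchNorm's `num_batches_tracked`) are left untouched by `.float()`, so only floating-point tensors change precision.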

akhilp24 commented 3 months ago

I tried specifying the dtype as float32 for each layer of the network in that file and was met with this error:

[21:12:25.738] Gradients will be clipped (by value) at 0.00 MADs above the median of the last 1000 gradient maximums.
[21:12:26.305] Params (k) 134.08 | MACs (M) 7327.45
[21:12:26.305] Preparing training settings
Traceback (most recent call last):
  File "/Users/akhilpeddikuppa/miniconda3/bin/remora", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/Users/akhilpeddikuppa/miniconda3/lib/python3.12/site-packages/remora/main.py", line 71, in run
    cmd_func(args)
  File "/Users/akhilpeddikuppa/miniconda3/lib/python3.12/site-packages/remora/parsers.py", line 1008, in run_model_train
    train_model(
  File "/Users/akhilpeddikuppa/miniconda3/lib/python3.12/site-packages/remora/train_model.py", line 349, in train_model
    model = model.to(device)
            ^^^^^^^^^^^^^^^^
  File "/Users/akhilpeddikuppa/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/akhilpeddikuppa/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 853, in _apply
    self._buffers[key] = fn(buf)
                         ^^^^^^^
  File "/Users/akhilpeddikuppa/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
           ^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
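
The traceback still fails at `self._buffers[key] = fn(buf)`, which suggests that a registered buffer, rather than a layer's weights, may still be float64. A minimal diagnostic sketch, assuming only standard PyTorch calls (`find_float64_tensors` is a hypothetical helper, not part of Remora, and the toy module exists only to exercise it):

```python
import torch
from torch import nn


def find_float64_tensors(model):
    """Return (kind, name) pairs for every float64 parameter or buffer."""
    hits = [("parameter", n) for n, p in model.named_parameters()
            if p.dtype == torch.float64]
    hits += [("buffer", n) for n, b in model.named_buffers()
             if b.dtype == torch.float64]
    return hits


# Hypothetical toy module used only to exercise the helper.
toy = nn.BatchNorm1d(4)
toy.register_buffer("offset", torch.zeros(4, dtype=torch.float64))

for kind, name in find_float64_tensors(toy):
    print(f"{kind}: {name}")
```

Calling the helper on the loaded model just before `model.to(device)` would list the tensors that still need casting; where to hook that call into Remora is not shown here.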

nanoporetech/remora, issue #173