waltsims / k-wave-python

A Python interface to k-Wave GPU accelerated binaries
https://k-wave-python.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] The error in "/kwave/bin/linux/kspaceFirstOrder-CUDA" #386

Closed GuilJung closed 2 months ago

GuilJung commented 3 months ago

When I run the us_bmode_linear_transducer.py script with the condition RUN_SIMULATION = True, I encounter the following error:

"Command 'OMP_PLACES=cores OMP_PROC_BIND=SPREAD /mnt/k-wave-python/kwave/bin/linux/kspaceFirstOrder-CUDA -i /tmp/example_input_0.h5 -o /tmp/20-May-2024-23-07-56_kwave_input.h5 --p_raw -s 1' returned non-zero exit status 1.

The location of the code where the error occurs is: sensor_data = executor.run_simulation(k_sim.options.input_filename, k_sim.options.output_filename, options=executor_options) (kspaceFirstOrder3D.py, line 465)

I'm currently using an Nvidia RTX 3090 with CUDA 12.0 and Python 3.9.

I've attempted various troubleshooting steps to identify the cause of the error, but have been unsuccessful.

waltsims commented 3 months ago

Hi @GuilJung,

What version of k-wave-python do you have installed?

GuilJung commented 3 months ago

Hi @waltsims.

I'm sorry for the late reply.

As recommended in the instructions, I'm using version 0.3.3.

waltsims commented 3 months ago

This error can occur when you request a GPU simulation on a machine without a GPU. Can you share the output of nvidia-smi on the machine you're using?
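If you want to sanity-check this from Python first, a minimal sketch (just a plain nvidia-smi probe, not part of k-wave-python) would be:

import shutil
import subprocess

def nvidia_gpu_available() -> bool:
    # True if nvidia-smi is on PATH and reports at least one GPU.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and bool(result.stdout.strip())

print(nvidia_gpu_available())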

waltsims commented 3 months ago

Can you also try running your simulation with the verbosity set to 2 and share the output here? This can be done by setting the verbosity argument in the simulation execution options:

from kwave.options.simulation_execution_options import SimulationExecutionOptions

execution_options = SimulationExecutionOptions(
    is_gpu_simulation=True,
    verbose_level=2,
)

GuilJung commented 3 months ago

Hi,

I followed your advice and set verbose_level=2 in the code as shown below, but the result was the same as before.

The result is attached in the screenshot below: [screenshot]

Here is the code with the verbose_level setting: [screenshot]

This is the result of running nvidia-smi: [screenshot]

GuilJung commented 3 months ago

I have confirmed that torch.cuda.is_available() returns True when executed just before the code that throws an error. Is there any other way to check if the GPU is functioning properly?

The code that generates an error is as follows: sensor_data = executor.run_simulation(k_sim.options.input_filename, k_sim.options.output_filename, options=executor_options) (kspaceFirstOrder3D.py, line 465)

waltsims commented 3 months ago

It seems the exit status you shared was originally 1 (error) and 127 the second time around (command not found). Can you share a screenshot of the k-wave output above the error as well?

GuilJung commented 3 months ago

As you mentioned, I only just noticed that the error code has changed.

The only thing I changed was installing PyTorch 2.0.1 so I could use the torch.cuda.is_available() function.

After checking on another machine, I confirmed that error code 1 is output both before and after installing PyTorch. I don't understand why error code 127 is suddenly being output.

Is the screenshot of the error code you requested like the one below?

[screenshot]

Additionally, the error has not been resolved yet. Could you let me know how to handle the "127 command not found" error?

djps commented 3 months ago

Probably a stupid question but - where is the binary file located? Can you confirm that it matches execution_options.binary_path?

@waltsims , @faridyagubbayli - it seems that https://github.com/waltsims/k-wave-python/blob/master/kwave/options/simulation_execution_options.py#L64 overwrites any user-supplied binary_path. I think this should be changed, so that if a value is supplied by the user it is not overwritten.
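Something along these lines would avoid clobbering a user-supplied path (a hypothetical sketch of the suggested behaviour, not the current implementation; the names are illustrative only):

from pathlib import Path

def resolve_binary_path(user_binary_path, default_binary_dir, binary_name):
    # Prefer an explicitly supplied binary path; only fall back to the
    # bundled default when the user has not provided one.
    if user_binary_path is not None:
        return Path(user_binary_path)
    return Path(default_binary_dir) / binary_name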

waltsims commented 3 months ago

Good point David. We can think of how to address that.

As further debugging steps, can you share the output of the following two commands?

ls /mnt/DB2/JGL/k-wave-python/kwave/bin/linux/

pip freeze | grep kwave

waltsims commented 3 months ago

My reasoning is that an error code of 1 could indicate a simulation or CUDA issue that is not properly caught, whereas an error code of 127 is indicative of a k-wave-python installation issue, i.e. the binaries not being where they are expected to be.

I had worked on error code 1 issues in https://github.com/waltsims/k-wave-python/issues/262
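To make the distinction concrete, here is a rough sketch of how the two cases can be told apart when launching the binary manually (the paths below are placeholders; copy the real command from the error message):

import subprocess

cmd = "OMP_PLACES=cores OMP_PROC_BIND=SPREAD /path/to/kspaceFirstOrder-CUDA -i input.h5 -o output.h5 --p_raw -s 1"
proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)

if proc.returncode == 127:
    # The shell reported "command not found": check the binary path / installation.
    print("binary not found:", proc.stderr)
elif proc.returncode != 0:
    # The binary started but failed, e.g. no usable CUDA device or a simulation error.
    print("simulation failed with exit status", proc.returncode)
    print(proc.stderr)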

GuilJung commented 3 months ago

Thank you for your replies, @waltsims and @djps.

Here is the output of the above commands on my machine:

[screenshot]

Also, I confirmed that the location of self.execution_options.binary_path is "/mnt/DB1/JGL/kwave/k-wave-python/kwave/bin/linux/", and the kspaceFirstOrder-CUDA binary file is included in that folder.
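A minimal way to double-check that (just an existence/executable-bit probe on the same path):

import os
from pathlib import Path

binary = Path("/mnt/DB1/JGL/kwave/k-wave-python/kwave/bin/linux/") / "kspaceFirstOrder-CUDA"
print(binary.exists(), os.access(binary, os.X_OK))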

And below is the 'simulationExecutionOptions' object that is passed to executor.run_simulation: [screenshot]

I want to print details about fatal errors like in #262, but why is nothing printed even after I set verbose_level=2?

GuilJung commented 3 months ago

Could it be a compatibility issue with the CUDA version?

waltsims commented 3 months ago

k-wave-python supports CUDA 12, so I don't believe that is the issue. I tried to reproduce the behavior you are experiencing with a fresh install on CUDA 12 but cannot; the example runs. You can also run the failed command manually from the command line.

i.e. run OMP_PLACES=cores OMP_PROC_BIND=SPREAD /mnt/DB1/JGL/k-wave-python/kwave/bin/linux/kspaceFirstOrder-CUDA -i /tmp/example_input_0.h5 -o /tmp/22-May-2024-17-42-51_kwave_input.h5 --praw -s 1

If this command is no longer valid, you can copy and paste it from the error message that is thrown.

waltsims commented 3 months ago

If I run k-wave-python with no CUDA device available, I can reproduce an exit status of 1, but k-wave-python still prints the binary's output:

 CUDA_VISIBLE_DEVICES="" python3 examples/us_bmode_linear_transducer/us_bmode_linear_transducer.py
┌───────────────────────────────────────────────────────────────┐
│                  kspaceFirstOrder-CUDA v1.3                   │
├───────────────────────────────────────────────────────────────┤
│ Reading simulation configuration:                        Done │
│ Selected GPU device id:                                Failed │
└───────────────────────────────────────────────────────────────┘

Traceback (most recent call last):
  File "/home/wsimson/git/k-wave-python/examples/us_bmode_linear_transducer/us_bmode_linear_transducer.py", line 119, in <module>
    sensor_data = kspaceFirstOrder3D(
  File "/home/wsimson/miniconda3/envs/k-wave-python3.9/lib/python3.9/site-packages/kwave/kspaceFirstOrder3D.py", line 465, in kspaceFirstOrder3D
    sensor_data = executor.run_simulation(k_sim.options.input_filename, k_sim.options.output_filename, options=executor_options)
  File "/home/wsimson/miniconda3/envs/k-wave-python3.9/lib/python3.9/site-packages/kwave/executor.py", line 42, in run_simulation
    raise subprocess.CalledProcessError(proc.returncode, command, stdout, stderr)
subprocess.CalledProcessError: Command 'OMP_PLACES=cores  OMP_PROC_BIND=SPREAD  /home/wsimson/miniconda3/envs/k-wave-python3.9/lib/python3.9/site-packages/kwave/bin/linux/kspaceFirstOrder-CUDA -i /tmp/example_input_0.h5 -o /tmp/23-May-2024-10-25-53_kwave_input.h5  --p_raw -s 1' returned non-zero exit status 1.

This is what a CUDA failure should look like. By making the device visible again, the same simulation runs:

CUDA_VISIBLE_DEVICES="0" python3 examples/us_bmode_linear_transducer/us_bmode_linear_transducer.py
┌───────────────────────────────────────────────────────────────┐
│                  kspaceFirstOrder-CUDA v1.3                   │
├───────────────────────────────────────────────────────────────┤
│ Reading simulation configuration:                        Done │
│ Selected GPU device id:                                     0 │
│ GPU device name:                   NVIDIA GeForce RTX 3090 Ti │
│ Number of CPU threads:                                     20 │
│ Processor name:          12th Gen Intel(R) Core(TM) i7-12700K │
├───────────────────────────────────────────────────────────────┤
│                      Simulation details                       │
├───────────────────────────────────────────────────────────────┤
│ Domain dimensions:                            256 x 128 x 128 │
│ Medium type:                                               3D │
│ Simulation time steps:                                   1586 │
├───────────────────────────────────────────────────────────────┤
│                        Initialization                         │
├───────────────────────────────────────────────────────────────┤
│ Memory allocation:                                       Done │
│ Data loading:                                            Done │
│ Elapsed time:                                           0.02s │
├───────────────────────────────────────────────────────────────┤
│ FFT plans creation:                                      Done │
│ Pre-processing phase:                                    Done │
│ Elapsed time:                                           0.22s │
├───────────────────────────────────────────────────────────────┤
│                    Computational resources                    │
├───────────────────────────────────────────────────────────────┤
│ Current host memory in use:                             532MB │
│ Current device memory in use:                         19184MB │
│ Expected output file size:                                9MB │
├───────────────────────────────────────────────────────────────┤
│                          Simulation                           │
├──────────┬────────────────┬──────────────┬────────────────────┤
│ Progress │  Elapsed time  │  Time to go  │  Est. finish time  │
├──────────┼────────────────┼──────────────┼────────────────────┤
│     0%   │        0.003s  │      2.208s  │  23/05/24 10:26:50 │
│     5%   │        0.219s  │      4.073s  │  23/05/24 10:26:52 │
│    10%   │        0.435s  │      3.879s  │  23/05/24 10:26:52 │
│    15%   │        0.651s  │      3.671s  │  23/05/24 10:26:52 │
│    20%   │        0.870s  │      3.457s  │  23/05/24 10:26:52 │
│    25%   │        1.087s  │      3.245s  │  23/05/24 10:26:52 │
│    30%   │        1.304s  │      3.031s  │  23/05/24 10:26:52 │
│    35%   │        1.523s  │      2.813s  │  23/05/24 10:26:52 │
│    40%   │        1.739s  │      2.598s  │  23/05/24 10:26:52 │
│    45%   │        1.955s  │      2.382s  │  23/05/24 10:26:52 │
│    50%   │        2.175s  │      2.164s  │  23/05/24 10:26:52 │
│    55%   │        2.391s  │      1.948s  │  23/05/24 10:26:52 │
│    60%   │        2.607s  │      1.732s  │  23/05/24 10:26:52 │
│    65%   │        2.823s  │      1.516s  │  23/05/24 10:26:52 │
│    70%   │        3.043s  │      1.297s  │  23/05/24 10:26:52 │
│    75%   │        3.259s  │      1.081s  │  23/05/24 10:26:52 │
│    80%   │        3.476s  │      0.865s  │  23/05/24 10:26:52 │
│    85%   │        3.695s  │      0.646s  │  23/05/24 10:26:52 │
│    90%   │        3.911s  │      0.430s  │  23/05/24 10:26:52 │
│    95%   │        4.128s  │      0.214s  │  23/05/24 10:26:52 │
├──────────┴────────────────┴──────────────┴────────────────────┤
│ Elapsed time:                                           4.36s │
├───────────────────────────────────────────────────────────────┤
│ Sampled data post-processing:                            Done │
│ Elapsed time:                                           0.00s │
├───────────────────────────────────────────────────────────────┤
│                            Summary                            │
├───────────────────────────────────────────────────────────────┤
│ Peak host memory in use:                                533MB │
│ Peak device memory in use:                            19184MB │
├───────────────────────────────────────────────────────────────┤
│ Total execution time:                                   5.67s │
├───────────────────────────────────────────────────────────────┤
│                       End of computation                      │
└───────────────────────────────────────────────────────────────┘

As mentioned, please run the failed command manually and see if the output helps inform us.
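If it is easier, the same device-visibility toggle can also be set from inside the Python script before anything CUDA-related runs (a minimal sketch, equivalent to the CUDA_VISIBLE_DEVICES prefix used above):

import os

# "" hides all GPUs (reproduces the failure above); "0" exposes the first GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# ... then run the example / kspaceFirstOrder3D call as usual.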

GuilJung commented 3 months ago

Thanks for your advice.

Here is the output when I ran the above command:

[screenshot]

"Error: Unknown command line parameter or missing argument."
An error occurred, but the variable was never touched.

Is it possible to determine the cause from the screenshot above?

waltsims commented 3 months ago

It looks like that might have been my mistake. There was a typo in my comment. The flag --praw should be --p_raw:

OMP_PLACES=cores OMP_PROC_BIND=SPREAD /mnt/DB1/JGL/k-wave-python/kwave/bin/linux/kspaceFirstOrder-CUDA -i /tmp/example_input_0.h5 -o /tmp/22-May-2024-17-42-51_kwave_input.h5 --p_raw -s 1

can you confirm which command you ran?

waltsims commented 3 months ago

Hi @GuilJung, can you confirm you are still experiencing this issue? Otherwise, I will close this issue for now.

GuilJung commented 3 months ago

Hi @waltsims

I ran the test following the advice given, and confirmed that it works properly on Ubuntu 20.04 / RTX 3090 / CUDA 12.0 / NVIDIA driver 550.

The machines we used previously were on Ubuntu 18.04, so the above command didn't run correctly because of the glibc version. Apart from that, I haven't yet confirmed whether it works when all the other versions are matched.

I think it's best to close the issue after confirming that everything works properly.

waltsims commented 3 months ago

This is very helpful feedback. Can you please share the glibc version that was not supported and the version that was supported?

GuilJung commented 3 months ago

When using Ubuntu 18.04 with glibc 2.27, a message appeared asking for glibc 2.29 or higher. However, versions newer than 2.27 could not be used on Ubuntu 18.04, and it worked properly with glibc 2.31.
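For anyone checking their own machine, the installed glibc version can be read with ldd --version, or from Python:

import platform

# Reports the C library Python was linked against, e.g. ('glibc', '2.27') on Ubuntu 18.04.
libc, version = platform.libc_ver()
print(libc, version)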

GuilJung commented 2 months ago

Hi @waltsims

Even though I tried installing various driver and CUDA versions on Ubuntu 18.04, it still does not work.

I think it could be related to the glibc version, but I'm wondering whether there is any documentation about this.

There might indeed be OS-specific requirements that need to be met.

waltsims commented 2 months ago

#395 will make it easier to identify such issues in the future. I will have a look at listing the minimum requirements.