graphnet-team / graphnet

A Deep learning library for neutrino telescopes
https://graphnet-team.github.io/graphnet/
Apache License 2.0

crippling gpu performance #440

Closed MoustHolmes closed 1 year ago

MoustHolmes commented 1 year ago

I have been running on graphnet-cleaned data and training has been much slower than I would expect, even for a larger dataset. The previous dataset, which wasn't graphnet cleaned, is about 1/10 the size of the current one, yet I could run roughly 10 epochs in an hour. The current dataset has now been training for almost 24 hours and has only completed half an epoch, which is slower than I would expect from the size increase alone. A possible culprit is that the pulsemap contains events with no hits due to cleaning. I have also noticed that training slows down significantly when the task has more than one target.
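A quick way to check the empty-event suspicion is to count truth events with no remaining pulses directly in the database. This is a minimal sketch using only the standard-library sqlite3 module, with the database path, pulsemap table, and index column taken from the data config below; the assumed schema (a `truth` table keyed by `event_no`) should be adjusted if yours differs.

```python
# Sketch: count truth events that have no pulses left after cleaning.
# Table and column names are taken from the data config below; treat the
# schema as an assumption rather than a guaranteed layout.
import sqlite3

db_path = (
    "/groups/icecube/petersen/GraphNetDatabaseRepository/Upgrade_Data/sqlite3/"
    "dev_step4_upgrade_028_with_noise_dynedge_pulsemap_v3_merger_aftercrash.db"
)
pulsemap = "SplitInIcePulses_dynedge_v2_Pulses"

with sqlite3.connect(db_path) as con:
    (n_total,) = con.execute("SELECT COUNT(*) FROM truth").fetchone()
    (n_empty,) = con.execute(
        "SELECT COUNT(*) FROM truth WHERE event_no NOT IN "
        f"(SELECT DISTINCT event_no FROM {pulsemap})"
    ).fetchone()

print(f"{n_empty} of {n_total} events have no pulses after cleaning")
```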

wandb logs

data config

path: /groups/icecube/petersen/GraphNetDatabaseRepository/Upgrade_Data/sqlite3/dev_step4_upgrade_028_with_noise_dynedge_pulsemap_v3_merger_aftercrash.db
pulsemaps:
  - SplitInIcePulses_dynedge_v2_Pulses
features:
  - dom_x
  - dom_y
  - dom_z
  - dom_time
  - charge
  - rde
  - pmt_area
  - string
  - pmt_number
  - dom_number
  - pmt_dir_x
  - pmt_dir_y
  - pmt_dir_z
  - dom_type
truth:
  - energy
  - position_x
  - position_y
  - position_z
  - azimuth
  - zenith
  - pid
  - elasticity
  - inelasticity
  - energy_track
  - sim_type
  - interaction_type
  - energy_log10 # transformed variables added by me
  - energy_track_log10 # transformed variables added by me
  - energy_qt # transformed variables added by me
  - energy_track_qt # transformed variables added by me
index_column: event_no
truth_table: truth
seed: 21
selection:
  test: event_no % 7 == 0 & abs(pid) == 14
  validation: event_no % 7 == 1 & abs(pid) == 14
  train: event_no % 7 > 1 & abs(pid) == 14
...
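For context on the size increase, the selection strings above can be translated into SQL to count how many events actually land in each split. This is a sketch under the same schema assumption as above (a `truth` table keyed by `event_no`); note that SQL uses `=` where the selection strings use `==`.

```python
# Sketch: count events per selection, mirroring the `selection` block above.
import sqlite3

db_path = (
    "/groups/icecube/petersen/GraphNetDatabaseRepository/Upgrade_Data/sqlite3/"
    "dev_step4_upgrade_028_with_noise_dynedge_pulsemap_v3_merger_aftercrash.db"
)
selections = {
    "test": "event_no % 7 = 0 AND abs(pid) = 14",
    "validation": "event_no % 7 = 1 AND abs(pid) = 14",
    "train": "event_no % 7 > 1 AND abs(pid) = 14",
}

with sqlite3.connect(db_path) as con:
    for name, where in selections.items():
        (n,) = con.execute(f"SELECT COUNT(*) FROM truth WHERE {where}").fetchone()
        print(f"{name:>10}: {n} events")
```

model config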
arguments:
  coarsening: null
  detector:
    ModelConfig:
      arguments:
        graph_builder:
          ModelConfig:
            arguments: {columns: null, nb_nearest_neighbours: 8}
            class_name: KNNGraphBuilder
        scalers: null
      class_name: IceCubeUpgrade
  gnn:
    ModelConfig:
      arguments:
        add_global_variables_after_pooling: false
        dynedge_layer_sizes: null
        features_subset: null
        global_pooling_schemes: [min, max, mean, sum]
        nb_inputs: 14
        nb_neighbours: 8
        post_processing_layer_sizes: null
        readout_layer_sizes: null
      class_name: DynEdge
  optimizer_class: '!class torch.optim.adam Adam'
  optimizer_kwargs: {eps: 0.001, lr: 1e-05}
  scheduler_class: '!class torch.optim.lr_scheduler ReduceLROnPlateau'
  scheduler_config: {frequency: 1, monitor: val_loss}
  scheduler_kwargs: {patience: 5}
  tasks:
  - ModelConfig:
      arguments:
        hidden_size: 128
        loss_function:
          ModelConfig:
            arguments: {}
            class_name: LogCoshLoss
        loss_weight: null
        target_labels: inelasticity
        transform_inference: null
        transform_prediction_and_target: null
        transform_support: null
        transform_target: null
      class_name: InelasticityReconstruction
class_name: StandardModel
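For completeness, the two YAML documents above (data config and model config) are normally turned back into objects along these lines. This follows the DatasetConfig/ModelConfig loading pattern shown in graphnet's example scripts; the file names are hypothetical, and the exact import paths and signatures may differ between graphnet versions, so treat this as an outline rather than the definitive API.

```python
# Sketch: rebuild the dataset splits and the model from the configs above.
# File names are hypothetical; the load/from_config calls follow graphnet's
# documented example pattern but should be checked against your version.
from graphnet.data.dataset import Dataset
from graphnet.models import Model
from graphnet.utilities.config import DatasetConfig, ModelConfig

dataset_config = DatasetConfig.load("upgrade_data_config.yml")  # hypothetical path
model_config = ModelConfig.load("dynedge_inelasticity.yml")     # hypothetical path

# With a dict-valued `selection`, one dataset is returned per split.
datasets = Dataset.from_config(dataset_config)
train_set, val_set = datasets["train"], datasets["validation"]

# trust=True is needed because the model config embeds `!class` references.
model = Model.from_config(model_config, trust=True)
```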
MortenHolmRep commented 1 year ago

From the logs it looks like you are using GPU 1, but the GPU utilization rate on GPU 1 is abysmal, which suggests the run is not set up properly. Have you tested different configurations, such as learning rate, batch size, etc.?

I would recommend terminating the run, as it is not using any GPU resources. I have had similar runs where I had to test different configurations.
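For a concrete way to confirm the utilization numbers, a small monitor that polls nvidia-smi while the training job is running is enough. This is a generic sketch, not graphnet-specific.

```python
# Sketch: poll nvidia-smi every 10 s to see whether the GPU ever gets busy.
import subprocess
import time

for _ in range(30):  # roughly five minutes of monitoring
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu,memory.used",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())
    time.sleep(10)
```

If utilization stays near 0% while CPU usage is high, the bottleneck is on the data side rather than in the model.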

asogaard commented 1 year ago

I second @MortenHolmRep's point above: you are seeing slow training because the GPU is basically not being utilised. This could be due to a too-small batch size, too few workers loading data, expensive CPU operations in the model or in the dataloaders, etc., or someone else running their training script at the same time as yours and doing some of the above.
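One way to separate those causes is to time the dataloader with no model in the loop; if that alone is slow, the problem is CPU-side (pulse reading, graph building, too few workers) rather than the GPU. A minimal sketch, where `train_dataloader` is a stand-in name for whatever loader the training script builds:

```python
# Sketch: measure raw dataloader throughput, no model involved.
# `train_dataloader` is a stand-in for the loader built by the training script.
import time

n_batches = 100
start = time.perf_counter()
for i, batch in enumerate(train_dataloader):
    if i + 1 == n_batches:
        break
elapsed = time.perf_counter() - start
events_per_s = n_batches * train_dataloader.batch_size / elapsed
print(f"{n_batches} batches in {elapsed:.1f} s ({events_per_s:.0f} events/s)")
```

If the resulting events/s cannot cover the dataset in a reasonable time, increasing `num_workers` and `batch_size` in the dataloader is the first thing to try.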

I'll be closing this issue as, in its current form, I wouldn't consider this a bug in the graphnet code, but rather a poorly performing training run. I suggest you use the #help channel in Slack to get feedback on improving the GPU utilisation. If you do narrow the problem down to a specific part of the graphnet code, please do open a dedicated issue.