track individual particle loss components, speed up inference

speed up PyTorch inference code by using more efficient Awkward Array calls
track individual loss components for each particle type
track (but don't minimize) event-based losses such as the sliced Wasserstein and MET losses
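
To illustrate what tracking loss components per particle type can look like, here is a minimal sketch. It is not the code from this PR: it uses plain Python lists instead of PyTorch tensors so that it runs anywhere, and the `PARTICLE_TYPES` map and the `loss_components_by_type` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical label map, for illustration only; the real class ids may differ.
PARTICLE_TYPES = {1: "charged_hadron", 2: "neutral_hadron", 3: "photon"}


def loss_components_by_type(per_particle_loss, true_type):
    """Return the mean loss for each particle type, plus the overall mean."""
    if len(per_particle_loss) != len(true_type):
        raise ValueError("need exactly one type label per particle loss")
    if not per_particle_loss:
        return {"total": 0.0}

    # Group the unreduced per-particle losses by the true particle type.
    grouped = defaultdict(list)
    for loss, type_id in zip(per_particle_loss, true_type):
        grouped[type_id].append(loss)

    components = {
        PARTICLE_TYPES.get(type_id, f"type_{type_id}"): sum(values) / len(values)
        for type_id, values in grouped.items()
    }
    # The overall mean is what would normally be minimized; the per-type
    # entries are only logged next to it.
    components["total"] = sum(per_particle_loss) / len(per_particle_loss)
    return components


losses = [0.2, 0.4, 1.0, 0.6, 0.3]
types = [1, 1, 2, 3, 3]
components = loss_components_by_type(losses, types)
for name, value in sorted(components.items()):
    print(f"{name}: {value:.3f}")
```

Logging the components separately shows which particle type dominates the total loss without changing what the optimizer sees.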

Training the following mamba-based model:

singularity exec --nv ~/HEP-KBFI/singularity/pytorch.simg\:2023-12-06 python3 mlpf/pyg_pipeline.py --config parameters/pyg-cms.yaml --dataset cms --gpus 1 --data-dir ~/tensorflow_datasets/ --train --test --make-plots --conv-type mamba --num-epochs 5 --gpu-batch-multiplier 5 --num-workers 1 --prefetch-factor 10 --ntest 1000 --checkpoint-freq 1 --lr 0.001

We are also tracking (but not minimizing) the event-based losses (sliced Wasserstein and MET).
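
For reference, a rough sketch of how such event-level quantities can be computed and logged without entering the optimized loss. It rests on several assumptions that do not come from this PR: particles are represented as `(px, py)` tuples, the MET loss is taken as a squared difference, the sliced Wasserstein distance is estimated with random 2D projections over equally sized particle sets, and all function names are hypothetical. It uses only the standard library; the pipeline itself works on PyTorch tensors.

```python
import math
import random


def met(particles):
    """Missing transverse energy: magnitude of the vector sum of (px, py)."""
    sum_px = sum(px for px, _ in particles)
    sum_py = sum(py for _, py in particles)
    return math.hypot(sum_px, sum_py)


def met_loss(pred, target):
    """Squared difference of predicted and target MET (assumed form)."""
    return (met(pred) - met(target)) ** 2


def sliced_wasserstein(pred, target, num_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Both sets are projected onto random unit directions. In one dimension
    the optimal transport plan between equally sized sets pairs the sorted
    samples, so each slice reduces to a mean squared difference.
    """
    if len(pred) != len(target):
        raise ValueError("this sketch assumes equally sized particle sets")
    if not pred:
        return 0.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        ux, uy = math.cos(angle), math.sin(angle)
        proj_pred = sorted(px * ux + py * uy for px, py in pred)
        proj_target = sorted(px * ux + py * uy for px, py in target)
        total += sum((a - b) ** 2 for a, b in zip(proj_pred, proj_target)) / len(pred)
    return total / num_projections


pred = [(1.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]
target = [(1.0, 0.5), (0.0, 2.0), (-1.0, -1.0)]

# These values are only logged; they are kept apart from the training loss.
monitored = {
    "met": met_loss(pred, target),
    "sliced_wasserstein": sliced_wasserstein(pred, target),
}
print(monitored)
```

In a PyTorch training loop, the equivalent of keeping these values out of the optimization would be to compute them under `torch.no_grad()` or on detached tensors.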

The performance on QCD-highpT is as follows:

[plots: jet_res, met_res]

jpata/particleflow#284