normalize the classification loss by the number of non-padded elements
normalize the regression loss by the number of true target particles and by the stddev of the regression targets, for more stable loss values
reparametrize the attention network in terms of num_heads and head_dim
produce new version 1.7.1 of cms_pf_multi_particle_gun with more stats
remove the unneeded pad_power_of_two option (FlashAttention seems to handle that internally); not sure why I thought it was needed
make regression output type configurable
disable charge prediction for now (so far we didn't really study its performance)
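
A rough sketch of the two loss normalizations listed above, in plain Python on flat lists so it runs without dependencies. The function names, the argument names and the choice of squared error for the regression term are assumptions made for illustration, not the code in this PR; the actual training code would do the same thing with tensor operations.

```python
def masked_classification_loss(per_elem_loss, mask):
    """Average a per-element classification loss over non-padded elements.

    per_elem_loss: unreduced loss, one value per element (padded ones included)
    mask: 1 for real elements, 0 for padding
    """
    n_valid = sum(mask)
    total = sum(loss for loss, m in zip(per_elem_loss, mask) if m)
    # Guard against a fully padded batch (assumed behaviour: return 0.0)
    return total / max(n_valid, 1)


def normalized_regression_loss(pred, target, is_true_particle, target_std):
    """Squared error on targets scaled by their stddev, averaged over true particles.

    is_true_particle: 1 where a true target particle exists, 0 otherwise
    target_std: stddev of the regression target, assumed precomputed
    """
    n_true = sum(is_true_particle)
    total = sum(
        ((p - t) / target_std) ** 2
        for p, t, m in zip(pred, target, is_true_particle)
        if m
    )
    return total / max(n_true, 1)
```

Dividing by the count of valid elements rather than the padded length keeps the loss scale independent of how much padding a batch happens to contain, and dividing by the target stddev puts regression targets with different ranges on a comparable scale.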