Cascades attention_code down to the decoder Transformer blocks (this was not the case before)
fixes the flash_attn dim ordering (it still won't work on grids larger than 64x64 because unetrpp splits attention heads across channels rather than across the flattened physical (pixels, voxels) dims; see the sketch after this list)
adds num_heads_decoder to allow changing the number of attention heads in the decoder
adds a Dockerfile for working on EWC with A100 GPUs and flash_attn; this requires an older CUDA version, and documentation was added for it
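
For context on the dim-ordering point, here is a minimal sketch (not the unetrpp code; the feature-map sizes, tensor names and head count are made-up assumptions) of the `(batch, seqlen, nheads, headdim)` layout that `flash_attn_func` expects, and of why, as I read it, the channel-wise head split is what blocks larger grids:

```python
# Minimal sketch, not the unetrpp code: shows the layout flash_attn_func expects
# and the per-head dim constraint. All shapes and names below are assumptions.
import torch
from flash_attn import flash_attn_func

B, C, H, W = 2, 256, 64, 64                      # hypothetical feature map
num_heads = 8
x = torch.randn(B, C, H, W, dtype=torch.float16, device="cuda")  # fp16 + CUDA required

# flash_attn_func takes q, k, v shaped (batch, seqlen, nheads, headdim).
# Splitting heads across the flattened physical dims (pixels here) keeps
# headdim = C // num_heads small, under flash_attn's head-dim cap
# (256 in recent flash-attention releases).
tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
q = k = v = tokens.reshape(B, H * W, num_heads, C // num_heads)
out = flash_attn_func(q, k, v)                   # (B, H*W, num_heads, C // num_heads)

# unetrpp splits the heads across channels instead, so (as far as I can tell)
# the flattened grid size ends up in the headdim slot and grows with the
# domain, which is what breaks flash_attn on grids larger than 64x64.
```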