Reproducing with TCGA image data fails: model output overflows to NaN

Where the error occurs: prov-gigapath (main branch), finetune/training, in the evaluate function:

def evaluate(loader, model, fp16_scaler, loss_fn, epoch, args):
    model.eval()
    # set the evaluation records
    records = get_records_array(len(loader), args.n_classes)
    # get the task setting
    task_setting = args.task_config.get('setting', 'multi_class')
    with torch.no_grad():
        for batch_idx, batch in enumerate(loader):
            # load the batch and transform this batch
            images, img_coords, label = batch['imgs'], batch['coords'], batch['labels']
            images = images.to(args.device, non_blocking=True)
            img_coords = img_coords.to(args.device, non_blocking=True)
            label = label.to(args.device, non_blocking=True).long()
            # autocast to float16 is enabled only when a grad scaler was passed in
            with torch.cuda.amp.autocast(enabled=fp16_scaler is not None, dtype=torch.float16):
                # debug prints I added: model inputs
                print(images)
                print(img_coords)
                # get the logits
                logits = model(images, img_coords)
                # debug print I added: model output
                print(logits)
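
One way to narrow this down is to repeat the same forward pass under different precisions and see which ones produce non-finite logits. The sketch below is only an illustration: `logits_are_finite` is a hypothetical helper, not part of prov-gigapath, and it assumes the model can run in the chosen precision at all (some attention kernels only accept half-precision inputs, so the float32 run may not be possible).

```python
import torch


def logits_are_finite(model, images, img_coords, amp_dtype=None):
    """Run one forward pass and report whether every logit is finite.

    amp_dtype=None disables autocast (plain float32). Otherwise autocast is
    enabled with the given dtype, e.g. torch.float16 or torch.bfloat16.
    Hypothetical helper for debugging, not part of the repository.
    """
    device_type = images.device.type  # "cuda" or "cpu"
    model.eval()
    with torch.no_grad():
        if amp_dtype is None:
            ctx = torch.autocast(device_type=device_type, enabled=False)
        else:
            ctx = torch.autocast(device_type=device_type, dtype=amp_dtype)
        with ctx:
            logits = model(images, img_coords)
    return bool(torch.isfinite(logits).all())


# Possible use inside the evaluation loop (names taken from evaluate above):
#   for dtype in (None, torch.bfloat16, torch.float16):
#       print(dtype, logits_are_finite(model, images, img_coords, dtype))
```

If the logits are finite in float32 or bfloat16 but not in float16, that points towards a half-precision range problem rather than bad input features. It does not by itself say where in the model the overflow happens.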

Data source: TCGA-HF-7132-01Z-00-DX1 (more than 200 other TCGA images, not listed here, fail in the same way)

Image feature matrix produced by the tile encoder (images): tensor([[[-0.1227, 0.9527, 0.4235, ..., -0.6721, -0.6071, 1.1120], [-0.1409, 1.0089, 0.2850, ..., 0.0581, -0.9595, 1.1386], [ 0.8467, -0.1940, 0.7480, ..., -0.4585, -1.4716, 1.0268], ..., [ 0.5098, 0.1598, 0.8739, ..., -0.1878, -0.5160, 0.7138], [ 0.3427, 0.3222, 0.4500, ..., -1.0347, -1.1425, 1.6356], [ 0.2730, 1.4231, -0.0403, ..., -1.1742, -0.8633, 1.2176]]], device='cuda:0')

Image coordinate matrix produced by the tile encoder (img_coords): tensor([[[137044., 12944.], [120660., 26256.], [120660., 19088.], ..., [ 65362., 11920.], [ 79698., 25232.], [ 1872., 23184.]]], device='cuda:0')

Every value in the output matrix is NaN (logits): tensor([[nan, nan]], device='cuda:0', dtype=torch.float16)
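
Before looking at the model itself, it may help to confirm that the inputs are finite and to see how their magnitudes compare with the float16 range. The sketch below uses NumPy on a few of the coordinate values printed above. `summarize` is a hypothetical helper, and the float16 cast at the end is only a demonstration of the number format: this issue does not establish that the model ever casts the coordinates to half precision.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # largest finite float16 value, 65504.0


def summarize(name, values):
    """Report NaN/inf presence and value range for one input array."""
    arr = np.asarray(values, dtype=np.float32)
    finite = arr[np.isfinite(arr)]
    return {
        "name": name,
        "has_nan": bool(np.isnan(arr).any()),
        "has_inf": bool(np.isinf(arr).any()),
        "min": float(finite.min()) if finite.size else None,
        "max": float(finite.max()) if finite.size else None,
        # True when some finite value is too large to represent in float16
        "exceeds_fp16": bool(finite.size and np.abs(finite).max() > FP16_MAX),
    }


# A few of the coordinate values printed in this issue.
coords = [[137044.0, 12944.0], [120660.0, 26256.0], [65362.0, 11920.0], [1872.0, 23184.0]]
report = summarize("img_coords", coords)
print(report)

# What a direct float16 cast does to the largest coordinate.
with np.errstate(over="ignore"):
    as_half = np.float32(137044.0).astype(np.float16)
print(as_half)  # inf
```

For the tensors in the evaluation loop, the same check could be run on `images.cpu().numpy()` and `img_coords.cpu().numpy()`. The printed feature values are all of order 1, while several coordinates are above 65504, so any step that stores coordinates in float16 would be a candidate to inspect.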

I have uploaded the code and my experimental data to Google Drive:

https://drive.google.com/file/d/1MHt_JbzRFCInIqYu47vL9a2MTdwvUMjU/view?usp=sharing

Reproduction process: TCGA data -> processed with 1.py -> H5 files obtained -> evaluated with a modified PANDA.sh

1.py: sends the TCGA files through the tile encoder and writes the H5 files

H5 file: located in data/dinov2_features/h5files

Data CSV files: in datacsv/PANDA, divided into:
PANDA: labels for all 301 TCGA images
Train: training set
Test: test set
Val: validation set

Use the following command to reproduce the error:

bash scripts/run_panda.sh data/GigaPath_PANDA_embeddings/h5_files

Issue #85, prov-gigapath / prov-gigapath