mseg-dataset / mseg-semantic

An Official Repo of CVPR '20 "MSeg: A Composite Dataset for Multi-Domain Segmentation"

class #15

Open luocmin opened 3 years ago

luocmin commented 3 years ago

There's a small error with the class count (screenshot). I changed one line of code, replacing args.tc.classes with args.tc.num_uclasses, as shown below (screenshot).

luocmin commented 3 years ago

Could you please tell me how to fix this error? Also, where can I find args.dataset_name? (screenshot) Is there a use_naive_taxonomy parameter missing here? (screenshot)

luocmin commented 3 years ago

I'm still stuck on this problem and don't know how to solve it; the situation is urgent. Thank you!!

luocmin commented 3 years ago

This is how I changed the number of datasets to match my single GPU. I don't know if this output means that training has begun (screenshot).

luocmin commented 3 years ago

Could you spare me a moment? Could you help me with the question I asked two days ago? Thank you.

johnwlambert commented 3 years ago

Hi @luocmin , please pull the latest version. Let me know if this doesn't answer your questions:

(1) You're right that tc.classes has been changed to tc.num_uclasses, thanks for catching that. I've corrected the train.py script in my latest commit.
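For reference, the fix is a one-line rename; a minimal sketch, assuming train.py reads the class count off the TaxonomyConverter (the exact assignment may differ from the real script):

    # before: fails, since TaxonomyConverter no longer exposes `classes`
    args.classes = args.tc.classes
    # after: `num_uclasses` is the number of universal taxonomy classes
    args.classes = args.tc.num_uclasses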

(2) dataset_name is set on line 497: https://github.com/mseg-dataset/mseg-semantic/blob/training/mseg_semantic/tool/train.py#L497

(3) Please pull in the latest master of mseg-semantic into your branch to see the parameter use_naive_taxonomy: https://github.com/mseg-dataset/mseg-semantic/blob/master/mseg_semantic/utils/transform.py#L54
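If it helps, that parameter is consumed by the label-space transform; a hedged sketch of wiring it in (check the exact class name and signature against the linked transform.py):

    from mseg_semantic.utils import transform

    # Map this dataset's labels into the universal taxonomy; use_naive_taxonomy
    # (assumed to default to False) selects a naive class merging instead of
    # MSeg's manually reconciled taxonomy.
    to_universal = transform.ToUniversalLabel('ade20k-150-relabeled', use_naive_taxonomy=False)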

(4) I didn't catch how you are doing the dataset to GPU mapping, could you explain in more detail here?

If you are limited by GPU RAM, you could also concatenate all image IDs into a single dataset, or accumulate gradients over several forward and backward passes before each update; more on both options below.

luocmin commented 3 years ago

@johnwlambert Hi, 1. What I want to ask about this picture is: what does the output look like when training has started successfully? (screenshot)

2. In addition, I modified some of the code in the mseg-3m.yaml file. May I ask whether these modifications are correct? As shown below, I changed the dataset list (screenshot) to:

dataset: [ade20k-150-relabeled]

and the GPU mapping (screenshot) to:

dataset_gpu_mapping: {'ade20k-150-relabeled': [0]}

luocmin commented 3 years ago

3. If I modify this class, do I need to recompile mseg_semantic? (screenshot)

luocmin commented 3 years ago

The following errors occurred when training two datasets on two cards (GPUs). What is the cause, and how do I fix it? (screenshot)

johnwlambert commented 3 years ago

Hi @luocmin, there is no compilation involved in our repo since all files are pure Python or bash.

Our configuration uses 7 processes, and each process handles one dataset: https://github.com/mseg-dataset/mseg-semantic/blob/f8afb3cb637bd5e921a1689681e5a7044a716b57/mseg_semantic/tool/train.py#L537

The GPU index (0, 1, ..., 6) is the process rank, and we call:

from typing import Dict

def get_rank_to_dataset_map(args) -> Dict[int,str]:
    """
        Obtain a mapping from GPU rank (index) to the name of the dataset residing on this GPU.
        Args:
        -   args
        Returns:
        -   rank_to_dataset_map
    """
    rank_to_dataset_map = {}
    for dataset, gpu_idxs in args.dataset_gpu_mapping.items():
        for gpu_idx in gpu_idxs:
            rank_to_dataset_map[gpu_idx] = dataset
    print('Rank to dataset map: ', rank_to_dataset_map)
    return rank_to_dataset_map

# Inside each training process: select this process's dataset by its rank.
args.dataset_name = rank_to_dataset_map[args.rank]
...
# Build the train split for that dataset alone, with its own root and file list.
train_data = dataset.SemData(split='train', data_root=args.data_root[args.dataset_name], data_list=args.train_list[args.dataset_name], transform=train_transform)
...
# Each process therefore gets a DataLoader over a single dataset.
train_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, shuffle=(train_sampler is None), num_workers=args.workers, pin_memory=True, sampler=train_sampler, drop_last=True)
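For example, with the single-GPU config from your earlier comment, dataset_gpu_mapping: {'ade20k-150-relabeled': [0]} yields rank_to_dataset_map = {0: 'ade20k-150-relabeled'}, so the rank-0 process trains only on ADE20K.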


Changing our config by mapping each dataset to the same GPU will mean that only one dataset is trained (the last one to hit line 394 https://github.com/mseg-dataset/mseg-semantic/blob/f8afb3cb637bd5e921a1689681e5a7044a716b57/mseg_semantic/tool/train.py#L394).

You will need a different strategy about how to use fewer GPUs. I mentioned a few already (concatenating all image IDs into a single dataset, which could be sharded across 4 gpus, or instead you could accumulate gradients in place over 2 forward and backward passes, and then perform a single gradient update).
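To make the gradient-accumulation option concrete, here is a minimal, self-contained sketch; the model, criterion, and loader below are toy placeholders, not code from this repo:

    import torch
    import torch.nn as nn

    # Toy stand-ins; in train.py these come from the real training setup.
    model = nn.Conv2d(3, 4, kernel_size=1)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loader = [(torch.randn(2, 3, 8, 8), torch.randint(0, 4, (2, 8, 8)))
              for _ in range(4)]

    accum_steps = 2  # two forward/backward passes per optimizer step
    optimizer.zero_grad()
    for i, (image, target) in enumerate(loader):
        loss = criterion(model(image), target)
        (loss / accum_steps).backward()  # gradients accumulate in .grad buffers
        if (i + 1) % accum_steps == 0:
            optimizer.step()             # one parameter update per two batches
            optimizer.zero_grad()

Combined with halving the per-pass batch size, this keeps the same effective batch size while roughly halving activation memory.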

luocmin commented 3 years ago

Thank you, but as a beginner I won't change the dataloader code for now.

luocmin commented 3 years ago

I have been trying to train two datasets on two cards (GPUs), but it still doesn't work. Do I need to use the script you mentioned before to run the command? There is no such script in the repo (screenshot). This is the command I ran on the server, along with the resulting error: (screenshot)