hsiangyuzhao / RCPS

Official implementation of Rectified Contrastive Pseudo Supervision
MIT License

Training on a custom dataset #8

Closed youngprogrammerBee closed 1 year ago

youngprogrammerBee commented 1 year ago

Hello, and first of all thank you very much for your work! If I want to try training on my own dataset, which files do I need to modify? I noticed you have not yet provided support for custom datasets, but I would really like to try your model and see the segmentation results. Thanks :)

youngprogrammerBee commented 1 year ago

Could you also share the list of libraries/packages that need to be installed for the environment? Thanks!

hsiangyuzhao commented 1 year ago

Hi @youngprogrammerBee , Thank you for your interest in our work. Training on a customized dataset does NOT require any additional Python libraries. A brief introduction to custom data training is given below:

  1. Prepare your data in NIfTI format (or another format that the MONAI library supports; please refer to their official documentation for details). A conversion sketch is given at the end of this reply;

  2. Split your training and validation data under the root path of your data storage. The expected layout is listed below:

    - data_root
      - train_images
      - train_labels
      - val_images
      - val_labels

    It should be noted that all training images must have labels here: to simulate the semi-supervised setting of the LA or Pancreas dataset, a portion of the labeled images is then treated as unlabeled.

  3. If you are training in a real-world semi-supervised setting (some images are labeled while others are not), you need to prepare your labeled data and unlabeled data separately:

    - labeled_root
      - train_images
      - train_labels
      - val_images
      - val_labels
    - unlabeled_root
      - train_images
      - train_labels (an empty folder)

    Then you need to modify the train.py file as follows:

    # two pipelines: one for the labeled data, one for the unlabeled data
    data_pipeline = TrainValDataPipeline(image_root, 'labeled', label_ratio=1.0, random_seed=seed)
    unlabeled_pipeline = TrainValDataPipeline(unlabeled_root, 'unlabeled', label_ratio=1.0, random_seed=seed)
    trainset, _, valset = data_pipeline.get_dataset(train_aug, val_aug, cache_dataset=False)
    unlabeled_set, _, _ = unlabeled_pipeline.get_dataset(train_aug, val_aug, cache_dataset=False)
    # distributed samplers for DDP training
    train_sampler = DistributedSampler(trainset, shuffle=True)
    unlabeled_sampler = DistributedSampler(unlabeled_set, shuffle=True)
    val_sampler = DistributedSampler(valset)

    By defining two data pipeline instances, you can generate the corresponding training dataset, unlabeled dataset, and test dataset, respectively. Training is then identical to the LA or Pancreas dataset (a DataLoader sketch follows this list).

  4. Support for 2D training is under development.
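
As a usage sketch for step 3: the following is not the repository's actual code; the batch sizes, worker counts, and num_epochs are placeholder assumptions. It shows how the datasets and samplers above would typically be wrapped into DataLoaders for DDP training:

    from torch.utils.data import DataLoader

    # wrap each dataset with its DistributedSampler; do not also pass
    # shuffle=True to the loader, since the sampler handles shuffling
    train_loader = DataLoader(trainset, batch_size=2, sampler=train_sampler, num_workers=4, pin_memory=True)
    unlabeled_loader = DataLoader(unlabeled_set, batch_size=2, sampler=unlabeled_sampler, num_workers=4, pin_memory=True)
    val_loader = DataLoader(valset, batch_size=1, sampler=val_sampler, num_workers=2, pin_memory=True)

    # call set_epoch at the start of every epoch so each epoch reshuffles
    for epoch in range(num_epochs):  # num_epochs is a placeholder
        train_sampler.set_epoch(epoch)
        unlabeled_sampler.set_epoch(epoch)
        # ... training and validation steps ...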

Please note that the training hyperparameters may need careful finetuning on a customized dataset, and the performance of our method can NOT be guaranteed. Thanks again for your interest. Please feel free to ask if you have further questions.
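
On step 1, a minimal conversion sketch, assuming your volumes start out as NumPy arrays; the file names and the identity affine are placeholders to replace with your own:

    import numpy as np
    import nibabel as nib

    # convert one volume stored as a NumPy array into compressed NIfTI;
    # 'case_001.npy' and the identity affine are placeholders
    volume = np.load('case_001.npy')
    affine = np.eye(4)  # replace with the true voxel-to-world matrix if available
    nib.save(nib.Nifti1Image(volume.astype(np.float32), affine), 'case_001.nii.gz')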

xqzeng13 commented 1 year ago

Hello, I would like to switch to my own dataset, but my device does not support distributed training. How should I modify DistributedSampler(trainset, shuffle=True)?

hsiangyuzhao commented 1 year ago

Hi @xqzeng13 , You do not need to change it. DDP is also supported in the single-card scenario. You can run the training script as follows:

CUDA_VISIBLE_DEVICES='0' torchrun --nproc_per_node=1 train.py --mixed --benchmark --task $TASK --exp_name $EXP_NAME --wandb --entity $USER_NAME
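
For reference, a minimal sketch of why no change is needed (the backend choice and variable handling are illustrative, not the repository's exact code): torchrun exports the DDP environment variables even with a single process, so the world size is simply 1.

    import os
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK even when
    # --nproc_per_node=1, so the process group initializes normally
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    # with world_size == 1, DistributedSampler behaves like an ordinary
    # (optionally shuffled) sampler, so no code changes are needed
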
youngprogrammerBee commented 1 year ago

Hello, thanks for the reply! I prepared my dataset as you described (choosing option 3, i.e. with an unlabeled_root), and ran into the following questions during training:

1. My data are all in nii.gz format; that should be fine, right?
2. Looking at the code you said needs modifying, image_root comes from the prepare_experiment function, but that function does not return a path for unlabeled_root. Should I follow one of the tasks already in that function (LA or Pancreas), or do I need to define a new task? If a new task is needed, how should it be defined?

[screenshot]

My current directory layout is shown below:

[screenshot of the directory layout]

where semi_label corresponds to labeled_root

hsiangyuzhao commented 1 year ago

Hi @youngprogrammerBee ,

1. My data are all in nii.gz format; that should be fine, right?

NIfTI is supported, that's fine.

  2. Looking at the code you said needs modifying, image_root comes from the prepare_experiment function, but that function does not return a path for unlabeled_root. Should I follow one of the tasks already in that function (LA or Pancreas), or do I need to define a new task? If a new task is needed, how should it be defined?

You can simply pass labeled_root and unlabeled_root directly, without calling the prepare_experiment function. We introduced prepare_experiment just to simplify train.py, as we did not want to fix the arguments in the main training file. You can either define a new prepare_experiment function that suits your case, or simply hardcode all the required arguments in the main training file, as in the sketch below.
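
A minimal sketch of the hardcoding option (the paths and seed are placeholders for your own setup, not the repository's defaults):

    # bypass prepare_experiment and set the data roots directly in train.py
    labeled_root = '/data/my_dataset/labeled_root'      # train_images / train_labels / val_images / val_labels
    unlabeled_root = '/data/my_dataset/unlabeled_root'  # train_images plus an empty train_labels
    seed = 42  # placeholder random seed

    data_pipeline = TrainValDataPipeline(labeled_root, 'labeled', label_ratio=1.0, random_seed=seed)
    unlabeled_pipeline = TrainValDataPipeline(unlabeled_root, 'unlabeled', label_ratio=1.0, random_seed=seed)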

shenbw99 commented 1 year ago

@youngprogrammerBee Hello! Did you eventually manage to run this on your own dataset? I am also doing research in this area; would you be willing to exchange ideas?