Where should I put the images and masks if I am training the model using my own dataset?

spreka / biomagdsb

This repository contains the codes to run the nuclei segmentation pipeline of the BIOMAG group inspired by Kaggle's Data Science Bowl 2018 competition


Where should I put the images and masks if I am training the model using my own dataset? #14

Open chnlyi opened 2 years ago

chnlyi commented 2 years ago

I have read the scripts "run_workflow_trainOnly.sh" and "start_training.sh".

I am confused about $IMAGES_DIR, $ORIGINAL_DATA, $TEST1, $TRAIN_UNET, $TRAIN_MASKRCNN.

After much trial and error, I was able to run "start_training.sh" without errors by doing the following:

1. Split my data into train, val and test folders.
2. Run runGenerateValidationCustom.sh on my val data.
3. Copy both train and val data into $TRAIN_UNET and $TRAIN_MASKRCNN.
4. Copy the test data into $TEST1.
5. Copy a few images from the test data into $IMAGES_DIR.

However, I am not sure this is what I should be doing.

Why do I need $IMAGES_DIR?

I am using your original $CLUSTER_CONFIG. Is that right?

spreka commented 2 years ago

@chnlyi Thank you for your interest in the repo. The code was prepared for the DSB2018 competition, where $TEST1 was the stage 1 test set that we used as additional training data in the 2nd stage. When training on your own data, you can leave $TEST1 as it is in the repo. The real test data goes in $IMAGES_DIR: these images are used for style transfer learning, and the trained model predicts segmentations on them. You don't need to use $ORIGINAL_DATA. The training data is split between U-Net ($TRAIN_UNET) and Mask R-CNN ($TRAIN_MASKRCNN) in case you want to train the two networks on different images; you can use the same data in both places.

We suggest running the convenience script start_training.sh (it calls run_workflow_trainOnly.sh). There you can set the IMAGES_DIR variable to the location of your test image folder; the images will be copied to the workflow folder kaggle_workflow/outputs/images/ and used in the pipeline.
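As a rough illustration of the folder roles described above, here is a minimal Python sketch that copies a custom dataset into place. The helper name stage_dataset and all five path arguments are hypothetical; the destinations should be whatever $TRAIN_UNET, $TRAIN_MASKRCNN and $IMAGES_DIR point to in your copy of the scripts. The sketch copies each folder as-is and makes no assumption about the image/mask layout the pipeline expects inside it.

```python
import shutil
from pathlib import Path


def stage_dataset(train_dir, test_dir, train_unet, train_maskrcnn, images_dir):
    """Copy a custom dataset into the folders described in the thread.

    Assumption: the caller passes destinations that match the values of
    $TRAIN_UNET, $TRAIN_MASKRCNN and $IMAGES_DIR in their own scripts;
    this helper does not read those variables itself.
    """
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    staged = {}

    # The same annotated training data can be used for both networks.
    for name, dest in (("TRAIN_UNET", train_unet),
                       ("TRAIN_MASKRCNN", train_maskrcnn)):
        shutil.copytree(train_dir, dest, dirs_exist_ok=True)
        staged[name] = Path(dest)

    # Unlabelled test images: used for style transfer learning and prediction.
    shutil.copytree(test_dir, images_dir, dirs_exist_ok=True)
    staged["IMAGES_DIR"] = Path(images_dir)

    return staged
```

Note that $TEST1 and $ORIGINAL_DATA are deliberately left untouched here, following the advice above. The dirs_exist_ok argument requires Python 3.8 or newer.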

Let me know if this helps.

chnlyi commented 2 years ago

@spreka Thank you so much for the prompt response. Very helpful!

I understand that the real test data goes in $IMAGES_DIR and that this is for prediction.

What I am trying to do is train the models from scratch on my own annotated images. I have my train, validation, and test splits.

1. The first step of the training pipeline is the presegmentation, which uses the $IMAGES_DIR data. So should I copy all of my train and validation images into it?
2. Should I copy my train data into $TRAIN_UNET and $TRAIN_MASKRCNN? What about my validation data? (I experimented a few times, but it seems that I have to run runGenerateValidationCustom.sh on my validation data and copy both train and validation data into $TRAIN_UNET and $TRAIN_MASKRCNN.)

spreka commented 2 years ago

@chnlyi The presegmentation result is also used in the pipeline to prepare masks for style transfer learning and prediction; hence $IMAGES_DIR is only for test images, e.g. unlabelled images from the experiment.

> Should I copy my train data into $TRAIN_UNET and $TRAIN_MASKRCNN? What about my validation data? (I experimented a few times, but it seems that I have to run runGenerateValidationCustom.sh on my validation data and copy both train and validation data into $TRAIN_UNET and $TRAIN_MASKRCNN.)

Yes, copy the training images there. Validation data goes in the $VALIDATION folder, which is kaggle_workflow/outputs/validation by default. You do indeed need to run runGenerateValidationCustom.sh /path/to/validationfolder; the $IMAGES_DIR variable in that script is only used to list the validation images. Sorry for the confusing variable names.
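For anyone scripting this step, here is a small Python sketch that checks the validation folder and builds the command mentioned above. The helper name validation_command is hypothetical, and invoking the script through bash from the directory where it lives is an assumption; only the single folder argument comes from this thread.

```python
from pathlib import Path


def validation_command(validation_dir, script="runGenerateValidationCustom.sh"):
    """Build the command that registers a custom validation folder.

    Assumption: the script is launched with bash and takes the validation
    folder as its only argument, as in the reply above.
    """
    validation_dir = Path(validation_dir)
    if not validation_dir.is_dir():
        raise NotADirectoryError(f"validation folder not found: {validation_dir}")
    if not any(validation_dir.iterdir()):
        raise ValueError(f"validation folder is empty: {validation_dir}")
    # Return a list so it can be passed to subprocess.run without a shell.
    return ["bash", str(script), str(validation_dir)]
```

The returned list can be passed to subprocess.run(cmd, check=True); the function itself only validates the folder and does not execute anything.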