seoungwugoh / STM

Video Object Segmentation using Space-Time Memory Networks
405 stars 81 forks

Some questions about training #6

Open aaaaannnnn11111 opened 4 years ago

aaaaannnnn11111 commented 4 years ago

Thanks for answering my previous question, but I still have a few more: how did you choose the first frame of the 3 temporally ordered frames? After how many epochs do you increase maximum_skip? What is maximum_skip when the dataset is YouTube-VOS? Thanks a lot!

seoungwugoh commented 4 years ago

Hi, here are the answers:

1. Among the sampled frames, the one with the lowest frame index is the first frame. The first frame of a training clip does not need to be the first frame of the video.
2. maximum_skip is increased by 5 every 20 epochs during main training.
3. maximum_skip is used the same way for DAVIS and YouTube-VOS.

aaaaannnnn11111 commented 4 years ago

Question about 1: so you randomly choose the first frame, and then pick the other two frames based on its index, right? For each epoch, did you use all images of each video, or only 3 frames per video? I found training would be too slow if I used all images.

Question about 3: in YouTube-VOS, the video "b938d79dff" has only four training frames, which is fewer than maximum_skip. How did you handle this case?

seoungwugoh commented 4 years ago

1. We only use the 3 sampled frames for training; that is why we sample frames from videos.
2. maximum_skip is the maximum possible skip; the actual skip is randomly selected in [0, maximum_skip]. In such cases, we simply choose a skip that does not go out of index.
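The sampling scheme described above can be sketched as follows. This is my own minimal illustration (the official training code is not released), assuming frames are indexed 0 to num_frames - 1 and each inter-frame gap is clamped so the clip stays in range:

```python
import random

def sample_training_frames(num_frames, maximum_skip):
    """Sample 3 temporally ordered frame indices from one video.

    The skip between consecutive sampled frames is drawn from
    [0, maximum_skip], clamped so all indices stay inside the video
    (hypothetical helper, not the authors' exact implementation).
    """
    # Total "slack" available for the two gaps after placing 3 frames.
    max_total = num_frames - 3
    skip1 = random.randint(0, min(maximum_skip, max_total))
    skip2 = random.randint(0, min(maximum_skip, max_total - skip1))
    start = random.randint(0, num_frames - 3 - skip1 - skip2)
    f1 = start
    f2 = f1 + 1 + skip1
    f3 = f2 + 1 + skip2
    return f1, f2, f3
```

For the 4-frame video mentioned above, the clamping automatically reduces the effective skip to fit, which matches the "does not go out of index" answer.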

aaaaannnnn11111 commented 4 years ago

Did you use all images when you trained on COCO? The COCO dataset is very large...

siyueyu commented 4 years ago

Did you apply random affine transforms to the samples in main training? @seoungwugoh

seoungwugoh commented 4 years ago

@aaaaannnnn11111 We use the entire COCO training set. @siyueyu Yes, but to a much smaller degree than in pre-training. You can consider it a common data augmentation procedure.

siyueyu commented 4 years ago

Thanks a lot! @seoungwugoh

aaaaannnnn11111 commented 4 years ago

@seoungwugoh Some images in COCO have more than 90 mask objects. Did you use a threshold to limit the number of objects per image?

seoungwugoh commented 4 years ago

@aaaaannnnn11111 Yes, we randomly select 3 objects if an image/video contains more than 3 objects.

siyueyu commented 4 years ago

You said maximum_skip is increased by 5 every 20 epochs during main training. I am wondering whether 20 epochs is enough for each maximum_skip. Does it mean that for each maximum_skip we need to train until convergence before moving to the next one, or not? @seoungwugoh

seoungwugoh commented 4 years ago

@siyueyu It is just an empirically chosen hyper-parameter, not thoroughly validated. You may train the model until convergence for each stage of the curriculum.

gewenbin292 commented 4 years ago

Hi, I have a question about how to add YouTube-VOS when training the DAVIS model. Should I use ConcatDataset and WeightedRandomSampler to sample the two datasets? Or should I first train on YouTube-VOS, then on DAVIS? Thanks!

npmhung commented 4 years ago

Hi, I have a question about training on the main datasets such as DAVIS/YouTube-VOS.

When you sample 3 frames from a video, do you resize them to (384, 384)? If you do, do you keep the aspect ratio between width and height?

seoungwugoh commented 4 years ago

@gewenbin292 We use ConcatDataset, and DAVIS is weighted 5 times more heavily than YouTube-VOS:

```python
Trainset = data.ConcatDataset(5 * [DAVISset] + [Youtubeset])
```

A better implementation may be possible.

@npmhung First, random resizing to [384, shorter_side_original_size] and a random crop of [384, 384] are performed. Then an affine transform is applied with the following parameter ranges. For pre-training: rotation=[-20, 20], shear=[-10, 10], zoom=[0.9, 1.1]. For main training: rotation=[-15, 15], shear=[-10, 10], zoom=[0.95, 1.05]. The zoom is applied independently to width and height, so it can change the aspect ratio.
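Drawing these augmentation parameters can be sketched as below. This only samples the parameters within the stated ranges; applying them to frames is left to whatever transform library you use, and the function and key names are mine, not from the released code:

```python
import random

def sample_augmentation(orig_h, orig_w, phase="main"):
    """Sample resize/crop/affine parameters in the ranges quoted above.

    phase is "pre" (pre-training) or "main" (main training).
    Returns a dict of parameters (hypothetical helper).
    """
    # Random resize: new shorter side drawn from [384, original shorter
    # side]; the longer side is scaled by the same factor.
    short = min(orig_h, orig_w)
    new_short = random.randint(384, max(384, short))
    scale = new_short / short
    new_h, new_w = round(orig_h * scale), round(orig_w * scale)
    # Random 384x384 crop inside the resized frame.
    crop_y = random.randint(0, new_h - 384)
    crop_x = random.randint(0, new_w - 384)
    ranges = {
        "pre":  {"rotation": (-20, 20), "shear": (-10, 10), "zoom": (0.90, 1.10)},
        "main": {"rotation": (-15, 15), "shear": (-10, 10), "zoom": (0.95, 1.05)},
    }[phase]
    zlo, zhi = ranges["zoom"]
    return {
        "resized_hw": (new_h, new_w),
        "crop_yx": (crop_y, crop_x),
        "rotation": random.uniform(*ranges["rotation"]),
        "shear": random.uniform(*ranges["shear"]),
        # Zoom is drawn independently per axis, so aspect ratio can change.
        "zoom_h": random.uniform(zlo, zhi),
        "zoom_w": random.uniform(zlo, zhi),
    }
```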

pvoigtlaender commented 4 years ago

@aaaaannnnn11111 @siyueyu @gewenbin292 it seems that you wrote some training code. I think many people (including me!) would be interested in that. Could you please share that, maybe create a fork? (Sorry for posting this here, but it seems none of you provide any email address on github).

siyueyu commented 4 years ago

@seoungwugoh Sorry, but I can't. I failed to completely reproduce your work.

pvoigtlaender commented 4 years ago

@siyueyu can you maybe share your attempt/code anyway? It could be a good starting point for others/me to try to reproduce the results even if it does not fully work yet.

fangziyi0904 commented 4 years ago

@seoungwugoh I have some questions about training. The paper says you used 4 GPUs and a batch size of 4. Do you process the four samples at the same time and then update the key and value according to the time sequence?

aaaaannnnn11111 commented 4 years ago

@seoungwugoh Hi, did you test the pre-trained model (trained on COCO) on the DAVIS 2017 validation set? How were the results?

siyueyu commented 4 years ago

@pvoigtlaender Sorry, I haven't really tried reproducing the code. I only explored the idea of two-stage training. In my attempt, I found it was easy to overfit in the fine-tuning stage, so I think some parameters matter, but I haven't figured out which ones are most important.

seoungwugoh commented 4 years ago

@fangziyi0904 We used the DataParallel functionality in PyTorch, so the gradient is computed from the 4 samples in the batch. Backpropagation is done after all the frames are processed.

@npmhung During training, for DAVIS I start with 480p. For YouTube-VOS, I start with the original resolution (mostly 720p). Then frames are resized and cropped to 384x384 as mentioned above. After some data augmentation, training is done on 384x384 patches. Testing is done at 480p. For YouTube-VOS testing, we resize frames to a height of 480px, keeping the aspect ratio. Fine-tuning solely on DAVIS is not recommended, as the model can severely overfit to the data. You may stop training early by monitoring the curve.

@aaaaannnnn11111 Please see Table 4 in our paper.

chenz97 commented 4 years ago

Hi @seoungwugoh, you said you performed random resizing to [384, shorter_side_original_size]. Just to make sure: say I have an image of [480, 854]. Would you resize it to [384, 480], instead of making the shorter side 384 and keeping the aspect ratio to get [384, 384 / 480 * 854] = [384, 683]? Thank you!

lyxok1 commented 4 years ago

Hi, thank you for sharing the code. I still have a question about training samples: how many samples (3-frame triplets) do you use per video? From your previous answers, I understand that you sample only 3 frames per video. In main training on DAVIS 2017, for example, the training set contains 60 videos, so is each epoch only 60 x 3 = 180 frames? Will too many training samples from the training set cause severe overfitting? I followed your paper's main-training process on DAVIS 2017 and sampled 10 3-frame triplets per video, but I only achieve a J-score of 30.3 (the main-training result in the paper is 38.1). I hope you can give some detailed information about this, thank you.

OasisYang commented 4 years ago

@seoungwugoh Hi, thanks for sharing the code! I have some questions about how you simulate training videos from image datasets.

  1. How many datasets did you use? According to your paper you use many; did you actually use all of them? And which are used for background images and which for foreground objects?
  2. Is affine transformation alone enough? How do you simulate the deformation of objects? Looking forward to your reply, thanks so much!

npmhung commented 4 years ago

Hi,

Just a silly question.

When pretraining the model on the image dataset, how do you choose the validation set?

seoungwugoh commented 4 years ago

Hi all, sorry for the late reply. Here are my answers:

@chenz97 Sorry for the misunderstanding. The notation [a, b] means "from a to b". We first choose the new shorter-side size between 384px and the original shorter side, then resize the longer side accordingly, keeping the aspect ratio.
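To make that concrete for the [480, 854] example discussed above, here is a minimal sketch of this interpretation (my own helper, not the released code):

```python
import random

def random_resize_shape(h, w, min_short=384):
    """Pick a new shorter side in [min_short, original shorter side],
    then scale both sides by the same factor (aspect ratio kept)."""
    short = min(h, w)
    new_short = random.randint(min_short, max(min_short, short))
    scale = new_short / short
    return round(h * scale), round(w * scale)

# For a 480x854 frame, the shorter side lands anywhere in [384, 480]:
# e.g. choosing 384 gives roughly (384, 683), choosing 480 keeps (480, 854).
```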

@lyxok1 We randomly choose 3 frames from each video at every iteration, so over many iterations all frames in the video dataset can be used.

@OasisYang We used all the datasets listed in the paper. We found the affine transform is sufficient for pre-training purposes.

@npmhung We did not care much about validation during pre-training. We simply fit the model as much as possible, since it will be fine-tuned afterward.

pvoigtlaender commented 4 years ago

For maximum_skip on YouTube-VOS: if maximum_skip is 25, will you skip from 00000.jpg to 00025.jpg or to 00125.jpg (since there is only one image for every 5 frames)?

Also, do you use any other kinds of augmentation? I assume you use affine, flipping, skipping, resizing, and cropping. What else is there, for example changing the colors?

Thanks for your help!

chenz97 commented 4 years ago

Hi @seoungwugoh , I got it, thanks for your reply!

seoungwugoh commented 4 years ago

@pvoigtlaender In addition to the augmentations you mentioned, we also use a color shift that randomly multiplies the RGB values by a value in [0.97, 1.03]. That's all.
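A tiny sketch of that color shift on a numpy image. Whether the factor is drawn per channel or once for the whole image is my assumption; the exact implementation isn't released:

```python
import numpy as np

def random_color_shift(img, rng=None):
    """Multiply each RGB channel by an independent factor in [0.97, 1.03].

    img: float array in [0, 1] with shape (H, W, 3). Per-channel factors
    are an assumption; a single shared factor would also fit the text.
    """
    if rng is None:
        rng = np.random.default_rng()
    factors = rng.uniform(0.97, 1.03, size=3)
    return np.clip(img * factors, 0.0, 1.0)
```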

npmhung commented 4 years ago

In the fine-tuning phase, how do you define "an epoch"? That is, how many training samples are there in each epoch?

And did you freeze any part of the model while fine-tuning?

fangziyi0904 commented 4 years ago

Hello, I have a question about pre-training and main training: how many epochs do you train for, and do you train on all videos or randomly select some?

seoungwugoh commented 4 years ago

@npmhung @fangziyi0904 As our training strategy includes a lot of randomness, the meaning of "epoch" may differ depending on the implementation. In our implementation, 1 epoch is defined as looping over 3771 samples (the number of training videos from YouTube-VOS + 5*DAVIS). For each video, frames are randomly selected. Main training runs for 260 epochs (about 245K iterations with batch size 4), and pre-training runs for 500K iterations (at each iteration, images are randomly sampled from one of the image datasets). We do not freeze weights at the transition between training phases (BN is frozen before pre-training).
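As a sanity check, these numbers are self-consistent: 3771 is what you get from the 3471 YouTube-VOS training videos plus 5 copies of the 60 DAVIS training videos (the per-dataset video counts are my understanding of the standard splits):

```python
# Sanity-check the schedule quoted above.
samples_per_epoch = 3471 + 5 * 60   # YouTube-VOS videos + 5x DAVIS videos
assert samples_per_epoch == 3771

epochs, batch_size = 260, 4
iterations = samples_per_epoch * epochs // batch_size
print(iterations)  # 245115, i.e. "about 245K iterations"
```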

lyxok1 commented 4 years ago

Thank you for your reply~ I have another question: I see you apply random affine transformations in both main training and pre-training. I wonder how much these augmentations affect the final segmentation results. For example, if I do not apply them in main training, how much will the results drop?

Also, in main training, are all frames in the 3-frame tuple transformed with the same parameters, or is each frame transformed independently? Thank you~

seoungwugoh commented 4 years ago

@lyxok1 The random affine transforms are essential for pre-training to synthesize frames from a single image. I'm not sure about the accuracy without data augmentation, because I have never trained without it.

Yes, for main training, all the sampled frames are transformed with the same parameters.

pixelsmaker commented 4 years ago

Hi, I've only done main training on DAVIS 2017, without pre-training. When I test the model on the DAVIS 2017 validation set, I find that if there is only one object in the video the results can be great, but once there are multiple objects, only one tends to get segmented, while the remaining objects' J mean is sometimes close to 0. Does this mean the model is overfitting?

seoungwugoh commented 4 years ago

@pixelsmaker If the model works fine with one object, it should be okay for multiple objects. There may be a bug in how you combine the probabilities. Try segmenting the objects one by one and check the results. If you don't see reasonable results when you do this, then, of course, the model is poorly trained. I recommend using YouTube-VOS for training.
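For the "combining the probabilities" step, here is one simple way to merge independently segmented objects into a single label map. This is a debugging sketch of my own, not the paper's exact soft-aggregation formula:

```python
import numpy as np

def combine_object_probs(probs):
    """Merge per-object foreground probability maps into one label map.

    probs: array of shape (num_objects, H, W) with values in [0, 1],
    one map per independently segmented object. Background probability
    is taken as the product of (1 - p) over all objects; the final label
    is the argmax (0 = background, k = object k).
    """
    bg = np.prod(1.0 - probs, axis=0, keepdims=True)   # (1, H, W)
    stacked = np.concatenate([bg, probs], axis=0)      # (1 + N, H, W)
    return np.argmax(stacked, axis=0)
```

If one object "wins" everywhere after merging while its individual map looks fine, the bug is likely in this aggregation step rather than in the network.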

suruoxi commented 4 years ago

@seoungwugoh In main training, are the three selected frames cropped with the same parameters, or with different parameters individually?

huangyunmu commented 4 years ago

Hi @seoungwugoh, I am also trying to reproduce your work. I used COCO, MSRA, ECSSD, and VOC in pre-training and followed your instructions on the affine and crop parameters (currently without the color shift). However, my reproduced model only achieves around 50 J&F-mean on DAVIS 2017, which is far from the result in Table 4 (about 60). Checking the outputs, I found that only the first few frames are segmented well; the later frames are quite bad. That seems plausible, since in pre-training the model only ever processes a 3-frame training example.

So when you test the pre-trained model, is the setting the same as the test after main training, i.e., memorize every 5 frames and go through all the frames in DAVIS? And is there any additional procedure before testing a pre-train-only model? Thanks~

shoutOutYangJie commented 3 years ago
> 1. We only use the 3 sampled frames for training; that is why we sample frames from videos.
> 2. maximum_skip is the maximum possible skip; the actual skip is randomly selected in [0, maximum_skip]. In such cases, we simply choose a skip that does not go out of index.

Hi, does "maximum_skip" represent the interval between each pair of adjacent sampled frames, or the interval between the first and last sampled frames?