Closed emepetres closed 4 years ago
Hi,
I think one major drawback of our algorithm is that it cannot handle large shadows and reflections. When you take an image with a full standing pose, there is a significant shadow/reflection on the floor. In this case, you see the shadow of the leg on the floor. The network can still remove most of the shadow, but some stays with the shoe. The main reason is the lack of labeled training data: there is not a single training sample with GT alpha for standing humans. Thus we decided to train our method on human images/videos cropped at hip or knee length. This also has a profound impact on the input aspect ratio/resolution, so it can perform worse around hands/fingers when it sees a full standing image (which has a much bigger height than width). To test this, you can crop the image just above the knee and run the same algorithm to see if it can get the hands right.
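If it helps, here is a minimal sketch of that crop test (assuming NumPy/OpenCV-style H×W×C frames; `keep_fraction=0.6` is an illustrative guess for where "just above the knee" falls and would need tuning per video):

```python
import numpy as np

def crop_above_knee(frame: np.ndarray, keep_fraction: float = 0.6) -> np.ndarray:
    """Keep only the top `keep_fraction` of the frame, roughly cutting above the knee.

    `keep_fraction` is a hypothetical default; adjust it per subject/camera setup.
    """
    h = frame.shape[0]
    return frame[: int(h * keep_fraction)]

# Example: a 1920-row portrait frame keeps its top 1152 rows with keep_fraction=0.6.
frame = np.zeros((1920, 1080, 3), dtype=np.uint8)
cropped = crop_above_knee(frame)
print(cropped.shape)  # (1152, 1080, 3)
```

You would run the same inference script on the cropped frames and compare the hand regions against the full-frame result.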
As for self-supervised adversarial learning, it is always a bit tricky. Hard to say what made it worse. I would suggest choosing a target background as something similar. For example, you can take a video by moving your camera through your apartment, make sure to also include the floor, basically similar pose as our main images. If you do not have too many training images I would suggest mixing it with some of our training data that are somewhat similar.
In conclusion, I think the error around the hands will improve when you crop above the knee. The errors around the shoe are more of a drawback and I am not sure how to fix this easily. We are working on an extension where we expect to handle these issues :)
Thanks @senguptaumd , cropping above the knee indeed improved the results, mostly on hands. However I'm still getting some errors when I move towards the spotlight and the door that I have to my right (see image below).
To improve on these errors, I'm going to re-train the model using that video and another similar one, both cropped above the knee. Should I use the fixed-cam model as the training starting point, or start over from the Adobe model, training with your dataset plus my two videos?
As for generating new target backgrounds, I'm using spotlights to record the videos, so I imagine that recording a background while moving the camera around would not work in this case, because the lighting would change a lot with respect to the original. Am I right? I suppose that a target background video generated by moving the camera around is still valid when re-training a fixed-camera model.
The errors when you get close to the door are sort of expected. You are probably casting some shadows on the door, and the colors of the hand and the door are quite similar, which makes it hard for the network to make a good inference. As we have mentioned, ideally we expect you to stand at least 3-4 ft in front of the backdrop, such as walls.
Retraining the Adobe model does not make much sense since you are not adding any synthetic data. So there are a couple of options for fine-tuning: (a) Do not initialize G_real, and use G_adobe as the model trained on the Adobe data. However, since you have very few videos, this is unlikely to help. So the other options are: (b) Initialize G_real with the fixed-cam model and G_adobe with the model trained on the Adobe data; then it won't need as much data. (c) Do not initialize G_real, and use the fixed-cam model as G_adobe.
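The three initialization options can be summarized as a small lookup table; a sketch (the `"fixed-cam"` and `"adobe"` labels stand in for the released checkpoints, not actual file names):

```python
def finetune_init(option: str) -> dict:
    """Which checkpoint initializes each network for fine-tuning options (a)-(c).

    None means the network starts from scratch; the string labels are
    illustrative placeholders for the released checkpoint files.
    """
    plans = {
        "a": {"G_real": None,        "G_adobe": "adobe"},
        "b": {"G_real": "fixed-cam", "G_adobe": "adobe"},
        "c": {"G_real": None,        "G_adobe": "fixed-cam"},
    }
    return plans[option]

print(finetune_init("b"))  # {'G_real': 'fixed-cam', 'G_adobe': 'adobe'}
```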
Using a target background captured with a moving camera is fine even when training a fixed-camera model. The network is trained per frame, so the discriminator never looks at image sequences. All that matters is having similar-looking videos.
To be honest, I am not entirely sure how fine-tuning this network on your own captured videos will work when there are only 1-2 of them. Our real training data consisted of at least 20 videos, which gives the network enough diversity to learn from. Otherwise, with very few videos, the discriminator can become too strong. You can throw some of the videos we captured with a fixed camera into your training data. Let me know if this strategy works out.
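Mixing your captures with some of the released fixed-cam videos can be as simple as shuffling the two file lists together before building the dataset (the paths below are hypothetical):

```python
import random

def build_training_list(own_videos, provided_videos, seed=0):
    """Combine your own captures with released fixed-cam videos and shuffle,
    so the discriminator sees a diverse mix rather than long same-video runs."""
    combined = list(own_videos) + list(provided_videos)
    random.Random(seed).shuffle(combined)  # seeded for reproducibility
    return combined

mixed = build_training_list(
    ["my_vid1.mp4", "my_vid2.mp4"],
    ["fixedcam_01.mp4", "fixedcam_02.mp4", "fixedcam_03.mp4"],
)
print(len(mixed))  # 5
```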
Thanks @senguptaumd for your suggestions.
Regarding implementing option b), I am deducing the following:
- initialize `netG` in train_real_fixed.py with the fixed-cam model
- initialize `netB` in train_real_fixed.py with the Adobe-trained model

Is that correct, or am I missing something?
correct :)
Ok so I tried option b) with some videos, but the resulting model, while its output is quite good, is slightly worse than the original. The generator alpha loss didn't converge; it actually kept increasing during training.
Now I have more videos, and I'm going to try option a) mixing fifteen videos of mine with some of your original training dataset, and see what happens. I'll let you know the result, fingers crossed!
GANs are quite tricky to train. In most cases, the discriminator wins and the generator loss will go up. Try early stopping or taking the checkpoint before alpha loss becomes too high. Remember your alpha supervision is not GT, so the loss can go up a little.
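One way to implement that "take the checkpoint before the alpha loss gets too high" advice is simple early stopping over the per-epoch losses; a sketch (`patience=3` is an illustrative default, and the losses below are made-up pseudo-alpha losses, not GT-supervised ones):

```python
def early_stop_checkpoint(alpha_losses, patience=3):
    """Return the epoch index whose checkpoint to keep: the epoch with the
    lowest alpha loss, stopping the scan once the loss has not improved for
    `patience` consecutive epochs (the discriminator winning)."""
    best_idx, best_loss, since_improved = 0, float("inf"), 0
    for i, loss in enumerate(alpha_losses):
        if loss < best_loss:
            best_idx, best_loss, since_improved = i, loss, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break
    return best_idx

# Loss dips, then climbs as the discriminator dominates -> roll back to epoch 2.
print(early_stop_checkpoint([0.9, 0.7, 0.6, 0.65, 0.8, 1.1]))  # 2
```

Because the alpha supervision is not GT, a small rise is tolerable; `patience` controls how much rise you accept before rolling back.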
Also, if you are trying to solve the problem of strong shadows on the wall and floor, I suggest looking into using more synthesized data. That can help more.
Well, what I would like to have is a model more tolerant of videos with shadows cast on the background, especially between the legs and around the shoes.
For this case, I imagine using option a) with real data showing these shadows would be the best option, instead of using synthesized data, which wouldn't have them. Am I wrong?
Real data will have real shadows, but it will not have any GT labels (alpha matte), so you have to rely a lot on self-supervised learning with the GAN. Synthetic data will not have realistic shadows, but it will have GT supervision over the alpha matte, which can help the network be robust to some forms of shadows. A combination of both may be better for tackling the problem. We are also exploring how to be more robust w.r.t. shadows.
Great, I understand. My last training run, mixing 15 of my new videos with 10 from your original dataset, went well, but the resulting model's outputs are almost the same as yours.
So I guess I will have to use more synthesized data as suggested.
I'm closing this issue then, thanks!
Hi, first of all great work!
I'm testing your fixed-camera model on full body standing videos (with a fixed camera, obviously) and, although it is pretty good, there are still some errors on the feet, near the hands, and between the legs.
After reading your post on towardsdatascience, I've retrained your final model with a couple of these videos, but, contrary to what I expected, the resulting inference is slightly worse. I'm using the captured backgrounds provided.
According to the documentation, target backgrounds should have roughly similar lighting to the original videos. Could that be the cause? If so, how could I create backgrounds with lighting similar to the video I'm trying to process?
Alpha mask with original fixed-cam model:
Alpha mask with retrained model:
Original Image: