Closed natlachaman closed 7 years ago
@natlachaman Hi!
I'm not sure how exactly `min_size` and `max_size` correspond to the paper, because it really seems that they changed the choice of default boxes in one of the architecture revisions but didn't update the explanation in the paper. Definitely, `min_size` and `max_size` are in pixels, both in my implementation and in the original one (in the original Caffe implementation the authors also use the same values in pixels). To make a long story short: in the paper they describe one way of choosing priors, but they implement another, and that is what is ported here.
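To illustrate the gap between the two conventions, here is a rough sketch of how the paper's scale formula (with s_min = 0.2, s_max = 0.9) would translate into pixel `min_size`/`max_size` values for a 300x300 input. The exact pixel values used in the Caffe implementation differ from these, so this is only illustrative:

```python
# Sketch: the paper's scale formula s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1),
# converted to pixel sizes for a 300x300 input. The actual values in the ported
# Caffe implementation are different; this only shows the relationship between
# the scale convention (0..1) and the pixel convention.
def prior_sizes_px(m=6, s_min=0.2, s_max=0.9, img_size=300):
    scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
    sizes = []
    for k in range(m):
        # min_size of layer k is s_k * img_size; max_size is s_{k+1} * img_size
        # (the last layer's max_size is capped at the full image size here)
        min_size = scales[k] * img_size
        max_size = (scales[k + 1] if k + 1 < m else 1.0) * img_size
        sizes.append((round(min_size, 1), round(max_size, 1)))
    return sizes
```

For the default six feature maps this gives (60.0, 102.0) for the first layer up to (270.0, 300.0) for the last, which is the general shape of the pixel values you see in the port, even though the concrete numbers differ.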
I am not sure if SSD is able to detect such small objects. You probably need to change the architecture, e.g. take features from the conv3 block, in order to deal with such small objects (though it's an open question whether that helps). Moreover, you can check the latest revision of the paper, where the authors introduce a data augmentation technique that turned out to be useful for small objects.
Actually, this port is a little bit outdated, because the authors keep trying different architectures and adding improvements, but fortunately all of their current improvements can be added to my port quite easily if one wants to do it.
@rykov8 Thanks for your quick reply!
It just seems odd to me, since the ground truth coordinates are passed as relative coordinates, so they'd be independent of the image size, but then you need to tune `min_size` and `max_size` manually in pixels and scale them accordingly.
In any case, thanks for the clarification!
Also, I changed the model a bit. I shortened it some and tried to feed the `PriorBox` layer with lower output layers in the model. The thing is, the feature maps output by the lower conv layers are much larger than the deeper ones, and I end up with a crazy number of prior boxes, which hurts performance greatly.
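To see how quickly this blows up, here is a quick sketch of how the prior count grows with feature-map size. The layer sizes and boxes-per-cell below are the standard SSD300 configuration; the 75x75 conv3-level map is a hypothetical addition, not part of the original model:

```python
# Sketch: total number of priors for a set of square feature maps, each
# contributing `n` default boxes per cell. The first call uses the standard
# SSD300 layout (38, 19, 10, 5, 3, 1 with 4/6/6/6/4/4 boxes per cell);
# the second adds a hypothetical 75x75 conv3-level map with 4 boxes per cell.
def total_priors(feature_maps, boxes_per_cell):
    return sum(f * f * n for f, n in zip(feature_maps, boxes_per_cell))

ssd300 = total_priors([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4])
with_conv3 = total_priors([75, 38, 19, 10, 5, 3, 1], [4, 4, 6, 6, 6, 4, 4])
```

The standard layout already gives 8732 priors, and the single extra 75x75 map alone adds 22500 more, which matches the slowdown described above.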
I'll see what I can do.
A bit off topic, but also a very quick question: did you use PASCAL VOC2007 for the results you uploaded? I'm trying to reproduce your results but haven't succeeded so far. I checked the names of the image files from what I have and they don't seem to match, so I was wondering if I just got that part wrong. I'm getting the data from the official PASCAL site http://host.robots.ox.ac.uk:8080/pascal/VOC/voc2007/
Thanks a ton!
@natlachaman relative coordinates are good because you can resize the input image from, say, 640x480 to 300x300 without rescaling the bounding boxes. On the other hand, the input image to the net is always 300x300 (you may change it, but the architecture is designed for this input; the authors also have one for ~500x500 pictures, and it is a bit different), so that is probably why it is OK to choose the sizes of the priors in pixels. However, you are right: it probably would have been better to leave the sizes of the priors as scales, but I followed the original implementation and didn't add my own improvements.
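As an illustration of the point about relative coordinates (a minimal sketch, not code from the port): a box stored as fractions of width and height needs no change when the image is resized, and converting back to pixels for any target size is a single multiplication.

```python
# Sketch: relative box coordinates are invariant under resizing.
# Boxes are (xmin, ymin, xmax, ymax); relative values are in [0, 1].
def to_relative(box_px, img_w, img_h):
    xmin, ymin, xmax, ymax = box_px
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)

def to_pixels(box_rel, img_w, img_h):
    xmin, ymin, xmax, ymax = box_rel
    return (xmin * img_w, ymin * img_h, xmax * img_w, ymax * img_h)
```

So a box annotated on a 640x480 image converts once to relative form and can be projected onto the 300x300 network input without touching the annotation again.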
As for your last question: what results do you mean? If you are speaking about the training example, I used my own small dataset, which is very different from PASCAL. If you are speaking about the weights, they are ported from the original Caffe implementation, but as stated in #7, I didn't check them on PASCAL.
@rykov8 Oh! I didn't know the image size had that effect on the network. In the paper they mention that they got better performance with larger images, but I didn't know they developed different architectures for different image sizes. Good to know!
As for the PASCAL question: I missed that! (#7) I thought you trained it on PASCAL VOC2007. In any case, the model you implemented follows the original implementation, so in theory it should work relatively well. I used the same data format as you did for your own dataset and resized the images to 300x300, but I still get really strange behaviour: the error shoots up like crazy halfway through the first epoch and I can't figure out why.
Thanks for your time, always very helpful :)
@natlachaman I'm not sure that the architectures differ much for different input sizes (the idea is the same for sure), but if I remember correctly, the net for 500x500 images is a little bit deeper. Anyway, you can check their prototxt files to understand the architectures. Moreover, as I mentioned, in the third revision of the paper they changed the architecture for 300x300 pictures a little.
As for the error, do you use Adam? It always helps me to throw away SGD and use Adam, because I don't have the magic skill of tuning the learning rate.
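For intuition, here is a minimal sketch of a single Adam update (standard defaults from the Adam paper; this is not code from the port, where you would just pass the optimizer to `model.compile`). It shows why Adam is less sensitive to the learning rate than plain SGD: the step is normalized by running moment estimates of the gradient, so its magnitude is roughly bounded by `lr` regardless of gradient scale.

```python
import math

# Sketch: one Adam parameter update with the standard defaults.
# m and v are running estimates of the gradient's first and second moments;
# t is the (1-based) step count used for bias correction.
def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

With a huge gradient of 10 on the first step, the update is still only about `lr`, which is why a roughly-chosen learning rate tends to work.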
I also have a small question for you. Do you happen to have an implementation of the mAP metric as it is computed in PASCAL? I am implementing it myself (because I failed to find an existing implementation, which is quite strange), but I'm too lazy to finish. If you have one, feel free to make a pull request or post a link to someone's implementation.
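For anyone picking this up, here is a rough sketch of the 11-point interpolated average precision used by the VOC2007 protocol. It assumes the precision/recall arrays have already been computed by ranking detections by confidence and matching them to ground truth at IoU >= 0.5; it is not the full devkit evaluation, just the final AP step.

```python
# Sketch: PASCAL VOC 2007-style 11-point interpolated AP.
# `recall` and `precision` are parallel lists, one entry per ranked detection.
def voc_ap_11pt(recall, precision):
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:  # recall thresholds 0.0, 0.1, ..., 1.0
        # interpolated precision: max precision over all points with recall >= t
        p = max((p for r, p in zip(recall, precision) if r >= t), default=0.0)
        ap += p / 11.0
    return ap
```

mAP is then the mean of this value over the 20 VOC classes. Note that later VOC years switched to an all-points area-under-curve computation, so the two variants give slightly different numbers.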
@rykov8 I use Adam or RMSprop usually, for the same reason. No magic powers so far, hehe.
As for your question: no, I don't. Implementing mAP is definitely on my list. I started working with SSD last week, on and off, so I was mainly focused on getting it to work on my dataset first. But I'll make a pull request for sure whenever I have mAP implemented (and if I get further with SSD), or refer you to other work if I stumble upon something interesting.
Thanks again for your help!
@natlachaman you are welcome :)
Hi again!
I'm trying to use your implementation on a different problem than the PASCAL VOC dataset suggests. In my case, I need to identify much smaller objects (ground truth boxes are 50x50 px in 768x1024 images). From what I've seen so far, `min_size` and `max_size` determine the dimensions of the default boxes. Are these parameters implemented to be in pixels? Or what are they? In the paper they talk about scales, with values ranging from 0 to 1, and I'm not sure if you implemented a different version of it that is conceptually the same, or if I'm mixing up concepts. Thanks in advance!