sergeyk / vislab

Set of modules and datasets for visual recognition.
http://sergeykarayev.com/vislab/

Max Across Image Crops? [Just a Question - Not a Bug] #14

Closed: mgh1 closed this issue 10 years ago

mgh1 commented 10 years ago

Hi Sergey,

A basic question here. In /vislab/features/misc.py, line 60+, you wrote:

    # First, run the network fully forward by calling predict.
    # Then, for whatever blob we want, max across image crops.
    net.predict([caffe.io.load_image(image_filename)])
    feats.append(net.blobs[layer].data.max(0).flatten())

I'd like to understand why you chose the "max" crop. First, what does it mean to take the "max across image crops"? The Caffe feature visualization tutorial indicates that one can just select the center crop, and in other discussions people have mentioned that one should "average the crops." Would you mind comparing your "max across image crops" with these alternatives: why is it the best way to go, and what are the pros and cons?

Thanks!

sergeyk commented 10 years ago

I don't know if it's the best way to go for every task (you'll have to look at the accuracy numbers for the different choices), but the intuition is as follows: taking the max across crops lets the feature represent the image with all the parts that are in it, since a dimension that fires strongly in any one of the crops shows up in the pooled vector.
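For comparison, here is a minimal NumPy sketch (not from vislab) of the three pooling choices, assuming the per-crop features have already been pulled out into a hypothetical (num_crops, dim) array like the one net.blobs[layer].data holds:

    import numpy as np

    # Stand-in for net.blobs[layer].data:
    # e.g. 10 oversampled crops, each a 4096-dim fc6 feature.
    crop_feats = np.random.rand(10, 4096)

    max_pooled = crop_feats.max(axis=0)    # per-dimension max across crops (vislab's choice)
    mean_pooled = crop_feats.mean(axis=0)  # "average the crops"
    single_crop = crop_feats[0]            # use one crop only (e.g. the tutorial's center crop;
                                           # which row is the center depends on the crop order)

    print(max_pooled.shape, mean_pooled.shape, single_crop.shape)  # all (4096,)

All three give a vector of the same length; they differ only in how much each individual crop is allowed to contribute.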


mgh1 commented 10 years ago

Thank you Sergey! Just a quick follow-up which demonstrates my naivety. I am trying to implement something like neural codes and to see how net.blobs[layer].data.max(0).flatten() can play a role. For neural codes on a standard AlexNet, I'd be using a lower layer like FC6, which means I'm probably dealing with vectors of 4096 components. It seems that the .data field may contain, say, 10 vectors (one per crop, each 4096 in size).

So how does taking .max(0) of the data and then flattening it into one vector relate to what you wrote about this code helping to "represent the image with all the parts that are in it"? I suppose I do not know what .max(0) is really doing here. Is it just taking the max of the 4096 values within each of the 10 crops, so that flatten() would end up giving an array of only 10 values? Sorry for the detailed question; I'm really just trying to learn here. I'd also like to take this opportunity to thank you for your strong contributions to the field and community.

sergeyk commented 10 years ago

What you're doing is taking the max across the 10 vectors for each dimension, so you'll end up with a 4096-length vector. Try putting a 'from IPython import embed; embed()' in the code where that max is done, and then running whatever code you run to extract your features. That will drop you into an IPython shell at that line, and you can look at the shapes of all the arrays.
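A quick sketch of the shapes involved (my own numbers, assuming fc6 and Caffe's default 10-crop oversampling):

    import numpy as np

    # Stand-in for net.blobs['fc6'].data after net.predict([...]):
    # 10 crops, each with a 4096-dim fc6 activation.
    data = np.random.rand(10, 4096)

    pooled = data.max(0)   # max over axis 0, the crop axis: one value per feature dimension
    print(pooled.shape)    # (4096,)

    # .flatten() just guarantees a flat 1-D vector; if the blob carries extra
    # singleton dimensions (e.g. (10, 4096, 1, 1)), max(0) gives (4096, 1, 1)
    # and flatten() collapses it to (4096,).
    print(data.max(0).flatten().shape)  # (4096,)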


mgh1 commented 10 years ago

Thank you Sergey, this makes sense and thanks for the debugging technique.