rgeirhos / generalisation-humans-DNNs

Data, code & materials from the paper "Generalisation in humans and deep neural networks" (NeurIPS 2018)
http://papers.nips.cc/paper/7982-generalisation-in-humans-and-deep-neural-networks.pdf

(Sub)Set of images used for Human psychophysical trials #2

Closed: ArturoDeza closed this issue 5 years ago

ArturoDeza commented 5 years ago

Hi Robert, is there a subset of the ~200k ImageNet-16 images that was used for the psychophysical testing on humans under the different distortions? For example, 1000 images per class, all degraded in the same way at each distortion strength, so that the 3 observers saw the same stimuli? Or did you randomly sample from the total image population, apply a different distortion at a specific strength, and average the accuracy across all the randomly sampled images in that group (in a bootstrap-like fashion)?

rgeirhos commented 5 years ago

Hey Arturo, from the pool of 16-class-ImageNet images, we randomly selected a balanced number of images (the same number for each of the 16 classes) and then distorted those images. Sampling was done independently for each observer, with repetition across observers, i.e. different observers usually saw different images. The same image was never seen more than once per observer; and by "same image" I mean the same base image, i.e. an observer also never saw the same image distorted at different distortion strengths. Accuracy was then computed as the performance at a given distortion strength across observers. Does this answer your question?
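In Python, the per-observer sampling looks roughly like this (a minimal sketch for illustration, not our actual experiment code; `pool_by_class`, `imgs_per_cell`, and `distortion_levels` are placeholder names):

```python
import random

def sample_observer_trials(pool_by_class, distortion_levels, imgs_per_cell, rng):
    """Balanced trial list for one observer: the same number of base
    images per class and per distortion strength, with no base image
    repeated within the observer."""
    trials = []
    for cls, images in pool_by_class.items():
        # Draw enough distinct base images for all strengths at once, so
        # this observer never sees one image at two distortion strengths.
        chosen = rng.sample(images, imgs_per_cell * len(distortion_levels))
        for i, img in enumerate(chosen):
            trials.append((img, cls, distortion_levels[i // imgs_per_cell]))
    rng.shuffle(trials)
    return trials

# Each observer gets an independent draw, e.g.:
# trials = sample_observer_trials(pool, levels, 10, random.Random(observer_id))
```

Accuracy at a given distortion strength is then just the fraction of correct responses, pooled across observers, at that strength.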

ArturoDeza commented 5 years ago

I see, sounds good! I wanted to check because we're running some experiments with your ImageNet-16 dataset, Robert! I guess this leads me to a follow-up question: to compare humans and machines in the curves (Figure 3 from your NeurIPS paper and Figure 6 from your ICLR paper), were the images used for the machines exactly the same as the ones the humans saw? Or should I be plotting the output over all the images from the training set to get the upper bounds and comparing them to the sampled ones the humans saw in the same plot?

rgeirhos commented 5 years ago

Ok cool. The exact images used for CNN testing can be found here: https://github.com/rgeirhos/generalisation-humans-DNNs/tree/master/raw-data/TF

We didn't show the DNNs the exact same images (human observers also didn't necessarily see the same images as one another); instead, we sampled from the pool for the DNNs just as for any observer. This is the description from the appendix of our NeurIPS paper:

When showing accuracy in any of the plots, the error bars provided report the range of the data observed for different observers (not the often-shown S.E. of the means, which would be much smaller). To produce a comparable measure of uncertainty for the DNNs, we computed seven runs with different subsets of the data, with each run consisting of the same number of images per category and condition that a single human observer was exposed to, and report the range of accuracies observed in these runs. Seven runs are the maximum possible number of runs without ever showing an image to a DNN more than once per experiment.
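A rough Python sketch of that procedure (again not the code we actually used; `evaluate_dnn` and `images_per_run` are placeholder names):

```python
import random

NUM_RUNS = 7  # the maximum number of disjoint runs without image reuse

def dnn_accuracy_range(images, labels, images_per_run, evaluate_dnn, seed=0):
    """Split the pool into NUM_RUNS disjoint subsets, each the size of one
    human observer's session, and return the min/max accuracy across runs
    (the range plotted as the DNN error bars)."""
    assert len(images) >= NUM_RUNS * images_per_run
    rng = random.Random(seed)
    order = list(range(len(images)))
    rng.shuffle(order)
    accuracies = []
    for run in range(NUM_RUNS):
        idx = order[run * images_per_run:(run + 1) * images_per_run]
        correct = sum(evaluate_dnn(images[i]) == labels[i] for i in idx)
        accuracies.append(correct / len(idx))
    return min(accuracies), max(accuracies)
```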

ArturoDeza commented 5 years ago

Thanks a lot Robert!