
Inception-ResNet-v2 implementation does not match the publication it references #11227

Open jiversivers opened 5 months ago


The implementation of Inception-ResNet-v2 here, and the one perpetuated in Keras, is riddled with inconsistencies with the paper it references. Perhaps I am missing something and someone here can help me understand what, but after several hours reconstructing the coded model, I cannot reconcile it with the paper. To further complicate the matter, the paper is not even consistent with itself. Again, maybe I am missing something, but here is what I have found and how I came to this conclusion:

UPDATE: I finally found another mention of this issue (#819), in which @ghost comments that the Google team released a "new version" of the model and @ProgramItUp posts a link to the Google Blog post. The blog post does properly describe what is seen in the code, with the exception of the branch after the second Inception-ResNet block, which is poorly defined in the code and/or ambiguous in the post. The post still claims that "[t]he full details of the model are in our arXiv preprint." Perhaps the most appropriate solution would be to simply remove the reference to the paper from the docstring and replace it with this blog post (which, I suspect, is so old that it will never be updated to remove its own paper reference, though it ought to be, for clarity).

Stem

The stem closely matches the Inception-ResNet-v1 stem at first, but where that stem ends in a 3x3 conv (256, stride 2), the code instead uses a 3x3 max pool that feeds into a branched structure (mixed_5a) matching nothing in the paper. This branched structure is a pure Inception module with a final output depth of 320.
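For concreteness, here is a minimal sketch of that branched module as I read it out of the Keras code; the branch widths (96, 48→64, 64→96→96, avg-pool→64) are my reading and worth double-checking against the source:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, filters, kernel):
    return layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)

def mixed_5a(x):
    # A pure Inception module: four parallel branches concatenated on channels.
    b0 = conv(x, 96, 1)
    b1 = conv(conv(x, 48, 1), 64, 5)
    b2 = conv(conv(conv(x, 64, 1), 96, 3), 96, 3)
    b3 = conv(layers.AveragePooling2D(3, strides=1, padding='same')(x), 64, 1)
    return layers.Concatenate()([b0, b1, b2, b3])  # 96 + 64 + 96 + 64 = 320
```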

Inception-ResNet-A

This block, per the two proposed schemas (Fig. 9 and Fig. 15), should be repeated either 4 or 5 times. In the code (block35), it is repeated 10 times! The block architecture is definitely passing residuals, so it is not pure Inception. In fact, it does match the structure given for Inception-ResNet-v2 in the paper, with one important exception: the input/output depth is different. This inconsistency is ultimately forced by the different output depth of the stem. The final output depth of this block (by necessity, because the block is repeated) is identical to the output depth of the preceding block: 320 in the code and 384 in the paper.
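A sketch of the coded block, again assuming the branch widths I read from the Keras code (32, 32→32, 32→48→64). The detail that matters is the trailing linear 1x1 conv: it projects the concatenated branches back to whatever depth the input has, which is exactly how the block inherits 320 from the stem in the code but 384 in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, filters, kernel):
    return layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)

def block35(x, scale=0.17):
    # Inception-ResNet-A: the concat of three conv branches is linearly
    # projected back to the input depth, then added as a scaled residual,
    # so output depth == input depth by construction.
    in_depth = x.shape[-1]  # 320 in the code, 384 in the paper
    b0 = conv(x, 32, 1)
    b1 = conv(conv(x, 32, 1), 32, 3)
    b2 = conv(conv(conv(x, 32, 1), 48, 3), 64, 3)
    up = layers.Conv2D(in_depth, 1, activation=None)(layers.Concatenate()([b0, b1, b2]))
    out = layers.Add()([x, layers.Rescaling(scale)(up)])
    return layers.Activation('relu')(out)
```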

Reduction-A

Ignoring depth differences, the structure of this block in the code (mixed_6a) does match Inception-ResNet-v2 as described by the paper (Fig. 7 and Table 1), where k = 256, l = 256, m = 384, n = 384. The code outputs a depth of 320 + m + n = 1088. The paper's output depth would be 384 + m + n = 1152. Remember those numbers, as they define the input depth of our next block.
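Spelling out the arithmetic (only m and n reach the output; k and l are internal to the three-conv branch, and the max-pool branch passes the input depth through):

```python
# Reduction-A output depth = input depth (max-pool branch) + n + m.
k, l, m, n = 256, 256, 384, 384
code_out  = 320 + m + n  # 1088: the code's 320-deep A blocks carry through
paper_out = 384 + m + n  # 1152: the paper's 384-deep A blocks carry through
```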

Inception-ResNet-B

The structure of this block in the code (block17) generally matches the structure in the paper (Fig. 17), but things start to get really messy with filter counts here. This block is repeated -- the paper says 10 times for Inception-ResNet-v2, but the code does it 20 times -- so the output of the block must be the same depth as the input, which in turn must match the output of the preceding block. We remember that the output from the code worked out to 1088, while the output from the paper should be 1152. Well, the code is consistent: this block takes and gives a depth of 1088. The paper, however, is now inconsistent with itself! This block in the paper has an output depth of 1154. That cannot work given that the output depth of Reduction-A is 1152.
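The same projection trick is what makes the code self-consistent here. A sketch (widths 192 and 128→160→192, again as I read them from the Keras code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, filters, kernel):
    return layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)

def block17(x, scale=0.1):
    # Inception-ResNet-B: a repeated residual block must return the depth
    # it receives, and the final 1x1 projection guarantees exactly that.
    in_depth = x.shape[-1]  # 1088 in the code; the paper implies 1152 in but 1154 out
    b0 = conv(x, 192, 1)
    b1 = conv(conv(conv(x, 128, 1), 160, (1, 7)), 192, (7, 1))
    up = layers.Conv2D(in_depth, 1, activation=None)(layers.Concatenate()([b0, b1]))
    out = layers.Add()([x, layers.Rescaling(scale)(up)])
    return layers.Activation('relu')(out)
```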

Reduction-B

Again ignoring depth differences, the code structure (mixed_7a) matches the paper structure (Fig. 18) nicely, so we can track depths again. The depths of the convolutional branches in the block are 384, 288, and 320, which total 992. The remaining branch is a max pooling layer, so its depth will match that of the input. In the paper, this input is either 1152 or 1154; in the code it is 1088. By necessity, the output depth here must equal the next block's input depth and, because the next block is repeated, its output depth must equal its input. That depth may be (according to the paper) 1152 + 992 = 2144 or 1154 + 992 = 2146, or (from the code) 1088 + 992 = 2080.
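The three candidates, worked out:

```python
conv_branches = 384 + 288 + 320  # = 992 across the three conv branches
pool_in = {"paper, via Reduction-A": 1152,
           "paper, via Fig. 17":     1154,
           "code (mixed_7a)":        1088}
for source, depth in pool_in.items():
    print(source, "->", depth + conv_branches)  # 2144, 2146, 2080
```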

Inception-ResNet-C

In the code (block8), the structure here matches the paper in the same way as its predecessors: the repeat count is off by a factor of 2 and the final output depth is inconsistent. Aside from the stem, the code is at least consistently inconsistent. The paper, though, gets worse from here. Remember, the output of this layer must be either 2144 or 2146 to be consistent with any part of the paper. It is not. Figure 19 clearly shows a final 1x1 conv with 2048 channels. This output matches neither of the paper's own options nor the code.
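The contradiction, as a one-line check:

```python
# A repeated residual block needs output depth == input depth, so the depth
# entering Inception-ResNet-C must equal Reduction-B's output depth.
reduction_b_candidates = {2144, 2146}  # the paper's two possible outputs
fig19_final = 2048                     # width of the final 1x1 conv in Fig. 19
assert fig19_final not in reduction_b_candidates  # the paper contradicts itself
```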

What's left to do but throw my hands up and say, "I don't know"? On top of it all, the caption for Figure 18 mentions Inception-ResNet-v1 when it must be referring to Inception-ResNet-v2, just to add to the mess. Initially, my assumption was that the paper was the ground truth, but now I am not so sure. Is it possible that the paper was written about the wrong version of the final best-performing model(s)? After all, the code does work and is self-consistent. Perhaps it's the paper that's actually not consistent with the original code. By the authors' own words, they "...tried several versions of the residual version of Inception." Could they have gotten modules mixed up in the final publication? It certainly wouldn't be crazy, given the number of fine details they had to keep track of across "several" models. To quote a Season 46 Survivor contestant, "Last time I checked, several is seven," and seven models (more or less, because I know better than he did) is a lot when each model is this deep.