Hello, I noticed that the HNeRV in your code seems to differ slightly from the architecture in the paper.
In the paper, the pipeline first uses a ConvNeXt encoder, then a learning-based embedding, and finally the decoder, as shown in the following figure.
![Uploading 7ac48adaab7dceeab45206a4e618c8c.png…]()
But in the code, I found that you use an optional positional encoding for the embeddings, then the ConvNeXt encoder, followed by the decoder, as shown in the following figure. Why is there the line `img_embed = self.encoder(input)`? And which part of the code implements the learning-based small embeddings?
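For reference, here is a shape-level sketch of the pipeline as I understand it from the paper (frame → ConvNeXt encoder → small content-adaptive embedding → decoder → reconstructed frame). All names here (`encode`, `decode`) and the exact shapes are hypothetical, just to make the question concrete; they are not the repository's actual API:

```python
import numpy as np

def encode(frame):
    # Stand-in for the ConvNeXt-style encoder: compress a (H, W, 3) frame
    # into a tiny content-adaptive embedding (the "learning-based small
    # embedding" from the paper). Shapes here are illustrative only.
    H, W, _ = frame.shape
    return np.zeros((H // 64, W // 64, 16))

def decode(embed, H=128, W=256):
    # Stand-in for the NeRV-style decoder: upsample the small embedding
    # back to a full-resolution frame.
    return np.zeros((H, W, 3))

frame = np.zeros((128, 256, 3))
img_embed = encode(frame)   # this is what I expected img_embed = self.encoder(input) to do
recon = decode(img_embed)
print(img_embed.shape, recon.shape)
```

My question is whether `self.encoder(input)` in the code corresponds to the `encode` step in this picture, and where the optional positional encoding fits relative to it.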