different setup of input_hint_block compared to paper?

Hi, I noticed that the implementation of the tiny network that converts control images into feature space differs from the structure mentioned in the paper: "In particular, we use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, using 16, 32, 64, 128, channels respectively". The corresponding implementation should be here, right? (Correct me if I am wrong.) https://github.com/lllyasviel/ControlNet/blob/ed85cd1e25a5ed592f7d8178495b4483de0331bf/cldm/cldm.py#L147-L163
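To make the comparison concrete, here is a rough sketch of the shape arithmetic for the encoder as the quoted sentence describes it. It is not the repository's code. The padding of 1, the 3-channel hint image, the 512 × 512 input size, and all names (`ConvSpec`, `paper_encoder_specs`, `trace_shapes`) are my own assumptions for illustration; the quoted sentence only fixes the kernel size, stride, and channel widths.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ConvSpec:
    """One 2D convolution layer, described only by its hyperparameters."""
    in_channels: int
    out_channels: int
    kernel: int = 4    # 4 x 4 kernels, as quoted from the paper
    stride: int = 2    # 2 x 2 strides, as quoted from the paper
    padding: int = 1   # ASSUMPTION: not stated in the quoted sentence

    def out_size(self, size: int) -> int:
        # Standard convolution output-size formula (dilation 1).
        return (size + 2 * self.padding - self.kernel) // self.stride + 1

    def num_params(self) -> int:
        # Weights plus one bias per output channel.
        weights = self.kernel * self.kernel * self.in_channels * self.out_channels
        return weights + self.out_channels


def paper_encoder_specs(hint_channels: int = 3) -> List[ConvSpec]:
    """Four layers with 16, 32, 64, 128 output channels (hint_channels is assumed)."""
    specs = []
    in_ch = hint_channels
    for out_ch in (16, 32, 64, 128):
        specs.append(ConvSpec(in_ch, out_ch))
        in_ch = out_ch
    return specs


def trace_shapes(specs: List[ConvSpec], size: int) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each layer for a square input."""
    shapes = []
    for spec in specs:
        size = spec.out_size(size)
        shapes.append((spec.out_channels, size, size))
    return shapes


if __name__ == "__main__":
    specs = paper_encoder_specs()
    for shape in trace_shapes(specs, 512):  # ASSUMPTION: 512 x 512 control image
        print(shape)
    print("parameters:", sum(s.num_params() for s in specs))
```

Under these assumptions each layer halves the spatial size, so a 512 × 512 input ends at 128 channels and 32 × 32 after the fourth layer. The same bookkeeping can be applied to the layers at the linked lines to see where the two structures diverge.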

lllyasviel/ControlNet, issue #698