See response email coming shortly. Basically, there were a couple of issues leading to the problems you saw:
1. You had `num_dims=18` hard-coded into the kernel, but it needs to match the dimensionality of the data (see the model sketch below).
2. In general, because you don't learn the inducing point locations with SKIP, we recommend much higher learning rates and fewer training iterations. The experiments in the paper were run for at most 30 iterations at a learning rate of 0.1 (see the training loop sketch below).
3. I also included a comment about how to initialize hyperparameter values (an illustrative snippet follows the training loop below).
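To make point 1 concrete, here is a minimal sketch of a SKIP model that derives `num_dims` from the training data instead of hard-coding it. The kernel composition mirrors the standard GPyTorch SKIP setup (a `ProductStructureKernel` over a one-dimensional `GridInterpolationKernel`); names like `SKIPGPModel` and `train_x` are placeholders:

```python
import torch
import gpytorch

class SKIPGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Derive the dimension count from the data rather than
        # hard-coding a value like 18.
        num_dims = train_x.size(-1)
        self.covar_module = gpytorch.kernels.ProductStructureKernel(
            gpytorch.kernels.ScaleKernel(
                gpytorch.kernels.GridInterpolationKernel(
                    gpytorch.kernels.RBFKernel(), grid_size=100, num_dims=1
                )
            ),
            num_dims=num_dims,
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```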
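For point 2, a sketch of a training loop at the recommended settings, assuming the model above and placeholder data `train_x`/`train_y` (learning rate 0.1, at most 30 iterations):

```python
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SKIPGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()

# Higher learning rate and fewer iterations than a typical exact GP run,
# since the inducing point locations are not learned with SKIP.
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(30):  # at most 30 iterations, per the paper's experiments
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()
```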
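And for point 3, one way to set initial hyperparameter values before training. The attribute paths follow the kernel nesting in the model sketch above, and the particular numbers are illustrative, not recommendations:

```python
# Illustrative starting values only; tune for your data.
model.likelihood.noise = 0.1
# covar_module is ProductStructureKernel -> ScaleKernel -> GridInterpolationKernel -> RBFKernel
model.covar_module.base_kernel.outputscale = 1.0
model.covar_module.base_kernel.base_kernel.base_kernel.lengthscale = 0.5
```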
For reference, the reason the default value is 100 is that we use the same root decomposition elsewhere (e.g., for sampling), where we explicitly want to run to very fine convergence, not just get a kernel approximation.
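If (as I'm assuming) the setting in question is `gpytorch.settings.max_root_decomposition_size`, whose default is 100, a coarser value can be used during training where only a kernel approximation is needed, for example:

```python
# Assumption: the "100" above refers to gpytorch.settings.max_root_decomposition_size.
# A smaller value gives a coarser, faster approximation, which is fine for training
# but not for uses (like sampling) that need fine convergence.
with gpytorch.settings.max_root_decomposition_size(30):
    output = model(train_x)
    loss = -mll(output, train_y)
```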