probml / pyprobml

Python code for "Probabilistic Machine learning" book by Kevin Murphy
MIT License
6.54k stars 1.53k forks

Resolve current workflow errors #759

Open patel-zeel opened 2 years ago

patel-zeel commented 2 years ago

Solve the current errors in the notebooks mentioned in #739.

Currently, 25 notebooks are failing due to various minor issues.

karm-patel commented 2 years ago

Okay, I'll take this

karm-patel commented 2 years ago

Fixed Notebooks

| Notebook | fig_no | Error | Fix |
| --- | --- | --- | --- |
| multi_collinear_legs_numpyro.ipynb | 11.23, 11.24 | TypeError: TruncatedNormal() takes from 0 to 2 positional arguments but 3 were given | Replaced dist.TruncatedNormal(0, 0, 100) with dist.TruncatedNormal(loc=0, low=0, high=100) |
| linregRbfDemo.ipynb | 13.22 | FileNotFoundError: [Errno 2] No such file or directory: 'figures/figures/rbfDemoALL.pdf' | Replaced savefig(figures/linregRbfDemo.pdf) with savefig(linregRbfDemo.pdf) |
| manifold_digits_sklearn.ipynb | 20.30, 20.31, 20.33, 20.36, 20.37, 20.38, 20.41 | AttributeError: module 'umap' has no attribute 'UMAP' | 1. Replaced pip install umap with pip install umap-learn. 2. Removed the custom savefig() function and added from probml_utils import savefig. 3. Some arguments of manifold.LocallyLinearEmbedding() and manifold.Isomap() were deprecated; I updated them by referring to the latest documentation. |
| manifold_swiss_sklearn.ipynb | 20.30, 20.31, 20.33, 20.36, 20.37, 20.38, 20.41 | TypeError: `__init__()` takes 1 positional argument but 3 were given | Updated the arguments of manifold.LocallyLinearEmbedding() |

karm-patel commented 2 years ago

  1. Kernel not responding (5 notebooks)
     - book1/01/mnist_viz_tf (log)
     - book1/02/robust_pdf_plot (log)
     - book1/04/postDensityIntervals (log)
     - book1/04/biasVarModelComplexity3 (log)
     - book1/07/gaussEvec (log)

  2. figures not found in savefig("figures/...") (8 notebooks)
     - book1/16/parzen_window_demo2 (log)
     - book1/17/svm_classifier_feature_scaling (log)
     - book1/18/rf_demo_2d (log)
     - book1/20/pca_projected_variance (log)
     - book1/20/kpcaScholkopf (log)
     - book1/20/pcaImageDemo (log)
     - book1/20/vae_mnist_conv_lightning (log)
     - book1/21/kmeans_silhouette (log)

  3. General Errors (15 notebooks)
     - xdg-open: book1/19/hbayes_maml (log)
     - umap: book1/20/ae_mnist_conv (log)
     - test_ndx: book1/13/mlp_mnist_tf (log)
     - Spams: book1/11/groupLassoDemo (log)
     - Normal, Shape: book1/08/sgd_comparison (log)
     - Normal, Shape: book1/13/mlp_1d_regression_hetero_tfp (log)
     - No module named 'tensorflow': book2/20/LVAE (log)
     - Module Not found error: book1/01/text_preproc_jax (log)
     - KeyError: f_pred: book1/17/gp_classify_iris_1d_pymc3 (log)
     - invalid syntax: book1/17/gprDemoArd (log)
     - HTTP Error 500: book1/14/cifar10_cnn_lightning (log)
     - HTTP Error 500: book1/15/kernel_regression_attention (log)
     - HTTP Error 404: book1/21/gmm_identifiability_pymc3 (log)
     - Axis limits cannot be NaN or Inf: book1/08/lrschedule_tf (log)
     - OSError: Reader needs file name: book1/20/pcaStandardization (log)

  4. d2l-notebooks (19 notebooks)
     - book1/14/batchnorm_torch (log)
     - book1/14/resnet_torch (log)
     - book1/14/densenet_jax (log)
     - book1/14/batchnorm_jax (log)
     - book1/14/resnet_jax (log)
     - book1/15/cnn1d_sentiment_jax (log)
     - book1/15/rnn_sentiment_jax (log)
     - book1/15/rnn_torch (log)
     - book1/15/positional_encoding_jax (log)
     - book1/15/entailment_attention_mlp_jax (log)
     - book1/15/rnn_sentiment_torch (log)
     - book1/15/bert_torch (log)
     - book1/15/cnn1d_sentiment_torch (log)
     - book1/15/positional_encoding_torch (log)
     - book1/15/rnn_jax (log)
     - book1/15/entailment_attention_mlp_torch (log)
     - book1/19/finetune_cnn_torch (log)
     - book1/19/finetune_cnn_jax (log)
     - book1/19/image_augmentation_jax (log)

nalzok commented 2 years ago

Hi @karm-patel @murphyk, it looks like most of the D2L notebooks are failing due to limitations of the workflow environment. I am seeing a lot of "Cell execution timed out" errors in the training loops and pip install cells, and some memory errors such as this one. The code looks fine; the cells just cannot finish running within 600 seconds. Is there any way we can loosen the restrictions?

karm-patel commented 2 years ago

Hi @nalzok,

  1. For the cell execution timeouts, I think you can increase the timeout (maybe up to 1200 s?), because I think there is no harm if our workflow runs for a long time.
  2. For the memory errors, I'm not sure what we can do. Do you have any solution? Can we use dataset splits in tfds to partially load the dataset? @patel-zeel, Dr @murphyk, would you like to suggest anything?

nalzok commented 2 years ago

Thanks @karm-patel, I just did a quick commit to increase the timeout to 1200 seconds: https://github.com/probml/pyprobml/commit/17909018b23f13745998bbf846132be7a0c90d82.

nalzok commented 2 years ago

@patel-zeel Just to check, there are no GPUs on the workflow machines, right? If that is the case, no wonder everything times out: it's just painfully slow to train neural networks on CPUs, and bumping the timeout to 1200 seconds doesn't help much. Additionally, the notebooks that assume the presence of a GPU will never pass, e.g. book1/13/multi_gpu_training_torch.ipynb.

Anyway, I am changing the timeout back to 600 seconds so that we can have a faster feedback cycle.

patel-zeel commented 2 years ago

@nalzok That's true. The only option to run the workflow on GPUs is to self-host such runners. I see these immediate solutions, waiting for your input:

1) We can add a mechanism in the code that detects CPU or GPU. But I think JAX does not require such a mechanism, since it adapts to jax[cpu] or jax[cuda], whichever is installed. WDYT? For torch, I have seen the following idiom:

device = "cuda" if torch.cuda.is_available() else "cpu"

2) To reduce the training time, we can add the environment variable WORKFLOW_RUN to reduce the number of iterations to a bare minimum. Due to this, we will not be able to extract the saved PDFs directly from auto_generated_figures, but for special cases like this, we can run the notebooks locally and extract the figures.
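A minimal sketch of option 2, assuming the flag is read once and turned into a step count. Only the name WORKFLOW_RUN comes from the proposal above; the helper num_training_steps, the default of 2 steps, and the rule for which values count as "on" are hypothetical choices for illustration.

```python
import os


def num_training_steps(full_steps: int, ci_steps: int = 2) -> int:
    """Return a reduced step count when running inside the CI workflow."""
    # WORKFLOW_RUN is the variable name proposed in this thread. Treating any
    # value other than "", "0" or "false" as "on" is an assumption.
    flag = os.environ.get("WORKFLOW_RUN", "").strip().lower()
    in_workflow = flag not in ("", "0", "false")
    return min(full_steps, ci_steps) if in_workflow else full_steps


n_steps = num_training_steps(full_steps=10_000)
for step in range(n_steps):
    pass  # placeholder for one training step
```

With this shape the notebook code stays identical locally and in the workflow; only the environment decides how long the loop runs.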

nipunbatra commented 2 years ago

@nalzok @patel-zeel Would it be useful to store the model (if it is not huge)? Then perhaps we would train only if the stored model does not exist. Of course, this approach has the issue that if we always use the stored model, we will be unable to catch workflow errors in the training loop.
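The idea of training only when no stored model exists could look roughly like the sketch below. It uses pickle purely for illustration; load_or_train, toy_train, the force_retrain flag and the file layout are all hypothetical, and a real notebook would use its framework's own checkpoint format.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_train(path: Path, train_fn: Callable[[], Any], force_retrain: bool = False) -> Any:
    """Load a stored model if present, otherwise train and store it."""
    if path.exists() and not force_retrain:
        # Only unpickle files you created yourself: pickle can execute code.
        with path.open("rb") as f:
            return pickle.load(f)
    model = train_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    return model


calls = []


def toy_train() -> dict:
    """Stand-in for an expensive training loop."""
    calls.append(1)
    return {"weights": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "models" / "toy.pkl"
    first = load_or_train(ckpt, toy_train)   # trains and saves
    second = load_or_train(ckpt, toy_train)  # loads from disk, no training
```

The force_retrain flag is one possible answer to the caveat raised above: an occasional run with force_retrain=True would still exercise the training loop.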

nalzok commented 2 years ago

@patel-zeel @nipunbatra Thanks for the input! It seems that all workarounds introduce some amount of complexity, but are imperfect in one way or another. We are essentially emulating a GPU-equipped Colab runner with a GitHub runner, but the impedance mismatch is giving us a hard time. This makes me wonder if it is possible to reverse engineer the Colab API to execute the notebook specified by a URL, e.g. https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/14/resnet_torch.ipynb. Pinging @murphyk: do you happen to know someone from the Colab team who could offer some help?

Besides that, I'm wondering whether it's really worth the effort. I mean, the D2L notebooks seem quite independent from the rest of the repository, so they are unlikely to break no matter how the codebase evolves. We can manually go over each notebook to make sure it works on Colab, and they will stay correct for a long time until there is something like a breaking change. When that happens, hopefully someone will open an issue and we will be notified. Excluding all D2L notebooks from the workflow also reduces the execution time of each workflow run and accelerates the feedback cycle. What's your opinion about that?

nipunbatra commented 2 years ago

I agree that it may not be worth the effort to put the D2L notebooks on the workflow, as they are indeed very specialised. Excluding them from the workflow and manually fixing the issues might be a good use of our time.

karm-patel commented 2 years ago

Yes, I agree. @nalzok, I think we can put these notebooks in [IGNORE_LIST](https://github.com/probml/pyprobml/blob/master/tests/test_notebooks.py#:~:text=IGNORE_LIST%20%3D%20%5B%5D,strip().split(%22/%22)%5B%2D1%5D). I've created copied_from_misc.txt, which contains a list of notebooks that are ignored by test_notebooks.py because they are just tutorials and not part of the textbook. So I guess you can add the d2l-notebooks to this .txt file and maybe rename the file to something more meaningful (ignored_notebooks.txt?).
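For readers unfamiliar with the mechanism, here is a rough sketch of how a text-file ignore list can be applied. It is not the actual code in test_notebooks.py: the functions read_ignore_list and filter_notebooks, the support for "#" comments, and the assumption that entries carry an .ipynb suffix are all hypothetical. The only detail taken from the link above is that entries are reduced to their last path component with strip().split("/")[-1].

```python
from pathlib import Path
from typing import Iterable, List, Set


def read_ignore_list(lines: Iterable[str]) -> Set[str]:
    """Collect notebook file names, skipping blank lines and '#' comments."""
    names = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            # Keep only the file name, as in strip().split("/")[-1].
            names.add(entry.split("/")[-1])
    return names


def filter_notebooks(paths: Iterable[str], ignored: Set[str]) -> List[str]:
    """Return the notebooks whose file name is not in the ignore set."""
    return [p for p in paths if Path(p).name not in ignored]


# Stand-in for the lines of a file such as ignored_notebooks.txt.
ignore_lines = [
    "# d2l notebooks\n",
    "book1/14/resnet_torch.ipynb\n",
    "\n",
    "book1/15/rnn_jax.ipynb\n",
]
ignored = read_ignore_list(ignore_lines)
all_notebooks = [
    "notebooks/book1/14/resnet_torch.ipynb",
    "notebooks/book1/20/pcaImageDemo.ipynb",
]
to_run = filter_notebooks(all_notebooks, ignored)
```

Matching on the file name alone is simple, but it would also skip an unrelated notebook that happens to share a name with an ignored one in a different chapter.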

nalzok commented 2 years ago

Hi @karm-patel, thanks for the tips. It has been done!

karm-patel commented 2 years ago

Some non-trivial notebook execution errors in book2

  1. smc_tempered_1d_bimodal.log, linreg_hierarchical_non_centered_blackjax.log: these notebooks were written against an old blackjax version, so they need to be refactored.
  2. variational_mixture_gaussians_demo.log, thompson_sampling_linear_gaussian.log: from jax.ops import index_update in probml_utils.variational_mixture_gaussians fails; it seems index_update was removed in newer jax versions, see https://github.com/google/jax/issues/10293
  3. vb_gmm_tfp: tensorflow version conflicts
  4. adf_logistic_regression_demo.log: cannot find logreg_biclusters_demo.py in jsl.demos
  5. lecun1989_flax: needs to be tested in Colab; it takes more than 1200 s
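For item 2 in the list above, the usual migration away from jax.ops.index_update is the .at[...] indexed-update syntax on JAX arrays. The snippet below is a generic illustration with made-up arrays, not the actual code in probml_utils.variational_mixture_gaussians.

```python
import jax.numpy as jnp

x = jnp.zeros(5)

# Old API, no longer importable in recent JAX releases:
#   from jax.ops import index_update
#   y = index_update(x, 2, 10.0)

# Replacement: functional update through the .at property.
# JAX arrays are immutable, so x itself is left unchanged.
y = x.at[2].set(10.0)

# The same property also covers the old index_add style of update.
z = y.at[0:2].add(1.0)
```

Each index_update(arr, idx, val) call maps to arr.at[idx].set(val), so the change is mechanical once the import is removed.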