microsoft / archai

Accelerate your Neural Architecture Search (NAS) through fast, reproducible and modular research.
https://microsoft.github.io/archai
MIT License

[BUG] some minor notebook bugs #155

Closed lovettchris closed 1 year ago

lovettchris commented 1 year ago

Describe the bug

This is unfiltered "first impression" feedback from playing with some of the new Jupyter notebooks. The notebooks are great, but there's a lot of overlap across them. Is that by design? For example, docs\getting_started\notebooks\cv\pl_trainer.ipynb seems to be low level and doesn't show the connection to Archai as much as algos.ipynb, which ties things together and already shows how to create the MnistDatasetProvider. So is this redundancy by design, or is it just a work in progress?

Feedback on docs\getting_started\notebooks\cv\pl_trainer.ipynb

  1. Some typos:
  2. Every time I execute this block, the val_loss increases:

[screenshot]

After about 10 runs I see this:

[screenshot]

If I re-execute the initial code blocks to recreate the dataset and model, it drops back to around 2. Is the model stateful or something? Perhaps model = Model() should be in the last block, not the block that defines the model? (See the sketch after this list.)

  3. When I set max_epochs=100 instead of max_steps=1, the progress bar output is confusing, as it says "1/1":

[screenshot]

But I think it did do the 100 epochs?

  4. When I ask for the test dataset with test_dataset = dataset_provider.get_test_dataset(), I get a weird warning: archai.datasets.cv.mnist_dataset_provider — WARNING — Testing set not available for mnist. Returning validation set. It would be nice to show the proper procedure here, even with MNIST, and add the test code block trainer.test(model, DataLoader(test_dataset)) at the end (see the sketch after this list).

  5. It is unfortunate that the PlTrainer takes "accelerator='gpu'" whereas the PartialTrainingValAccuracy in algos.ipynb takes device="cuda"...
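To make points 2 and 4 concrete, here is roughly the shape I would expect for the final cells. This is only a sketch: Model is the LightningModule defined in the notebook, and the provider accessor names other than get_test_dataset() are assumptions.

```python
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

from archai.datasets.cv.mnist_dataset_provider import MnistDatasetProvider

dataset_provider = MnistDatasetProvider()
train_dataset = dataset_provider.get_train_dataset()  # accessor name assumed
val_dataset = dataset_provider.get_val_dataset()      # accessor name assumed

# Point 2: construct the model in the same cell that trains it, so re-running
# the cell starts from fresh weights instead of training the same instance.
model = Model()  # the LightningModule defined earlier in the notebook

# Plain Lightning Trainer shown here; the notebook's PlTrainer wrapper takes
# the same keyword arguments.
trainer = Trainer(max_steps=1, accelerator="gpu", devices=1)
trainer.fit(model, DataLoader(train_dataset), DataLoader(val_dataset))

# Point 4: end with an explicit test step (today get_test_dataset() falls back
# to the validation set and logs a warning).
test_dataset = dataset_provider.get_test_dataset()
trainer.test(model, DataLoader(test_dataset))
```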

gugarosa commented 1 year ago

I would say that the redundancy is a work in progress, since I drafted these notebooks off the top of my head, without actually knowing whether they would be sufficient for users. Nevertheless, I am glad that you are offering great feedback, and I will definitely improve their design and connections to resemble the notebooks that Piero has implemented for discrete search.

Answers to the feedback

  1. Fixed.

  2. Yes, model = Model() should be in the last cell; otherwise it just keeps accumulating the losses.

  3. Did you, by any chance, remove these arguments: limit_train_batches=1, limit_test_batches=1, limit_predict_batches=1? They limit the number of batches passed during training, and even though max_steps or max_epochs is set to a higher number, they take precedence and cap training at a single batch, which would explain the progress bar showing only 1/1. (There is a minimal illustration at the end of this comment.)

  4. By default, MNIST only offers two sets, which in our case are used for training and validation. Do you think it's worth splitting the validation set into validation/test and then allowing get_test_dataset() to be invoked (see the split sketch below)? Or should we just improve the warning message?

  5. Unfortunately, accelerator='gpu' is an argument of the PyTorch Lightning trainer. We could do one of the following:

    • Add a device keyword argument to PlTrainer and map it to accelerator in the class constructor (see the wrapper sketch below).
    • Change PartialTrainingValAccuracy to take accelerator instead of device.

    Which one do you think would be better?
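For point 3, a minimal Lightning-only illustration of why those limits dominate (these are standard pytorch_lightning.Trainer arguments):

```python
from pytorch_lightning import Trainer

# limit_train_batches=1 caps every epoch at a single batch, so the progress
# bar reads "1/1" even though all 100 epochs still run.
trainer = Trainer(
    max_epochs=100,
    limit_train_batches=1,
    limit_val_batches=1,
)
```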
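If splitting is the way to go for point 4, here is a sketch using torch.utils.data.random_split, assuming get_val_dataset() returns a standard torch Dataset (the accessor name and the 50/50 ratio are assumptions):

```python
import torch
from torch.utils.data import random_split

from archai.datasets.cv.mnist_dataset_provider import MnistDatasetProvider

dataset_provider = MnistDatasetProvider()
val_dataset = dataset_provider.get_val_dataset()  # accessor name assumed

# Carve a held-out test subset out of the current validation set.
n_test = len(val_dataset) // 2
n_val = len(val_dataset) - n_test
val_subset, test_subset = random_split(
    val_dataset,
    [n_val, n_test],
    generator=torch.Generator().manual_seed(42),  # keep the split reproducible
)
```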
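And for the first option in point 5, a sketch of the wrapper idea; this is not the actual PlTrainer code, and the mapping dict is hypothetical:

```python
from pytorch_lightning import Trainer

# Hypothetical mapping from torch-style device names to Lightning accelerators.
_DEVICE_TO_ACCELERATOR = {"cuda": "gpu", "cpu": "cpu"}

class PlTrainer(Trainer):
    def __init__(self, device: str = "cpu", **kwargs):
        # Translate device="cuda" into accelerator="gpu" so PlTrainer and
        # PartialTrainingValAccuracy accept the same argument.
        kwargs.setdefault("accelerator", _DEVICE_TO_ACCELERATOR.get(device, device))
        super().__init__(**kwargs)

# Usage: PlTrainer(device="cuda", max_steps=1) would behave like
# Trainer(accelerator="gpu", max_steps=1).
```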