wandb / examples

Example deep learning projects that use wandb's features.
http://wandb.ai

We should incorporate the Colabs into this repository #28

Closed charlesfrye closed 3 years ago

charlesfrye commented 4 years ago

Rationale

The Colabs are an important part of the purpose of this repo: allowing users to get a sense for how wandb works in their use case.

For that reason, we should incorporate them and control their versions. Not least so that I can stop bugging @ayulockin with issues and start bugging him with PRs.

I will open a PR shortly that demonstrates how this would work for a single Colab, but as this is a big change, I wanted to track it with an issue as well.

Process

From scratch, the process operates as follows:

  1. A Colab author writes new material in a Colab hosted on their personal Drive. They iterate until it is ready to be committed.
  2. For notebooks on which we want to preserve outputs, they restart the kernel (a factory reset is preferred, since it also checks for errors in downloading/installation) and re-run the notebook. See below for more notes on VCS best practices with notebooks.
  3. They click "File > Save a copy in GitHub" in the toolbar.
  4. The author then sets the repository, branch, and file to commit, using the dialog box pictured below. Ideally, new branches are created and merged through PRs. New branches can be made completely within GitHub, so just click the link in the dialog box. You may need to re-open the dialog box once the branch has been created. The file name should be whatever is automatically generated by Colab based on the title (sans Copy_of_ prefix). The path should begin colabs/{identifier/}?, where the {identifier} directories help organize the Colabs, as in the examples directory. The ? indicates that one or more identifiers can be used. For Colabs, one is probably enough, since each example is only a single file (the notebook), whereas the examples tend to have entire folders.
  5. The "include a link to Colaboratory" box should always be ticked. This results in the creation of a button at the top of the notebook stored in GitHub that can be used to directly open up the notebook in Colab. It appears below.
  6. The author then writes a commit message and clicks OK. Note that you will not be able to view a diff unless you do a PR, at which point the ReviewNB tool will pop up.
  7. If making a PR, the author then does so via the GitHub interface.

For changes to existing notebooks, we enter at step 2, by clicking the badge: Open In Colab.
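
To make the naming convention from step 4 and the badge from step 5 concrete, here is a minimal sketch. The helper names, the notebook file name, and the master branch are assumptions for illustration only; the URL pattern is the one Colab uses for opening GitHub-hosted notebooks.

```python
from urllib.parse import quote

COLAB_BASE = "https://colab.research.google.com/github"
BADGE_SVG = "https://colab.research.google.com/assets/colab-badge.svg"


def repo_path(filename, identifiers=()):
    """Build the in-repo path colabs/{identifier/}?{filename} (hypothetical helper)."""
    # Colab prefixes copied notebooks with "Copy_of_"; drop it, per step 4.
    prefix = "Copy_of_"
    if filename.startswith(prefix):
        filename = filename[len(prefix):]
    return "/".join(["colabs", *identifiers, filename])


def colab_url(path, repo="wandb/examples", branch="master"):
    """URL that opens a GitHub-hosted notebook in Colab (branch is an assumption)."""
    return f"{COLAB_BASE}/{repo}/blob/{branch}/{quote(path)}"


def badge_markdown(path, **kwargs):
    """Markdown for an "Open In Colab" badge that links to the notebook."""
    return f"[![Open In Colab]({BADGE_SVG})]({colab_url(path, **kwargs)})"


# Hypothetical notebook name, used only to show the shape of the result.
path = repo_path("Copy_of_Simple_Keras_Integration.ipynb", ["keras"])
print(path)
print(colab_url(path))
print(badge_markdown(path))
```

Ticking the "include a link to Colaboratory" box generates the badge for you, so this is only useful for checking links or adding badges elsewhere (e.g. in README.md).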

Best Practices for Version Control in Notebooks

I've VC'd two large repos of notebooks: one well and one poorly. Here's what I've learned in the process:

  1. Avoid including outputs whenever possible. Outputs, especially large tables and binaries (eg media), add weight to the diffs. Only include them when they're an important part of the reader experience (eg wandb outputs). Examples: Use %%capture on installation cells. Present media as screenshots in Markdown, rather than using IPython to render them, as these are less likely to change. Use the "private outputs" setting if possible.
  2. Reduce volatility in outputs. When outputs vary from execution to execution, they add noise to the diff. Examples: Instead of logging a pseudo-random data value, log a specific value. Instead of printing a value, assert something is True.
  3. Always fix the random seed. This is a specific instance of the above, but it's important enough to bear repeating. Note that there may be more than one PRNG (e.g. Python and TF), so make sure to seed all of the relevant PRNGs. Double-check that your outputs don't vary! Examples: random.seed(42); tf.random.set_seed(117). GPUs are not always deterministic; see this StackOverflow post.
  4. Design your code to be run linearly. That is, treat the code cells as though they were going into a script, to be executed in order. Otherwise, step 2 above has to be done manually, rather than just by clicking "re-run all cells". In the future, it might even be automated. Examples: Copy code and make changes, rather than directing users to make changes.
  5. Never throw an error. This is again an important special case: if a code cell throws an error, execution will stop. This is especially important for future automation. Examples: Use try and except rather than showing an uncaught Error.
  6. Pin versions. As versions change, DeprecationWarnings appear, behavior slightly changes, etc. If you include a pip install step for anything other than wandb, always pin the version. Unfortunately, the Colab environment is out of our control, so changes to the environment will happen anyway. Examples: a cell with %%capture on the first line and pip install -qq numpy==1.16 on the second.
  7. Be mindful of iteration time. This reduces the burden of re-executing the notebook and often improves the reader's experience. Remember that the point of the examples isn't to achieve SOTA performance. Examples: reduce the length of Sweeps (and always set a length, due to principle 4). Subsample datasets by batch (e.g. x[::5], y[::5]) and work with the smallest reasonable dataset.
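
As a rough illustration of practices 2, 3, 5, and 7 together, a notebook cell might look something like the sketch below. The data and the safe_mean helper are made up for the example and use only the standard library.

```python
import random

# Practice 3: seed the PRNG so the "data" is identical on every run.
random.seed(42)
x = [random.random() for _ in range(100)]
y = [2 * xi for xi in x]

# Practice 7: subsample by striding to keep iteration time short.
x_small, y_small = x[::5], y[::5]

# Practice 2: assert a property instead of printing a run-dependent value.
assert len(x_small) == len(y_small) == 20


def safe_mean(values):
    """Return the mean, or None for empty input instead of raising."""
    # Practice 5: catch the expected error so "run all" never halts here.
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        return None


assert safe_mean([]) is None
assert 0.0 <= safe_mean(x_small) <= 1.0
print("all checks passed")
```

The only output is a fixed string, so re-running the cell leaves the diff untouched.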

Here's a snippet that can be used to get deterministic behavior in TF:

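# Ensure deterministic behavior
import os, random, zlib
import numpy as np
import tensorflow as tf

os.environ['TF_CUDNN_DETERMINISTIC'] = '1'  # ask TF for deterministic cuDNN ops
# Seed with zlib.crc32: by default hash() of a str changes between Python processes
random.seed(zlib.crc32(b"setting random seeds"))
np.random.seed(zlib.crc32(b"improves reproducibility"))
tf.random.set_seed(zlib.crc32(b"by removing stochasticity"))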

and one in PyTorch:

# Ensure deterministic behavior
import random, zlib
import numpy as np
import torch

torch.backends.cudnn.deterministic = True  # use deterministic cuDNN algorithms
# Seed with zlib.crc32: by default hash() of a str changes between Python processes
random.seed(zlib.crc32(b"setting random seeds"))
np.random.seed(zlib.crc32(b"improves reproducibility"))
torch.manual_seed(zlib.crc32(b"by removing stochasticity"))
torch.cuda.manual_seed_all(zlib.crc32(b"so runs are repeatable"))
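
One way to act on "double-check that your outputs don't vary" is to run the same seeded computation twice and compare. The sketch below is a stand-in, not part of any wandb API: seed_everything and noisy_step are hypothetical names, and only the standard-library PRNG is seeded to keep it dependency-free.

```python
import random


def seed_everything(seed):
    """Seed the PRNGs in use (hypothetical helper; stdlib only here)."""
    random.seed(seed)
    # In a real notebook, also seed NumPy/TF/PyTorch as in the snippets above.


def noisy_step():
    """Stand-in for a training step that draws random numbers."""
    return [random.gauss(0, 1) for _ in range(5)]


def run(seed):
    """Reseed, then execute the stochastic step once."""
    seed_everything(seed)
    return noisy_step()


first, second = run(117), run(117)
assert first == second, "outputs vary between runs: a PRNG is left unseeded"
assert run(117) != run(118)
print("outputs are repeatable")
```

If the first assertion fails in a real notebook, some source of randomness (another library's PRNG, or a non-deterministic GPU op) is still unseeded.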

Appendix: Dialog Box on Colab

[Image: the "Save a copy in GitHub" dialog box in Colab]

charlesfrye commented 4 years ago

@lavanyashukla, would love your input on this (incl. just a 👍) and if you could rope in anyone besides Ayush who's working on the Colabs in this repo.

ayulockin commented 4 years ago

I just have one thing to say: this is a well-put guideline, @charlesfrye.

Will just add one pointer:

ayulockin commented 4 years ago

One more pointer:

charlesfrye commented 4 years ago

@ayulockin I appreciate the points about code readability! I agree with all of them. Combined with the points about VC, we're starting to get something of a Colab Style Guide going. Might write that in Notion once I've got a few more under my belt.

For authorship -- I agree with you that we need to track that better. But once we've got the Colabs in GH, the authorship will be in the history. I'd rather keep it there than expose it to readers as a section.

ayulockin commented 4 years ago

Notion doc on Colab Style guide would be awesome.

I will agree on your take for authorship. 💯

charlesfrye commented 4 years ago

Colabs to Integrate

From README.md:

From Carey:

From elsewhere

neomatrix369 commented 4 years ago

Great to see so much activity going on in this repo - can't wait to see the outcome when these PRs are all merged.

neomatrix369 commented 4 years ago

Notion doc on Colab Style guide would be awesome.

I will agree on your take for authorship. 💯

What's this Notion doc you two have been talking about? I must be missing all the nice notebook/colab/kernel goodies.

charlesfrye commented 4 years ago

Glad you're as excited as I am, Mani!

Notion is a collaborative document-editing platform. We use it internally for lots of things, including style guides for authors of W&B-related content.

neomatrix369 commented 4 years ago

If you need any help with Feature Importances created using W&B and a few of the model types/frameworks, please tag me. I have created a few PRs on the wandb/client repo for this and also linked them to the respective notebooks.

Let me know if you'd like me to share these and if they would add value to your current work.

charlesfrye commented 3 years ago

The core set of colabs has now been integrated (🎉), so I'm going to close this issue and open issues for specific colabs that need to be added or edited.