google-deepmind / deepmind-research

This repository contains implementations and illustrative code to accompany DeepMind publications
Apache License 2.0

nfnets: training error #174

Open purvang3 opened 3 years ago

purvang3 commented 3 years ago

First of all, thank you for publishing nfnets. I have started digging deep into the implementation and have some questions.

Unfortunately, I am not able to run experiment.py; I get the following error. I am running on just one GPU for testing.

[Screenshot: Screen Shot 2021-02-18 at 6 40 53 PM]

When I run test.py using fake data, it works without any error.

Thank you

nss-ysasaki commented 3 years ago

With a bit of digging around I managed to get past the error above.

  1. Add the following line to experiment.py:
    if __name__ == '__main__':
      FLAGS(sys.argv)  # <- add this line
      flags.mark_flag_as_required('config')
      platform.main(Experiment, sys.argv[1:])
  2. Add the following lines in the definition of get_config(), in experiment.py:

    config.save_checkpoint_interval = 60
    config.eval_specific_checkpoint_dir = ''
    config.checkpoint_dir = '/path/' # <- add this (modify /path/ appropriately)
    config.train_checkpoint_all_hosts = True # <- and this
    
    return config
  3. Run experiment.py with the --config argument, as follows:
    python nfnets/experiment.py --config nfnets/experiment.py
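
For anyone wondering what the FLAGS(sys.argv) call in step 1 does: in absl, calling a FlagValues object with an argv list parses the command line, and reading a flag before that raises UnparsedFlagAccessError. Below is a minimal sketch of that general absl behaviour, assuming absl-py is installed. The error text from the screenshot is not in this thread, so this does not claim to reproduce it. The names demo_flags, read_config_flag and read_before_parsing are hypothetical, and a plain string flag stands in for the real --config flag, which is defined elsewhere.

```python
from absl import flags

# A private FlagValues registry keeps this sketch independent of the global
# flags.FLAGS object that experiment.py uses (demo_flags is a made-up name).
demo_flags = flags.FlagValues()
flags.DEFINE_string('config', None, 'Path to the config file.',
                    flag_values=demo_flags)


def read_config_flag(argv):
  """Parses argv (program name first), then returns the value of --config."""
  demo_flags(argv)  # Same call shape as FLAGS(sys.argv) in step 1.
  return demo_flags.config


def read_before_parsing():
  """Returns True if reading a flag from an unparsed registry raises."""
  unparsed = flags.FlagValues()
  flags.DEFINE_string('config', None, 'Path to the config file.',
                      flag_values=unparsed)
  try:
    unparsed.config  # No parse has happened yet on this registry.
  except flags.UnparsedFlagAccessError:
    return True
  return False
```

Calling read_config_flag(['experiment.py', '--config', 'nfnets/experiment.py']) mirrors the command line in step 3: the first element plays the role of the program name and the rest are parsed as flags.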

The published version of deepmind/jaxline is outdated, perhaps?

PS: Even with this workaround, training halts with a TypeError, but that's yet another issue...