tensorflow / adanet

Fast and flexible AutoML with learning guarantees.
https://adanet.readthedocs.io
Apache License 2.0

Evaluation issue using TPUEstimator #156

Open nicholasbreckwoldt opened 4 years ago

nicholasbreckwoldt commented 4 years ago

I'm running into an issue with the AdaNet TPUEstimator. Say the estimator is configured with max_iteration_steps=500, and I want to evaluate the model's performance after every 100 training steps (i.e. steps_per_evaluation=100) over 2 complete AdaNet iterations.

To achieve this, I run estimator.train(train_input, max_steps) followed by estimator.evaluate(eval_input) in a loop, incrementing max_steps by steps_per_evaluation at the end of each pass, until max_steps=1000 is reached (i.e. 2 complete AdaNet iterations).
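A minimal sketch of the loop described above, for anyone trying to reproduce this. It assumes only that the estimator exposes the standard `train(input_fn, max_steps=...)` and `evaluate(input_fn, steps=...)` methods. The function name `train_with_periodic_eval` and the `total_steps` and `eval_steps` parameters are hypothetical and not part of AdaNet.

```python
def train_with_periodic_eval(estimator, train_input_fn, eval_input_fn,
                             total_steps=1000, steps_per_evaluation=100,
                             eval_steps=None):
    """Alternate train() and evaluate() until total_steps is reached.

    Hypothetical helper: `estimator` is anything with Estimator-style
    train(input_fn, max_steps=...) and evaluate(input_fn, steps=...) methods.
    """
    if steps_per_evaluation <= 0:
        raise ValueError("steps_per_evaluation must be positive")

    results = []
    max_steps = 0
    while max_steps < total_steps:
        # max_steps is a ceiling on the global step, not a per-call count,
        # so raise it by one evaluation interval on each pass.
        max_steps = min(max_steps + steps_per_evaluation, total_steps)
        estimator.train(input_fn=train_input_fn, max_steps=max_steps)
        # eval_steps is an assumption: pass a value if your eval input_fn
        # never ends or your estimator needs a fixed number of eval batches.
        metrics = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
        results.append((max_steps, metrics))
    return results
```

With max_iteration_steps=500, total_steps=1000 and steps_per_evaluation=100, this makes ten train/evaluate pairs, and the second AdaNet iteration would be expected to start once the global step passes 500.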

When running in local mode (i.e. use_tpu=False), training proceeds as expected: it completes 2 AdaNet iterations (steps 0 to 500 for the first iteration and steps 500 to 1000 for the second, with evaluation every 100 steps). However, when running on Cloud TPU (i.e. use_tpu=True), training reaches max_steps=1000 without ever progressing to a second iteration.

On the other hand, a single call to estimator.train(train_input, max_steps=1000) on Cloud TPU, without any estimator.evaluate call, results in 2 complete AdaNet iterations as expected. This makes me think the issue lies with the evaluation call. What could be causing it? If this is a TPUEstimator-related issue, am I constrained to the standard Estimator if I want this kind of train-evaluate loop?

cweill commented 3 years ago

@nicholasbreckwoldt: We just released adanet==0.9.0, which includes better TPU and TF 2 support. Please try installing it, and let us know if it resolves your issue.

nicholasbreckwoldt commented 3 years ago

@cweill Thanks for the update! I am running into a new issue after upgrading to TF 2.2 and adanet==0.9.0, which has so far prevented me from establishing whether the evaluation issue above has been resolved. I've described the new issue in #157.