008karan opened this issue 5 years ago
Please try again with the latest version of the library.
I have tried using the latest version and still see the same issue. I also tried appending to the metrics list, but then I got an error.
```
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
60.00% [6/10 1:05:32<43:41]
47.85% [100/209 05:03<05:31]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
100.00% [24/24 00:20<00:00]
100.00% [24/24 00:20<00:00]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
100.00% [24/24 00:20<00:00]
100.00% [24/24 00:20<00:00]
100.00% [24/24 00:20<00:00]
100.00% [24/24 00:20<00:00]
100.00% [24/24 00:20<00:00]
100.00% [24/24 00:20<00:00]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
100.00% [24/24 00:20<00:00]
(the same 100% progress line repeats for the remaining batches)
```
I had the same issue of no metrics being printed at all. It seems to happen because the logger's default level only prints warning messages, not info messages.

If you are using the root logger (e.g. `logger = logging.getLogger()`), run this line before you create the learner object:

`logging.basicConfig(level=logging.NOTSET)`

If you are defining a custom logger yourself (e.g. `logger = logging.getLogger("my-logger")`), run this line before you create the object:

`logging.root.setLevel(logging.NOTSET)`

Now the training process will print out the loss, as well as any other metrics you passed to the learner object.

Alternatively, you can always view the training process, either during training or afterwards, using tensorboard. The training process creates a folder called `tensorboard` with all the events files in there.
@amin-nejad Thanks for the suggestion. I tried your method but still have the same issue. I added your line before creating the `logger` object.
```python
import logging

import torch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import (accuracy, accuracy_thresh, roc_auc,
                               fbeta, accuracy_multilabel)

# Set the root logger level before creating the logger object
logging.basicConfig(level=logging.NOTSET)
logger = logging.getLogger()

device_cuda = torch.device("cuda")

# metrics = [{'name': 'accuracy', 'function': accuracy}]
metrics = []
metrics.append({'name': 'accuracy_thresh', 'function': accuracy_thresh})
metrics.append({'name': 'roc_auc', 'function': roc_auc})
metrics.append({'name': 'fbeta', 'function': fbeta})
metrics.append({'name': 'accuracy_single', 'function': accuracy_multilabel})

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='bert-base-uncased',
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    output_dir=MODEL_PATH,
    finetuned_wgts_path=None,
    warmup_steps=50,
    multi_gpu=True,
    is_fp16=True,
    multi_label=True,
    logging_steps=50)
```
I don't know what's going wrong.
Also, can you tell me how to use tensorboard here? I tried adding it to the `model.fit()` method but got an error. I think I am missing something. Can you help?
Not sure what the problem is then; changing the logging as I suggested above worked for me.

Re tensorboard, you don't really need to do anything. Once you begin training by calling `learner.fit()`, in the directory you have specified as your `output_dir` (`MODEL_PATH` in your case), you should see a subdirectory called `tensorboard`. While the model is training, it will constantly update a file in there whose name will be something like `events.out.tfevents.1565791914.vm1`.

On your terminal, all you need to do is change into the `tensorboard` directory and run `tensorboard --logdir=.`. Ensure you have `tensorflow` installed in the environment you are using; tensorboard should come with it automatically when you install it. If you are using a virtual machine, you also need to ensure you have opened port 6006.

You can find more info here.
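If you want to confirm that training is actually writing events files before launching tensorboard, a small helper like this can locate the newest one under the directory layout described above (`latest_events_file` is a hypothetical helper for illustration, not part of fast-bert):

```python
import glob
import os


def latest_events_file(output_dir):
    """Return the newest tfevents file under <output_dir>/tensorboard, or None."""
    pattern = os.path.join(output_dir, "tensorboard", "events.out.tfevents.*")
    files = glob.glob(pattern)
    # The most recently modified file is the one the current run is updating.
    return max(files, key=os.path.getmtime) if files else None
```

If this returns `None` after training has started, the run is not writing to the expected `tensorboard` subdirectory.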
Earlier I was able to see accuracy and f-beta score while training the model, but now I can't see anything. The model just completes its epochs without printing anything. Any suggestions?