utterworks / fast-bert

Super easy library for BERT based NLP models

Can't see any metric while training #29

Open 008karan opened 5 years ago

008karan commented 5 years ago

Earlier I was able to see accuracy and F-beta score while training the model, but now I can't see anything. The model just completes its epochs without printing anything. Any suggestions?

kaushaltrivedi commented 5 years ago

Please try again with the latest version of the library.

008karan commented 5 years ago

I have tried using the latest version and still have the same issue. I also tried appending to the metrics list, but then got an error. This is the output I see:


```
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
 60.00% [6/10 1:05:32<43:41]
 47.85% [100/209 05:03<05:31]
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
 100.00% [24/24 00:20<00:00]
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
 100.00% [24/24 00:20<00:00]
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
 100.00% [24/24 00:20<00:00]
(the 100.00% [24/24 00:20<00:00] validation progress line repeats many more times, with no metrics printed in between)
```
amin-nejad commented 5 years ago

Had the same issue of no metrics being printed at all. It seems to be because the logger's default level only prints warning messages, not info messages. If you are using the root logger (e.g. logger = logging.getLogger()), run this line before you create the logger object:

```
logging.basicConfig(level=logging.NOTSET)
```

If you are defining a custom logger yourself (e.g. logger = logging.getLogger("my-logger")), run this line before you create it:

```
logging.root.setLevel(logging.NOTSET)
```

Now the training process will print out the loss as well as any other metrics you passed to the learner object.
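
Putting the two cases together, a minimal setup looks roughly like this (the name "my-logger" is just an illustrative placeholder):

```
import logging

# Case 1: root logger - set a permissive level before anything else configures logging
logging.basicConfig(level=logging.NOTSET)
logger = logging.getLogger()

# Case 2: custom named logger - lower the root level so INFO records get through
# ("my-logger" is only an example name)
# logging.root.setLevel(logging.NOTSET)
# logger = logging.getLogger("my-logger")
```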

Alternatively, you can always view the training progress, either during or after training, using TensorBoard. The training run creates a folder called tensorboard with all the event files in there.

008karan commented 5 years ago

@amin-nejad Thanks for the suggestion. I tried your method but still have the same issue. I added your line before creating the logger object.


```
import logging
import torch

# BertLearner lives in fast_bert.learner_cls in recent versions
# (fast_bert.learner in some older ones); databunch and MODEL_PATH are defined earlier.
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy_thresh, roc_auc, fbeta, accuracy_multilabel

logging.basicConfig(level=logging.NOTSET)
logger = logging.getLogger()
device_cuda = torch.device("cuda")

metrics = []
metrics.append({'name': 'accuracy_thresh', 'function': accuracy_thresh})
metrics.append({'name': 'roc_auc', 'function': roc_auc})
metrics.append({'name': 'fbeta', 'function': fbeta})
metrics.append({'name': 'accuracy_single', 'function': accuracy_multilabel})

learner = BertLearner.from_pretrained_model(
                        databunch,
                        pretrained_path='bert-base-uncased',
                        metrics=metrics,
                        device=device_cuda,
                        logger=logger,
                        output_dir=MODEL_PATH,
                        finetuned_wgts_path=None,
                        warmup_steps=50,
                        multi_gpu=True,
                        is_fp16=True,
                        multi_label=True,
                        logging_steps=50)
```
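
For reference, training itself is started with a fit call roughly like the one below; with the logger configured as above, this is where the loss and the configured metrics should get printed. The epoch count and learning rate are placeholder values, and the keyword names follow the fast-bert README, so they may differ slightly between library versions.

```
# Placeholder hyperparameters; keyword names per the fast-bert README and may
# vary between library versions.
learner.fit(epochs=4,
            lr=6e-5,
            validate=True,                  # run validation (and metric logging) each epoch
            schedule_type="warmup_linear")
```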

Don't know what's going wrong.
Also, can you tell me how to use TensorBoard here? I tried adding it to the model.fit() call but got an error. I think I am missing something. Can you help?
amin-nejad commented 5 years ago

Not sure what the problem is then; changing the logging as I suggested above worked for me.

Re TensorBoard, you don't really need to do anything. Once you begin training by calling learner.fit(), you should see a subdirectory called tensorboard inside the directory you specified as your output_dir (MODEL_PATH in your case). While the model is training, it will continuously update a file in there whose name will be something like events.out.tfevents.1565791914.vm1.

On your terminal, all you need to do is change into the tensorboard directory and run tensorboard --logdir=. from there. Ensure you have TensorFlow installed in the environment you are using; TensorBoard comes with it automatically. If you are using a virtual machine, you also need to make sure port 6006 is open.
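
If you'd rather inspect the logged values without launching the TensorBoard UI, a rough sketch using the event-processing API that ships with TensorBoard could look like this (the directory path is just an example, and the available tags depend on what the learner actually logged):

```
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at the tensorboard subdirectory created under output_dir (example path)
acc = EventAccumulator("output/tensorboard")
acc.Reload()  # parse the events.out.tfevents.* files on disk

print(acc.Tags()["scalars"])  # names of the scalars that were logged (loss, metrics, ...)
for event in acc.Scalars(acc.Tags()["scalars"][0]):
    print(event.step, event.value)
```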

You can find more info here