stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Inconsistent results when loading trained NER model for predictions #1301

Open secsilm opened 11 months ago

secsilm commented 11 months ago

Describe the bug: We recently trained a NER model using charlm, following the instructions here.

python3 -m stanza.utils.training.run_ner bn_daffodil --charlm oscar --save_name bn_daffodil_charlm.pt

Then I noticed that each time I reload the model for prediction, it produces different results for the same input. Here is the code:

import stanza
content = """
Kebakaran TPA Suwung Belum Berakhir, Sampah Tak Bisa Masuk, Badung Manfaatkan 2 TPST dan 29 TPS3R. 
"""
nlp_id = stanza.Pipeline(
    lang="id",
    processors="tokenize,ner",
    ner_model_path="saved_models/ner/id_sample_nertagger.pt",
    ner_forward_charlm_path="saved_models/charlm/id_test_forward_charlm.pt",
    ner_backward_charlm_path="saved_models/charlm/id_test_backward_charlm.pt",
    logging_level='DEBUG',
    use_gpu=False
)
doc = nlp_id(content)
locations = [
    ent.text for sent in doc.sentences for ent in sent.ents if ent.type == "PLC"
]
print(len(locations), locations)

Here are the differing results from two runs: (1, ['TPA Suwung']) and (2, ['TPA Suwung', 'Badung']).

Here is the log:

2023-10-20 14:51:11 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
2023-10-20 14:51:11 DEBUG: Downloading resource file from https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 4.75MB/s]                    
2023-10-20 14:51:12 DEBUG: Loading resource file...
2023-10-20 14:51:12 DEBUG: Processing parameter "processors"...
2023-10-20 14:51:12 WARNING: Language id package default expects mwt, which has been added
2023-10-20 14:51:12 DEBUG: Found tokenize: gsd.
2023-10-20 14:51:12 DEBUG: ner: default is not officially supported by Stanza, loading it anyway.
2023-10-20 14:51:12 DEBUG: Found mwt: gsd.
2023-10-20 14:51:12 DEBUG: Found dependencies [] for processor tokenize model gsd
2023-10-20 14:51:12 DEBUG: Found dependencies [] for processor mwt model gsd
2023-10-20 14:51:12 DEBUG: Found dependencies [] for processor ner model default
2023-10-20 14:51:12 DEBUG: Downloading these customized packages for language: id (Indonesian)...
=======================
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
=======================

2023-10-20 14:51:12 DEBUG: File exists: stanza_resources/id/tokenize/gsd.pt
2023-10-20 14:51:12 DEBUG: File exists: stanza_resources/id/mwt/gsd.pt
2023-10-20 14:51:12 INFO: Loading these models for language: id (Indonesian):
=======================================
| Processor | Package                 |
---------------------------------------
| tokenize  | gsd                     |
| mwt       | gsd                     |
| ner       | t...rtagger.pt |
=======================================

2023-10-20 14:51:12 INFO: Using device: cpu
2023-10-20 14:51:12 INFO: Loading: tokenize
2023-10-20 14:51:12 DEBUG: With settings: 
2023-10-20 14:51:12 DEBUG: {'model_path': 'stanza_resources/id/tokenize/gsd.pt', 'lang': 'id', 'mode': 'predict'}
2023-10-20 14:51:12 INFO: Loading: mwt
2023-10-20 14:51:12 DEBUG: With settings: 
2023-10-20 14:51:12 DEBUG: {'model_path': 'stanza_resources/id/mwt/gsd.pt', 'lang': 'id', 'mode': 'predict'}
2023-10-20 14:51:12 DEBUG: Building an attentional Seq2Seq model...
2023-10-20 14:51:12 DEBUG: Using a Bi-LSTM encoder
2023-10-20 14:51:12 DEBUG: Using soft attention for LSTM.
2023-10-20 14:51:12 DEBUG: Finetune all embeddings.
2023-10-20 14:51:12 INFO: Loading: ner
2023-10-20 14:51:12 DEBUG: With settings: 
2023-10-20 14:51:12 DEBUG: {'model_path': 'saved_models/ner/id_sample_nertagger.pt', 'dependencies': [{}], 'forward_charlm_path': 'saved_models/charlm/id_test_forward_charlm.pt', 'backward_charlm_path': 'saved_models/charlm/id_test_backward_charlm.pt', 'lang': 'id', 'mode': 'predict'}
2023-10-20 14:51:12 DEBUG: Loading saved_models/ner/id_sample_nertagger.pt with pretrain None, forward charlm saved_models/charlm/id_test_forward_charlm.pt, backward charlm saved_models/charlm/id_test_backward_charlm.pt
2023-10-20 14:51:12 DEBUG: Old model format detected.  Updating to the new format with one column of tags
Some weights of the model checkpoint at cahya/roberta-base-indonesian-1.5G were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at cahya/roberta-base-indonesian-1.5G and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2023-10-20 14:51:15 INFO: Done loading processors!

I have done some research, and here are my findings:

  1. If you add the following lines at the beginning of the program, the results will be the same every time:

    import torch
    torch.manual_seed(0)
  2. Once the model is loaded into memory, the predictions will be the same regardless of how many times you make predictions.

  3. I saved nlp_id.processors['ner'].trainer.model.state_dict() twice and found that the differing keys are word_emb.weight and bert_model.pooler.dense.weight (see the sketch after this list).

  4. The code responsible for the warning message "Some weights of RobertaModel were not initialized from the model checkpoint at cahya/roberta-base-indonesian-1.5G and are newly initialized" can be found at: https://github.com/stanfordnlp/stanza/blob/c65b66969469fd29b02ba972830087e4007c6b54/stanza/models/common/bert_embedding.py#L52.
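A minimal sketch of the state_dict comparison from point 3, reusing the pipeline arguments from the code above (the load_ner_state_dict helper is just for illustration; the only stanza attribute it relies on is the trainer.model already mentioned):

import torch
import stanza

def load_ner_state_dict():
    # Rebuild the pipeline from scratch so the NER model is re-read from disk.
    nlp = stanza.Pipeline(
        lang="id",
        processors="tokenize,ner",
        ner_model_path="saved_models/ner/id_sample_nertagger.pt",
        ner_forward_charlm_path="saved_models/charlm/id_test_forward_charlm.pt",
        ner_backward_charlm_path="saved_models/charlm/id_test_backward_charlm.pt",
        use_gpu=False,
    )
    return nlp.processors["ner"].trainer.model.state_dict()

sd1 = load_ner_state_dict()
sd2 = load_ner_state_dict()

# Print every tensor that differs between the two loads; as described above,
# the differing keys were word_emb.weight and bert_model.pooler.dense.weight.
for key in sd1:
    if not torch.allclose(sd1[key], sd2[key]):
        print("differs:", key)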

In conclusion, I speculate that during model loading, the weights of the pooler are not being loaded, causing them to be randomly initialized each time the model is loaded. Setting the random seed manually ensures consistent results. However, I am unable to investigate and fix this issue further myself. I need your help. Thanks!

To Reproduce: Run the code above multiple times. You may need to substitute your own model.

Expected behavior: Consistent results between runs.

Environment (please complete the following information):

Additional context: Here are the state_dict keys for the three models:

id_sample_nertagger.pt

['taggerlstm_h_init',
 'taggerlstm_c_init',
 'delta_emb.weight',
 'input_transform.weight',
 'input_transform.bias',
 'taggerlstm.lstm.weight_ih_l0',
 'taggerlstm.lstm.weight_hh_l0',
 'taggerlstm.lstm.bias_ih_l0',
 'taggerlstm.lstm.bias_hh_l0',
 'taggerlstm.lstm.weight_ih_l0_reverse',
 'taggerlstm.lstm.weight_hh_l0_reverse',
 'taggerlstm.lstm.bias_ih_l0_reverse',
 'taggerlstm.lstm.bias_hh_l0_reverse',
 'tag_clf.weight',
 'tag_clf.bias',
 'crit._transitions']

id_test_forward_charlm.pt and id_test_backward_charlm.pt

['charlstm_h_init',
 'charlstm_c_init',
 'char_emb.weight',
 'charlstm.lstm.weight_ih_l0',
 'charlstm.lstm.weight_hh_l0',
 'charlstm.lstm.bias_ih_l0',
 'charlstm.lstm.bias_hh_l0',
 'decoder.weight',
 'decoder.bias']
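For what it's worth, these key listings can be reproduced by loading the checkpoints directly with torch. The exact layout of a stanza checkpoint is an assumption here (a pickled dict containing the state_dict under some top-level key), so the snippet below just prints any dict of tensors it finds:

import torch

ckpt = torch.load("saved_models/ner/id_sample_nertagger.pt", map_location="cpu")
# Print the parameter names from any tensor dictionary in the checkpoint,
# without assuming which top-level key ('model', 'state_dict', ...) holds it.
for key, value in ckpt.items():
    if isinstance(value, dict) and value and all(torch.is_tensor(v) for v in value.values()):
        print(key, "->", list(value.keys()))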
AngledLuffa commented 11 months ago

Thanks for the detailed writeup. Are you able to make the model available? It will be easier to debug with a known example exhibiting the behavior.

AngledLuffa commented 11 months ago

Alternatively, if you can share your NER conversion code, that would also go a long way towards working it out. I take it you were using this Roberta model?

cahya/roberta-base-indonesian-1.5G

AngledLuffa commented 11 months ago

This issue claims those should not be relevant:

https://github.com/huggingface/transformers/issues/6193

I will have to dig into it a bit more after morning siesta.

AngledLuffa commented 11 months ago

I wanted to tell you that you're crazy, but apparently not

>>> from stanza.models.common.bert_embedding import load_bert
>>> m, t = load_bert("cahya/roberta-base-indonesian-1.5G")
>>> m2, t2 = load_bert("cahya/roberta-base-indonesian-1.5G")
>>> for n, p in m.named_parameters():
...   p2 = m2.get_parameter(n)
...   if not torch.allclose(p, p2):
...     print(n)
...
pooler.dense.weight

But then I'm not sure it has a noticeable effect:

>>> from stanza.models.common.bert_embedding import extract_bert_embeddings
>>> model_name = "cahya/roberta-base-indonesian-1.5G"                                                                   
>>> r = extract_bert_embeddings(model_name, t, m, [["TPA", "Suwung", "Badung"]], m.device, True)[0]
>>> r2 = extract_bert_embeddings(model_name, t2, m2, [["TPA", "Suwung", "Badung"]], m.device, True)[0]
>>> torch.allclose(r, r2)
True

So I think that ultimately I need more to reproduce this, specifically either the model itself or a prescription for generating the dataset (the latter would be quite useful, actually, as it would let us add Indonesian NER to Stanza)

secsilm commented 11 months ago

I wanted to tell you that you're crazy, but apparently not

>>> from stanza.models.common.bert_embedding import load_bert
>>> m, t = load_bert("cahya/roberta-base-indonesian-1.5G")
>>> m2, t2 = load_bert("cahya/roberta-base-indonesian-1.5G")
>>> for n, p in m.named_parameters():
...   p2 = m2.get_parameter(n)
...   if not torch.allclose(p, p2):
...     print(n)
...
pooler.dense.weight

But then I'm not sure it has a noticeable effect:

>>> from stanza.models.common.bert_embedding import extract_bert_embeddings
>>> model_name = "cahya/roberta-base-indonesian-1.5G"                                                                   
>>> r = extract_bert_embeddings(model_name, t, m, [["TPA", "Suwung", "Badung"]], m.device, True)[0]
>>> r2 = extract_bert_embeddings(model_name, t2, m2, [["TPA", "Suwung", "Badung"]], m.device, True)[0]
>>> torch.allclose(r, r2)
True

So I think that ultimately I need more to reproduce this, specifically either the model itself or a prescription for generating the dataset (the latter would be quite useful, actually, as it would let us add Indonesian NER to Stanza)

This is crazy. Different weights but the same embeddings? Is this due to floating-point precision issues? Or is extract_bert_embeddings not using the pooler layer?

As for the model, I have emailed it to you.
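One way to probe that second possibility with plain transformers (an illustrative check under the assumption that stanza consumes token-level hidden states rather than pooler_output; this is not stanza's actual code path):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "cahya/roberta-base-indonesian-1.5G"
tokenizer = AutoTokenizer.from_pretrained(model_name)
m1 = AutoModel.from_pretrained(model_name)
m2 = AutoModel.from_pretrained(model_name)

inputs = tokenizer("TPA Suwung Badung", return_tensors="pt")
with torch.no_grad():
    out1 = m1(**inputs)
    out2 = m2(**inputs)

# The encoder hidden states never pass through the pooler, so they should agree ...
print(torch.allclose(out1.last_hidden_state, out2.last_hidden_state))
# ... while pooler_output goes through the randomly initialized pooler weights.
print(torch.allclose(out1.pooler_output, out2.pooler_output))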

secsilm commented 11 months ago

Alternatively, if you can share your NER conversion code, that would also go a long way towards working it out. I take it you were using this Roberta model? cahya/roberta-base-indonesian-1.5G

I didn't specify this, and I don't know where it came from.

AngledLuffa commented 11 months ago

Good question. If that's being used as the default, perhaps you have an older version of Stanza installed. The current version of Stanza uses this as the default:

indolem/indobert-base-uncased

and you can specify that with

--bert_model indolem/indobert-base-uncased

You might have better luck with that transformer. I checked the parameters in that one, and none of them are different between instances of the model.


AngledLuffa commented 11 months ago

Did you ever get it to work with a different transformer?

AngledLuffa commented 10 months ago

Well, I never heard back about using a different transformer, but I think it should improve results compared to having the pooler layers randomly initialized. I filed an issue on HF to see if they'll be able to update the model with fixed layers. In the meantime, there are several other Indonesian transformer models available, and I suggest using one of those instead. (Alternatively, you could always fine tune this transformer, and then at least the random initialization will be fine tuned a bit.)

https://huggingface.co/cahya/roberta-base-indonesian-1.5G/discussions/2
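If it helps, a candidate checkpoint can be vetted before training by repeating the parameter comparison from earlier in the thread; the loads_deterministically helper below is just an illustrative wrapper around that check:

import torch
from transformers import AutoModel

def loads_deterministically(model_name):
    # Load the checkpoint twice; any parameter that differs between the two
    # instances was not stored in the checkpoint and is randomly initialized.
    m1 = AutoModel.from_pretrained(model_name)
    m2 = AutoModel.from_pretrained(model_name)
    ok = True
    for name, p1 in m1.named_parameters():
        if not torch.allclose(p1, m2.get_parameter(name)):
            print("non-deterministic parameter:", name)
            ok = False
    return ok

# e.g. the alternative suggested above
print(loads_deterministically("indolem/indobert-base-uncased"))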

AngledLuffa commented 7 months ago

Is this addressed?

The next version will include finetuning code and PEFT finetuning for several different annotators, so if you want to use the Cahya transformer instead of IndoBert, it will work as long as you do that finetuning.

Perhaps there could be some automatic finetuning / saving of these unpopulated tensors when training, even if finetuning is off, but that is a larger project and probably a rather thankless one considering there are multiple other solutions.