ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

BERT-Encoder: Incorrect Jaccard calculation and differences from v0.2.1 to v0.3 #1002

Closed donfour10 closed 3 weeks ago

donfour10 commented 4 years ago

Describe the bug After the update to Ludwig version 0.3 I'm struggling to define a model similar to the one I had in version 0.2.1. I am trying to do multi-label classification on text with the BERT encoder.

Additionally, I think that the Jaccard calculation in Ludwig is incorrect. When I calculate it manually I get much smaller numbers than the ones reported by Ludwig. (We rebuilt from GitHub after seeing https://github.com/uber/ludwig/issues/973 hoping the issue would resolve, but it seemingly did not.)
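For reference, this is roughly how I compute the Jaccard score by hand (a minimal sketch; the two example lists at the bottom are made up):

```python
# Minimal manual Jaccard similarity for multi-label (set) outputs.
# y_true / y_pred are hypothetical placeholders: one set of labels per example.
def mean_jaccard(y_true, y_pred):
    scores = []
    for true_set, pred_set in zip(y_true, y_pred):
        union = true_set | pred_set
        # Convention: two empty sets count as a perfect match.
        scores.append(len(true_set & pred_set) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

print(mean_jaccard([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}]))  # 0.5
```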

To Reproduce First, the model definition I used in version 0.2.1:

```python
model_definition = {
    'input_features': [
        {'name': 'translated_text',
         'type': 'text',
         'encoder': 'bert',
         'config_path': '/analyst/ludwig_experiments/uncased_L-12_H-768_A-12 (1)/bert_config.json',
         'checkpoint_path': '/analyst/ludwig_experiments/uncased_L-12_H-768_A-12 (1)/bert_model.ckpt',
         'preprocessing': {
             'word_tokenizer': 'bert',
             'word_vocab_file': '/analyst/ludwig_experiments/uncased_L-12_H-768_A-12 (1)/vocab.txt',
             'padding_symbol': '[PAD]',
             'unknown_symbol': '[UNK]'
         }}
    ],
    'output_features': [
        {'name': 'target_category',
         'type': 'set',
         'threshold': 0.30}
    ],
    'training': {'batch_size': 8, 'learning_rate': 0.00002}
}
```
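For completeness, training was invoked through the Python API roughly like this (a sketch of the v0.2.1 API; `df` is a hypothetical pandas DataFrame with the two columns above):

```python
from ludwig.api import LudwigModel

# Hedged sketch of the v0.2.1 Python API; `df` is a hypothetical DataFrame
# with 'translated_text' and 'target_category' columns.
model = LudwigModel(model_definition)
train_stats = model.train(data_df=df)
```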

I tried several different things to reproduce this model in version 0.3. The big problem is that I don't know how to use checkpoint_path. Even the config path throws an error when I use it as 'pretrained_model_name_or_path' (to fix that, I added "model_type": "bert" to the config.json of the downloaded pretrained model).
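Concretely, that workaround looks roughly like this (a sketch; the path is the one from the v0.2.1 definition above):

```python
import json

# Sketch: add the "model_type" key so transformers can identify the
# architecture when loading this config file.
path = '/analyst/ludwig_experiments/uncased_L-12_H-768_A-12 (1)/bert_config.json'
with open(path) as f:
    cfg = json.load(f)
cfg['model_type'] = 'bert'
with open(path, 'w') as f:
    json.dump(cfg, f, indent=2)
```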

Model definition in version 0.3 (I commented out some lines because I tried it with and without them, but nothing really changed):

```python
model_definition = {
    'input_features': [
        {'name': 'translated_text',
         'type': 'text',
         'encoder': 'bert',
         'pretrained_model_name_or_path': 'bert-large-uncased',
         'trainable': True,
         'preprocessing': {
             # 'word_vocab_file': '/analyst/ludwig_experiments/uncased_L-12_H-768_A-12 (1)/vocab.txt',
             'word_tokenizer': 'bert',
             'padding_symbol': '[PAD]',
             'unknown_symbol': '[UNK]'
         }}
    ],
    'output_features': [
        {'name': 'target_category',
         'type': 'set',
         'threshold': 0.30}
    ],
    'training': {'batch_size': 8, 'learning_rate': 0.00002}
}
```
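For reference, in v0.3 the training call changed as well (a sketch, assuming the config above and a CSV dataset):

```python
from ludwig.api import LudwigModel

# Hedged sketch of the v0.3 Python API: the definition is passed as `config`
# and train() takes a `dataset` path or DataFrame.
model = LudwigModel(config=model_definition)
results = model.train(dataset='train.csv')  # hypothetical path; return shape may vary by version
```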

Expected behavior A result similar to the one produced in v0.2.1. Instead, the F1 score has fallen from 48 to 11. The Jaccard similarity increased from 32 to 52, but that's wrong: when I calculated it manually, it was 8 instead of 52!

Environment (please complete the following information):

Additional context So I have several questions or assumptions:

- Maybe the model validates on the incorrect Jaccard numbers? (I don't think so, but it would explain why the actual result can't get close to my v0.2.1 result.)
- How can I use my downloaded model from https://github.com/google-research/bert without using the HuggingFace path and model? (Even if it should be the same: I can point to the config.json (but only after adding something) and the vocab.txt, but not to the .ckpt file.) Is that even possible?

I would be very thankful for some hints on how I can carry over my configuration from v0.2.1 (TF1) to v0.3 (TF2). I am definitely a little lost here and don't want to downgrade Ludwig and work with an old version going forward.

Example from training_statistics.json v0.3: [image] The Jaccard value is stuck directly after the first epoch, even though the number is not correct.

Old training_statistics.json v0.2.1: [image] Here the Jaccard values were correct and the training went through properly.

w4nderlust commented 4 years ago

@donfour10 we are going to release a v0.3.1 that fixes some of those issues. In particular, we already fixed the calculation of the Jaccard score, and we improved tokenization and default parameters for the text encoders. You can try it already by installing from master. Regarding the changes: we now only use the HuggingFace models. You don't need to specify config_path, tokenizer, vocabulary etc., it's all done automatically under the hood. Let me know if with the code on master you can actually reproduce your previous results.
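If you still want to start from the checkpoint downloaded from google-research/bert, one possible route (a sketch, not verified against this exact setup) is to convert the TF1 checkpoint into a local HuggingFace directory first and point pretrained_model_name_or_path at it:

```python
from transformers import (BertConfig, BertForPreTraining, BertTokenizer,
                          load_tf_weights_in_bert)

# Sketch: convert a google-research/bert TF1 checkpoint into a local
# HuggingFace directory. Paths are the ones from the v0.2.1 config;
# TensorFlow must be installed to read the .ckpt file.
base = '/analyst/ludwig_experiments/uncased_L-12_H-768_A-12 (1)'
config = BertConfig.from_json_file(f'{base}/bert_config.json')
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, f'{base}/bert_model.ckpt')
model.save_pretrained('converted_bert')  # writes weights + config
BertTokenizer(vocab_file=f'{base}/vocab.txt').save_pretrained('converted_bert')
# Then in the Ludwig config: 'pretrained_model_name_or_path': 'converted_bert'
```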

donfour10 commented 3 years ago

We actually rebuilt from master about a week ago. Just to be sure, we rebuilt again and are trying to verify; I can probably give you additional results in 2 days.

donfour10 commented 3 years ago

So I started a first run with the variable trainable: True in the model_definition and it produces exactly the same stats as before with the same data after a few epochs.

I will set all variables to their defaults in the model_definition to start a new experiment and will report back when I have the results.

If you have any ideas on how I can solve this problem, I would be very interested.

w4nderlust commented 3 years ago

@donfour10 yes, trainable: True by default is one of the changes we made; glad it solved the issue. Now that you are getting the correct performance again, what is the problem you still want to solve?

donfour10 commented 3 years ago

Sorry, I think I was not clear. The results are still very different from (worse than) those with the previous streamlit/tf version.

They are similar to the results I shared in this issue.

w4nderlust commented 3 years ago

Got it. One thing that may explain the difference is preprocessing. Do you have a cached hdf5 from previous trainings? If so, remove it and make Ludwig perform preprocessing again, because the tokenizers used are different and that can make quite a big difference.

As a side note, we are adding a feature that checks whether the hdf5 cache files are "fresh" to avoid these kinds of problems in the future: https://github.com/uber/ludwig/pull/1006

donfour10 commented 3 years ago

Sorry for the late answer.

I'm relatively sure that I don't have a cached hdf5 because it defaults to false, but I don't know how to verify that. Is there a command I can use to check it?

w4nderlust commented 3 years ago

Cached hdf5 and meta.json files are created in the same directory as the original dataset, with the same name, so you have to check there.
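Something like this can find and remove them (a sketch; the dataset path is hypothetical, and the exact cache naming may differ by version):

```python
from pathlib import Path

# Hypothetical dataset path; the cache files sit next to it with the same stem.
dataset = Path('/analyst/ludwig_experiments/train.csv')
for suffix in ('.hdf5', '.meta.json'):
    cache = dataset.with_suffix(suffix)
    if cache.exists():
        print(f'removing stale cache: {cache}')
        cache.unlink()  # forces Ludwig to re-run preprocessing next time
```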

In the meantime we fixed some issues with learning rate scheduling and released v0.3.1. I suggest you update to that version before training again.

donfour10 commented 3 years ago

Sorry to report back that the issue remains the same. I removed all hdf5 files that were in the directory. We have also checked/tried various standard parameters that could somehow be different. If you have any other ideas, we would be very keen to try them out. I can also share a sample of the data/model with you, if you think it might help.

w4nderlust commented 3 years ago

This is very surprising; all other models I tested with the new encoders worked as well as before. Yes, sharing something, even privately, would be great and would help me figure out what the issue is. Feel free to reach out privately.