ldong1111 / GraphemeBERT

This is the source code of the paper "Neural grapheme-to-phoneme conversion with pretrained grapheme models".

Model usage #2

Closed ghost closed 1 year ago

ghost commented 1 year ago

Hello! I recently came across this work and I am curious about how to use the model. Could you give a demo for the English language? Thank you!

fake-warrior8 commented 1 year ago

  1. You can first pre-train an English GBERT using the following command:
python monolingual_GBERT_pretrain/GBERT_pretraining.py --pretraining_language eng

This produces a vocabulary and a GBERT checkpoint. The tokenizer is built with the torchtext tokenizer; you can refer to the code in GBERT_pretraining.py to use GBERT for other purposes (a hedged loading sketch is given after this list).

  2. You can then train a GBERT-based G2P model (such as the GBERT-finetuning model or the GBERT-attention model) using the pre-trained GBERT checkpoint. The command is:
python monolingual_G2P_model/GBERT_finetuning.py --hyperparameter

The hyperparameters are given in the Implementation Details section of the Readme.md file. You can use the given hyperparameters to reproduce our experimental results, or tune them for your own G2P dataset.
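For the "other usages" mentioned in step 1, a minimal loading sketch could look like the following. Note that the class name `GBERT`, its constructor arguments, and the artifact file names are placeholders rather than the exact names in the repository; take the real definitions from GBERT_pretraining.py (and copy the class definition into your own script, or guard the training code there, so that importing it does not re-run pre-training).

```python
import torch

# Placeholder import: the actual class name and module layout must be taken from
# GBERT_pretraining.py. Copying the class definition into your own script avoids
# triggering the pre-training code at import time.
from monolingual_GBERT_pretrain.GBERT_pretraining import GBERT  # assumed class name

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed artifact paths: pre-training saves a grapheme vocabulary and a checkpoint.
vocab = torch.load("grapheme_vocab_eng.pt")                       # torchtext vocabulary
gbert = GBERT(input_dim=len(vocab), hid_dim=128, device=device)   # assumed signature
gbert.load_state_dict(torch.load("gbert_eng_checkpoint.pt", map_location=device))
gbert.eval()  # the frozen GBERT can now be used as a grapheme encoder
```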

ghost commented 1 year ago

I understood your pre-training and fine-tuning recipes; they are well-structured and concise. Using them, I successfully created the PyTorch model for English. The problem I am facing now is inference. As I am new to the field, I took help from articles on how to load a PyTorch model for inference (https://stackoverflow.com/questions/49941426/attributeerror-collections-ordereddict-object-has-no-attribute-eval).

As you can see in that answer, inference requires the class that defines the structure of the model. But your fine-tuning training file contains many classes. I chose Seq2Seq(), but it gives the following error:

```
Traceback (most recent call last):
  File "model_test.py", line 10, in <module>
    model = Seq2Seq()
TypeError: __init__() missing 4 required positional arguments: 'decoder', 'trg_pad_idx', 'output_dim', and 'device'
```

This is the code I have written for inference:

```python
import torch
from monolingual_G2P_model.GBERT_finetuning import Seq2Seq

PATH = './monolingual_medium_resource/torch_models/g2p-model_eng_without_g2p_dev_and_test_word_Transformer_with_BERT_pretrain_Encoder_batch_size256_gelu_hid_dim128_new_pretrain.pt'

model = Seq2Seq()
model.load_state_dict(torch.load(PATH))
model.eval()
```

Moreover, when I import the GBERT_finetuning file, it begins to train the model again, because the training code is at module level rather than inside a function. My ultimate use case is to generate a phonetic expansion for any given input word using the model I have trained. Can you provide some suggestions?

fake-warrior8 commented 1 year ago

  1. To load the Seq2Seq model, you need to copy the model-construction part of the code into your script, so that the model is initialized with the required arguments before you call load_state_dict (a sketch is given after this list).
  2. To import the GBERT_finetuning module without re-training, you need to reorganize the code of GBERT_finetuning.py: put the preprocessing and checkpoint-loading code into an init method, and put the inference code into a function. Then you can import the module for inference. Our code is mainly intended for academic use rather than a real application scenario, which is why it is not organized this way.
  3. One piece of advice for phonetic expansion with our model: pre-train GBERT on a larger dictionary, or train the GBERT-finetuning model on a larger G2P dataset, so as to improve open-domain G2P performance. Since our model is only pre-trained on a ~10k-word dictionary and trained on an 8k-word G2P dataset, you may find that it does not work well for arbitrary input words.
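To make points 1 and 2 concrete, here is a minimal standalone-inference sketch. It assumes the training loop in GBERT_finetuning.py has been guarded (e.g. under `if __name__ == "__main__":`) and that the module exposes the `Decoder` and `Seq2Seq` classes and the `SRC`/`TRG` vocabularies at module level; all constructor arguments, special-token names, and the forward signature below are assumptions, so copy the real values from the training script.

```python
import torch

# Assumed imports: the real class names, vocab objects, and constructor signatures
# must be copied from GBERT_finetuning.py after its training code has been guarded.
from monolingual_G2P_model.GBERT_finetuning import Decoder, Seq2Seq, SRC, TRG

PATH = "./monolingual_medium_resource/torch_models/your_checkpoint.pt"  # your .pt file
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Rebuild the model with the SAME arguments used during training. The four arguments
# below mirror the TypeError in this thread; the values here are assumed examples.
OUTPUT_DIM = len(TRG.vocab)
TRG_PAD_IDX = TRG.vocab.stoi["<pad>"]
HID_DIM = 128
dec = Decoder(OUTPUT_DIM, HID_DIM, device)                       # assumed signature
model = Seq2Seq(dec, TRG_PAD_IDX, OUTPUT_DIM, device).to(device)

model.load_state_dict(torch.load(PATH, map_location=device))
model.eval()

def g2p(word, max_len=50):
    """Greedy decoding sketch: grapheme string in, phoneme sequence out."""
    src_ids = [SRC.vocab.stoi[ch] for ch in word.lower()]
    src = torch.LongTensor(src_ids).unsqueeze(0).to(device)      # [1, src_len]
    trg_ids = [TRG.vocab.stoi["<sos>"]]
    with torch.no_grad():
        for _ in range(max_len):
            trg = torch.LongTensor(trg_ids).unsqueeze(0).to(device)
            output = model(src, trg)  # forward signature assumed; unpack if it returns a tuple
            next_id = output[0, -1].argmax().item()
            if next_id == TRG.vocab.stoi["<eos>"]:
                break
            trg_ids.append(next_id)
    return [TRG.vocab.itos[i] for i in trg_ids[1:]]

print(g2p("hello"))  # should print the predicted phoneme sequence for "hello"
```
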
ghost commented 1 year ago

This is really helpful. Thanks a lot :)