Pretrained punctuation model produces mangled output #4

Closed adam-faulkner closed 4 years ago

adam-faulkner commented 4 years ago

Hi, when running the pretrained biLSTM for punctuation-restoration, I get the following output from inferer.infer_sentence("hi how are you"):

Predicted Label: .hI. How.

and inferer.infer_from_file("./data/input.txt", out_file="./data/output.txt") produces the following output.txt:

hi how are you,.hI. How.
i am fine thanks,"I, am fine thanks."

Does the pretrained model not work?

Tested on both OSX and Linux

plkmo commented 4 years ago

What params did you use? With the defaults, I have tested on my side and it's fine. I have re-uploaded the files to be sure.

v-iashin commented 4 years ago

I tested it today on my setup and can confirm the same results as @adam-faulkner, using the following script:

from nlptoolkit.utils.config import Config
from nlptoolkit.punctuation_restoration.trainer import train_and_fit
from nlptoolkit.punctuation_restoration.infer import infer_from_trained

config = Config(task='punctuation_restoration') # loads default argument parameters as above
config.data_path = "./data/train.tags.en-fr.en" # sets training data path
config.batch_size = 32 # sets batch size
config.lr = 5e-5 # change learning rate (attribute name lr is assumed)
config.model_no = 1 # sets model to PuncLSTM
inferer = infer_from_trained(config)
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")

plkmo commented 4 years ago

Thanks for the detailed info. I have identified the problem now. It's with Config(task='punctuation_restoration'): the max encoder and decoder lengths defaulted to 200, but they should be set to 80 for the pretrained PuncLSTM to work well. I have fixed this with an update that sets them to 80 by default.
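
For anyone on an older checkout, the same effect can probably be had by overriding the two lengths before calling infer_from_trained(config). The following is only a minimal sketch, not the library's code: a SimpleNamespace stands in for Config, and the attribute names max_encoder_len and max_decoder_len, as well as the helper align_lengths, are assumptions made for illustration. Check the Config class in your installed version for the real names.

```python
from types import SimpleNamespace

# Length limit the pretrained PuncLSTM expects, per the comment above.
PRETRAINED_MAX_LEN = 80


def align_lengths(config, max_len=PRETRAINED_MAX_LEN):
    """Set both sequence-length limits on a config-like object and return it.

    max_encoder_len / max_decoder_len are assumed attribute names,
    not confirmed against the library.
    """
    config.max_encoder_len = max_len
    config.max_decoder_len = max_len
    return config


# Stand-in for Config(task='punctuation_restoration') with the old defaults of 200.
config = SimpleNamespace(max_encoder_len=200, max_decoder_len=200)
config = align_lengths(config)
print(config.max_encoder_len, config.max_decoder_len)  # prints: 80 80
```

With a real Config object, the two assignments would go right before inferer = infer_from_trained(config), so the model is built with the lengths it was trained on.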

v-iashin commented 4 years ago

I tried it and it worked for the sample data as expected.

adam-faulkner commented 4 years ago

Works as expected now, thanks for the fix!