ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Cannot replicate paper results #44

Open Davidobot opened 5 years ago

Davidobot commented 5 years ago

I downloaded the pre-trained INTERSPEECH-T-BRNN.pcl model from here and ran it on the TED talk data, but it does not yield the reported overall F1 score of 63.1. Is there a reason for this, or am I doing something wrong with the training data?

Below are the metrics obtained from error_calculator.py:

Punctuation     P      R      F1
COMMA           43.9   56.0   49.2
PERIOD          61.4   70.4   65.6
QUESTIONMARK    50.0   62.2   55.4
OVERALL         52.4   63.5   57.4
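
As a quick sanity check on the numbers above, the F1 column can be recomputed from the P and R columns as their harmonic mean. This is a minimal sketch and not code from error_calculator.py; the helper name `f1` and the `reported` dictionary are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F1) triples copied from the table in this issue.
reported = {
    "COMMA": (43.9, 56.0, 49.2),
    "PERIOD": (61.4, 70.4, 65.6),
    "QUESTIONMARK": (50.0, 62.2, 55.4),
    "OVERALL": (52.4, 63.5, 57.4),
}

for label, (p, r, f) in reported.items():
    print(f"{label:<13} recomputed F1 = {f1(p, r):.1f}, reported F1 = {f:.1f}")
```

The recomputed values agree with the reported column to one decimal place, so the table is internally consistent and the gap to the paper's 63.1 lies in the precision and recall themselves.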

Here is a sample of the test data:

i'm a savant ,COMMA or more precisely ,COMMA a high-functioning autistic savant .PERIOD it's a rare condition .PERIOD and rarer still when accompanied ,COMMA as in my case ,COMMA by self-awareness and a mastery of language .PERIOD very often when i meet someone and they learn this about me there's a certain kind of awkwardness .PERIOD i can see it in their eyes .PERIOD they want to ask me something .PERIOD and in the end ,COMMA quite often ,COMMA the urge is stronger than they are and they blurt it out ,COMMA if i give you my date of birth ,COMMA can you tell me what day of the week i was born on ?QUESTIONMARK or they mention cube roots or ask me to recite a long number or long text .PERIOD

And here is a sample of the output:

i'm a savant or more precisely a high-functioning autistic savant it's ,COMMA a rare condition and rarer still when accompanied as in my case by self-awareness and a mastery of language ,COMMA very often when i meet someone and they learn this about me ,COMMA there's a certain kind of awkwardness .PERIOD i can see it in their eyes .PERIOD they want to ask me something .PERIOD and in the end ,COMMA quite often ,COMMA the urge is stronger than they are .PERIOD and they blurt it out .PERIOD if i give you my date of birth ,COMMA can you tell me what day of the week i was born on or they mention cube roots or ask me to recite a long number or long text ?QUESTIONMARK
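
To make the comparison between the two samples concrete, the sketch below counts per-punctuation true positives, false positives and false negatives for text in the format shown above. It is an illustration only and is not taken from error_calculator.py. The function names `slots` and `count_errors` are hypothetical, and the scoring convention (a wrong symbol in a slot counts as a false positive for the predicted symbol and a false negative for the expected one) is an assumption.

```python
from collections import Counter

# Punctuation tokens as they appear in the samples in this issue.
PUNCT = {",COMMA", ".PERIOD", "?QUESTIONMARK"}


def slots(text):
    """Return (word, label) pairs; label is the punctuation token
    that directly follows the word, or None if there is none."""
    pairs = []
    for tok in text.split():
        if tok in PUNCT:
            # Attach the token to the preceding word (first token wins).
            if pairs and pairs[-1][1] is None:
                pairs[-1] = (pairs[-1][0], tok)
        else:
            pairs.append((tok, None))
    return pairs


def count_errors(reference, hypothesis):
    """Count tp/fp/fn per punctuation token (assumed convention, see lead-in)."""
    ref, hyp = slots(reference), slots(hypothesis)
    if [w for w, _ in ref] != [w for w, _ in hyp]:
        raise ValueError("reference and hypothesis contain different words")
    counts = {p: Counter() for p in PUNCT}
    for (_, r), (_, h) in zip(ref, hyp):
        if h is not None:
            counts[h]["tp" if h == r else "fp"] += 1
        if r is not None and r != h:
            counts[r]["fn"] += 1
    return counts


# A short fragment taken from the two samples above.
ref = "it's a rare condition .PERIOD and rarer still"
hyp = "it's ,COMMA a rare condition and rarer still"
for label, c in sorted(count_errors(ref, hyp).items()):
    print(label, dict(c))
```

Under this convention, precision for a symbol would be tp / (tp + fp) and recall tp / (tp + fn). On the fragment, the spurious comma after "it's" is a false positive and the missing period after "condition" is a false negative.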

As a side note, attempting to run INTERSPEECH-T-BRNN-pre.pcl fails with ValueError: cannot reshape array of size 10000 into shape (200,256). This error does not occur when running INTERSPEECH-T-BRNN.pcl or Demo-Europarl-EN.pcl.

ottokart commented 5 years ago

Hi,

Are you using the v1.0 release (https://github.com/ottokart/punctuator2/releases/tag/v1.0) or the latest version?

Davidobot commented 5 years ago

I'm using the latest version, pulled from GitHub directly.