prosodylab / Prosodylab-Aligner

Python interface for forced audio alignment using HTK and SoX
http://prosodylab.org/tools/aligner/
MIT License

Formatting error in dictionary #72

Closed bwang482 closed 4 years ago

bwang482 commented 6 years ago

I have used the suggested commands below for dealing with the OOV issue:

$ ./sort.py eng.dict OOV.txt > tmp; 
$ mv tmp eng.dict

However, I am getting the error below: Formatting error in dictionary '/Users/bowang/Tools/Prosodylab-Aligner/eng.dict' (ln. 1).

kylebgorman commented 6 years ago

Presumably there is a formatting error in your OOV.txt. If you'd like us to take a look, please post it somewhere so we can replicate the issue.

bwang482 commented 6 years ago

Thanks @kylebgorman

Here is the Dropbox link to the OOV.txt and eng.dict:

https://www.dropbox.com/s/a7rro8is1tw774h/OOV.txt?dl=0 https://www.dropbox.com/s/yyut5jlfb4f3ev4/eng.dict?dl=0

kylebgorman commented 6 years ago

The OOV file suggests you haven't actually tokenized or case-folded the data as the aligner expects. You need to make sure you're removing punctuation marks and ignoring case in your lab files. For instance, for the first sentence in this message you would want the label file to read:

THE OOV FILE LOOKS LIKE YOU HAVEN'T ACTUALLY TOKENIZED OR CASE FOLDED THE DATA AS IT EXPECTS
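
A minimal sketch of that normalization in Python, for illustration only. The function name `normalize_label` is hypothetical and not part of Prosodylab-Aligner, and the sketch assumes the only requirements are the ones stated above: upper-case everything and drop punctuation, keeping apostrophes. Digits are passed through unchanged, so they would still need dictionary entries.

```python
import re


def normalize_label(text: str) -> str:
    """Upper-case text and replace punctuation (except apostrophes) with spaces.

    Hypothetical helper, not part of Prosodylab-Aligner.
    """
    upper = text.upper()
    # Anything that is not A-Z, 0-9, an apostrophe, or a space becomes a space,
    # so "case-folded" turns into the two tokens "CASE FOLDED".
    cleaned = re.sub(r"[^A-Z0-9' ]+", " ", upper)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())


sentence = (
    "The OOV file looks like you haven't actually tokenized "
    "or case-folded the data as it expects."
)
print(normalize_label(sentence))
```

Run on the sentence above, this prints the single label line shown in the example.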

bwang482 commented 6 years ago

@kylebgorman Thanks! I have tokenized and upper-cased the text and removed punctuation (except apostrophes). I still have a few OOV words, including numbers (even "IT'S" is an OOV). I followed the provided steps again to add them to the lang.dict.

It gives the same formatting error for the lang.dict. Is the issue that the provided way of editing the lang.dict file somehow changes its format?

kylebgorman commented 6 years ago

You need to ensure that the line you add has the same formatting as other lines. I'm not sure what else to tell you. I do this simply by typing in my preferred text editor, correcting any errors manually. Is the expected format clear?

If your text editor makes this hard to do, try another one.
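
As a rough way to find an offending line, here is a sketch in Python. It assumes, and this is only an assumption since the thread does not spell out the format, that every dictionary line is a word followed by at least one whitespace-separated phone. The function name `find_malformed_lines` and the sample entries are hypothetical.

```python
def find_malformed_lines(lines):
    """Return (line_number, text) pairs for lines with fewer than two fields.

    Assumes each valid entry looks like "WORD PHONE [PHONE ...]".
    Hypothetical helper, not part of Prosodylab-Aligner.
    """
    bad = []
    for number, line in enumerate(lines, 1):
        stripped = line.strip()
        # A bare word with no pronunciation (or a blank line) has < 2 fields.
        if len(stripped.split()) < 2:
            bad.append((number, stripped))
    return bad


# Illustrative entries only; the second line has a word but no pronunciation.
sample = ["IT'S IH1 T S\n", "12345\n", "THE DH AH0\n"]
for number, text in find_malformed_lines(sample):
    print(f"ln. {number}: {text!r}")
```

To check a real file, the same function can be called on an open file handle, for example `find_malformed_lines(open("eng.dict", encoding="utf-8"))`.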

bwang482 commented 6 years ago

@kylebgorman Thanks. Do you edit the lang.dict file by simply adding the OOV words on the top, and one OOV word per line?

I have tried the following:

1) The provided code for dealing with OOV words by editing the dict file gives a formatting error.
2) I have tried editing the dict file using texteditor and sublime (adding OOV words on top). Again I am getting a formatting error.

I do appreciate your help @kylebgorman. But I believe I am editing the lang.dict file the wrong way here.

kylebgorman commented 6 years ago

In both cases you have to make sure to also sort the dictionary as described in the README.
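
For illustration, merging new entries into a dictionary and re-sorting could be sketched in Python as below. This is not the repository's `sort.py`. The function name `merge_and_sort` is hypothetical, and the sketch assumes that plain string ordering with duplicates removed is acceptable, which may not match what the aligner requires.

```python
def merge_and_sort(*sources):
    """Merge dictionary lines from several iterables, dedupe, and sort.

    Hypothetical helper; uses Python's default string ordering, which is
    an assumption about what the aligner accepts.
    """
    entries = set()
    for source in sources:
        for line in source:
            line = line.strip()
            if line:  # ignore blank lines
                entries.add(line)
    return sorted(entries)


# Illustrative entries only: a new entry typed at the top of the file
# ends up in its sorted position after merging.
existing = ["A EY1\n", "THE DH AH0\n"]
new_entries = ["IT'S IH1 T S\n", "A EY1\n"]
for entry in merge_and_sort(existing, new_entries):
    print(entry)
```

Note that every new entry still needs a pronunciation after the word; sorting alone does not add one.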

kylebgorman commented 4 years ago

Closing for inactivity.