<s> and end sentence tags in lattice decoding

senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano

Apache License 2.0


<s> and end sentence tags in lattice decoding #43

Closed adelra closed 5 years ago

adelra commented 6 years ago

When decoding lattices, theanolm adds <s> and !SENT_END </s> to the sentences in the lattices. To compute WER I have to remove these tags. Where are these tags set? Are they even useful? I presume they should be removed.

senarvi commented 6 years ago

Apparently GitHub has interpreted the tags as markup. Do you mean <s> and </s>?

adelra commented 6 years ago

I edited my comment.

senarvi commented 6 years ago

I think <s> and </s> are added automatically at the beginning and end of the lattice. I haven't seen !SENT_END </s> before. Different tools add slightly different tokens to the lattice for various reasons, and these are not necessarily useful for TheanoLM. Which tool did you use to generate the lattice? I would guess there's a !SENT_END node at the end of the lattice. Anyway, I'm pretty sure you can just remove them. By the way, did you check what e.g. SRILM produces when decoding the lattice?

adelra commented 6 years ago

Oh, then it's from Kaldi, because my lattices were generated by Kaldi. I will check Kaldi's lattices in detail to see how to remove them.

On another topic: wouldn't it be useful if we could somehow port Kaldi's compute-wer into TheanoLM to call it directly?

[kaldi-trunk]/src/bin/compute-wer

senarvi commented 6 years ago

But you first converted the lattices to SLF, right? I'm not suggesting removing the !SENT_END nodes from the lattices, but rather removing them from the decoded text. Did you know that you can also decode Kaldi lattices directly?
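For reference, removing the tokens from the decoded text could look roughly like the sketch below. The helper name `strip_special_tokens` is hypothetical and not part of TheanoLM, and it assumes that one decoded sentence is one whitespace-separated line and that every token starting with `!` is a special lattice token.

```python
# Hypothetical post-processing helper, not part of TheanoLM.
# Assumes each decoded sentence is a single whitespace-separated line.
SENTENCE_TAGS = {"<s>", "</s>"}


def strip_special_tokens(line):
    """Drop <s>, </s> and any token starting with '!' (e.g. !SENT_END)."""
    kept = [
        word
        for word in line.split()
        if word not in SENTENCE_TAGS and not word.startswith("!")
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(strip_special_tokens("<s> hello world !SENT_END </s>"))
```

If the decoded lines start with an utterance ID, it passes through unchanged as long as it does not start with `!`.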

Adding a simple tool for computing the WER shouldn't be too difficult, but you'll probably get more features by using something like SCLITE.
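To give an idea of what such a simple tool would involve, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. The function names `word_errors` and `wer` are made up for illustration; this is neither TheanoLM nor Kaldi code, and it does none of the alignment reporting or text normalization that SCLITE offers.

```python
# Illustrative sketch only; function names are hypothetical.
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1], len(ref)


def wer(pairs):
    """Corpus-level WER over (reference, hypothesis) string pairs."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        sentence_errors, sentence_words = word_errors(reference, hypothesis)
        errors += sentence_errors
        words += sentence_words
    if words == 0:
        raise ValueError("the references contain no words")
    return errors / words
```

Errors and reference words are summed over all sentences before dividing, so long sentences weigh more than short ones.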

adelra commented 6 years ago

Yes, I was actually decoding Kaldi lattices directly with theanolm. Probably the Kaldi lattice format has changed, or something like that causes those tags to be inserted.

senarvi commented 6 years ago

Ok, I can't really remember but I'd be curious to see one such lattice.

adelra commented 6 years ago

You could take a look at the IAM recipe in Kaldi (egs/iam/v1/). You have to run it first.

If you were not able to run/find it, let me know, I'll send you a copy.

senarvi commented 5 years ago

@adelra sorry I forgot this issue. I don't have Kaldi installed. It would be easiest if you could send me the lattice and the Kaldi vocabulary. I should be able to reproduce the problem with any model, right?

adelra commented 5 years ago

Sure, I can send you the lattices, but I think you'd have to do some pre-processing beforehand, for instance converting the lattices to text mode. Since you mentioned that you don't have Kaldi installed, it's probably better if I convert them and send them all to you, right? However, I'm busy with my thesis these days, so I can send them sometime after this Wednesday.

senarvi commented 5 years ago

Yes, I would need the text format FST and vocabulary file. Thanks a lot, no hurry with this.

senarvi commented 5 years ago

Never mind. I was able to reproduce the problem with my lattice file. I'll look into this.

senarvi commented 5 years ago

The !SENT_END token came from the KaldiLattice class. I'm not sure why the final transition has been given this token, but it has to be related to the fact that some SLF lattices end in this token.

I changed it to None, meaning that the final transition doesn't produce any word. I'm pretty sure it's correct now, because we map all the special tokens (those that start with !) in an SLF lattice to None.
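The idea can be illustrated with a small sketch. This is not the actual TheanoLM code: the `Link` class and the `link_word` and `path_words` functions are hypothetical, and `!NULL` is used only as an example of another special label.

```python
# Illustration only; Link, link_word and path_words are hypothetical names
# and do not reflect the real TheanoLM lattice classes.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    label: Optional[str]  # raw label read from the lattice, None if absent


def link_word(link):
    """Map special labels (those starting with '!') to None."""
    if link.label is None or link.label.startswith("!"):
        return None
    return link.label


def path_words(path: List[Link]):
    """Collect the words along a path, skipping links that emit no word."""
    words = []
    for link in path:
        word = link_word(link)
        if word is not None:
            words.append(word)
    return words
```

With this convention a final transition whose word is None adds nothing to the decoded sentence, which is the behaviour described above.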

Thanks for reporting this problem @adelra. Would you mind pulling the latest changes and testing lattice decoding / rescoring with your data? I'd like to be certain that I didn't introduce new problems, since I'm not exactly sure what the original idea behind !SENT_END was. :)