hlt-mt / FBK-fairseq

Repository containing the open source code of works published at the FBK MT unit.

train about task speech_to_text_tagged #2

Closed Crabbit-F closed 12 months ago

Crabbit-F commented 1 year ago

Using the command in the README, I can't get the same BLEU on the speech_to_text_tagged task, even though I added the NER tags to the dictionary. Is my dictionary or my data preprocessing wrong? Could you share the dictionary for the MuST-C dataset? Thanks a lot.

mgaido91 commented 1 year ago

I am not sure what you are doing and what problem you are experiencing. Btw, here you can find the dictionaries for en-es:

Crabbit-F commented 1 year ago

Thanks a lot 😭

mgaido91 commented 1 year ago

no problem, let me know if you have more issues or need any help.

mgaido91 commented 1 year ago

I am closing it for now, feel free to reopen if you need anything else. Thanks.

Crabbit-F commented 1 year ago

Using the command in the README and the dictionaries you uploaded, I can't get the BLEU score on the speech_to_text_tagged task for en-es. [WeChat screenshot 20230711015956]

And this is my training script:

python train.py datasetdir \
    --train-subset train_st --valid-subset dev_st \
    --save-dir datasetdir \
    --num-workers 2 --max-update 100000 \
    --max-tokens 15000 \
    --user-dir examples/speech_to_text \
    --task speech_to_text_ctc_tagged --config-yaml config_st.yaml \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_with_tags --label-smoothing 0.1 --tags-loss-weight 1.0 \
    --arch conformer_with_tags \
    --ctc-encoder-layer 4 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 9 --update-freq 8 --patience 5 --keep-last-epochs 7 \
    --skip-invalid-size-inputs-valid-test --find-unused-parameters

Hope to get your help, thank you very much🙏

mgaido91 commented 1 year ago

Well, it seems your model is not working at all: BLEU is nearly 0.0. My best guess is that you have some problem with either the training or the inference data. Can you send me the logs of your training and the full generate output? You can also send them to me via email (you can find my email in the paper). Also, please check and send me your config_st.yaml. Thanks.

Crabbit-F commented 1 year ago

Thanks for your suggestion. Here are the logs of my training, my config_st.yaml, and the full generate output. Thanks a lot! Attachments: config_st.txt, generate-tst-COMMON_st.txt, [WeChat screenshot 20230711020945]

mgaido91 commented 1 year ago

Well, there is definitely something wrong in your training data. The ctc_loss is 0, which is weird, and the ppl on the dev set is very high. In the generate output the model always repeats the same things, and the loss on the training set is also very high. I can send you my logs, but the main problem is definitely your training data: please check it, and maybe try to regenerate it. The other weird thing is the 0 ctc_loss; I am not sure why you have that. I also do not understand where the tags_loss you have in your logs comes from. If you have changed the code, be careful that you have not introduced issues, e.g. in the collater, which may also be the cause of your problem.

Crabbit-F commented 1 year ago

Thanks for discovering my problem. I haven't changed the code. My dataset is MuST-C en-es and the training command comes from "fbk_works/JOINT_ST_NER2023.md". Is there something wrong with my training command? I'll double-check my preprocessing and training data. Thanks again!

mgaido91 commented 1 year ago

I see, I think there is nothing wrong with your training command, but if you send me the full log of your training I can double check. The problem is that your ppl is very high, which means that the training is not working, so I am confident that it is a data issue.

Check your TSV, mine looks like:

id  audio   n_frames    src_text    tgt_text    speaker
ted_1_0 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:54923124817:921088 2878    And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful. I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night. Muchas gracias <PERSON>Chris</PERSON>. Y es en verdad un gran honor tener la oportunidad de venir a este escenario por <ORDINAL>segunda</ORDINAL> vez. Estoy extremadamente agradecido. He quedado conmovido por esta conferencia, y deseo agradecer a todos ustedes sus amables comentarios acerca de lo que tenía que decir <TIME>la otra noche</TIME>.   spk.1
ted_1_1 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:54785873214:714688 2233    And I say that sincerely, partly because (Mock sob) I need that. (<PERSON>Laughter</PERSON>)    Y digo eso sinceramente, en parte porque — (Sollozos fingidos) — ¡lo necesito! (<PERSON>Risas</PERSON>) ¡Pónganse en mi posición!   spk.1
ted_1_2 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:33455963128:451328 1410    (Laughter) Now I have to take off my shoes or boots to get on an airplane! (Laughter) (Applause)    Volé en el avión vicepresidencial por <DATE>ocho años</DATE>. ¡Ahora tengo que quitarme mis zapatos o botas para subirme a un avión! (<PERSON>Risas</PERSON>) (<PERSON>Aplausos</PERSON>)   spk.1
ted_1_3 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:17195051777:131328 410 I'll tell you <CARDINAL>one</CARDINAL> quick story to illustrate what that's been like for me.  Les diré una rápida historia para ilustrar lo que ha sido para mí.  spk.1
ted_1_4 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:31898357685:124928 390 (Laughter) It's a true story — every bit of this is true.   Es una historia verdadera — cada parte de esto es verdad.   spk.1

You also need to find out why the CTC loss is 0. That usually happens when the transcript is longer than the (subsampled) input, which should not be the case. So there is something wrong with your data, either in the preprocessing or in the loading. I would recommend starting a debugger and checking what you have in the forward of cross_entropy_with_tags.py. You can also create a script where you load your data with a SpeechToTextDatasetTagged and check that the length of the input audio is the expected one (roughly the number of milliseconds / 10) and that the transcript/translation are correctly loaded; a rough sketch of such a check is below.
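For reference, a minimal sanity check along these lines could look like the sketch below. It is a hypothetical helper (check_tsv is not part of this repo): it reads the tagged TSV directly instead of going through SpeechToTextDatasetTagged, assumes the column names shown in the example above, assumes the usual 10 ms filterbank hop, and only looks for the tag types visible in that example.

import csv
import sys

# Hypothetical helper: rough sanity checks on a tagged TSV, assuming the columns
# shown above (id, audio, n_frames, src_text, tgt_text, speaker).
def check_tsv(tsv_path, expected_tags=("<PERSON>", "<ORDINAL>", "<TIME>", "<DATE>", "<CARDINAL>")):
    n_rows = n_tagged = n_suspicious = 0
    with open(tsv_path, encoding="utf-8") as f:
        # QUOTE_NONE: the text columns may contain unbalanced quote characters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            n_rows += 1
            n_frames = int(row["n_frames"])
            src, tgt = row["src_text"].strip(), row["tgt_text"].strip()
            if not src or not tgt:
                print(f"empty text: {row['id']}")
            # Very rough length check: with a 10 ms hop, n_frames ~ duration_ms / 10,
            # so it should comfortably exceed the transcript length.
            if n_frames < len(src.split()):
                n_suspicious += 1
                print(f"suspicious length: {row['id']} n_frames={n_frames} src_words={len(src.split())}")
            if any(tag in tgt for tag in expected_tags):
                n_tagged += 1
    print(f"{n_rows} rows, {n_tagged} with NE tags in tgt_text, {n_suspicious} suspicious lengths")

if __name__ == "__main__":
    check_tsv(sys.argv[1])

If it reports zero tagged rows on your train or dev TSV, the NER tagging step was skipped.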

Crabbit-F commented 1 year ago

I have found a problem: there are no NER labels in my dataset.

mgaido91 commented 1 year ago

We have labelled the data with Deeppavlov, as described in the paper.

Crabbit-F commented 1 year ago

I am sorry for this, and thank you very much for your help. I will look further into the CTC loss problem.

mgaido91 commented 12 months ago

I am closing this as it has been stale for a while. Feel free to reopen if anything else is needed. Thanks.

Crabbit-F commented 11 months ago

I'm very sorry for my mistake. I found that I didn't create a new training TSV with the NER tags, which is probably the main reason for the 0 BLEU. Would you mind sharing your tagged training TSV, since I don't know the format for adding the NER tags? Sincere thanks!

mgaido91 commented 11 months ago

The TSV is formatted as in the example above. You can create it using deeppavlov (https://docs.deeppavlov.ai/en/0.9.0/features/models/ner.html), as we have done (with the model ner_ontonotes_bert_mult).
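For reference, calling that model looks roughly like the sketch below; the exact DeepPavlov API may vary between versions, so take it as a rough outline.

from deeppavlov import configs, build_model

# download=True fetches the pretrained weights on first use.
ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)

# The model returns two parallel lists per batch: tokens and BIO labels.
tokens, tags = ner_model(["Muchas gracias Chris."])
print(list(zip(tokens[0], tags[0])))
# e.g. [('Muchas', 'O'), ('gracias', 'O'), ('Chris', 'B-PERSON'), ('.', 'O')]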

Crabbit-F commented 11 months ago

Thank you very much!

Crabbit-F commented 11 months ago

[screenshots: nercode, nerresult] I got different NER tags than you, which is weird.

mgaido91 commented 11 months ago

It is not different, I just formatted it differently. I converted the BIO format (the Deeppavlov output you see) into the format I showed you, wrapping the text spans with tags.
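For illustration only, a minimal BIO-to-inline conversion could look like the sketch below. This is not the exact script we used: it assumes simple whitespace detokenization and leaves any stray I- tag without a preceding B- untagged.

def bio_to_inline(tokens, tags):
    # tokens: list of strings; tags: parallel BIO labels such as "B-PERSON", "I-PERSON", "O".
    out, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i].split("-", 1)[1]
            span = [tokens[i]]
            i += 1
            # Collect the continuation tokens of the same entity.
            while i < len(tokens) and tags[i] == "I-" + etype:
                span.append(tokens[i])
                i += 1
            out.append("<{0}>{1}</{0}>".format(etype, " ".join(span)))
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

# bio_to_inline(["Muchas", "gracias", "Chris", "."], ["O", "O", "B-PERSON", "O"])
# -> 'Muchas gracias <PERSON>Chris</PERSON> .'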

Crabbit-F commented 11 months ago

But in my result, almost all words get an NER tag rather than O.

mgaido91 commented 11 months ago

Mmmmh.... this is weird indeed. I am not sure why this is happening. My script is:

from deeppavlov import configs, build_model
import sys

CHUNK_SIZE = 1000

ner_model = build_model(configs.ner.ner_ontonotes_bert_mult)

def ner(inputs):
    # The model returns two parallel lists per batch: tokens and BIO tags.
    res = ner_model(inputs)
    tokens = res[0]
    nes = res[1]
    for s_i in range(len(tokens)):
        outs = []
        for t_i, token in enumerate(tokens[s_i]):
            ne = nes[s_i][t_i]
            outs.append((t_i, token, ne))
            if nes[s_i][t_i].startswith("I-"):
                # Walk back to find the B- tag that opens this entity span.
                i = 1
                while nes[s_i][t_i-i].startswith("I-"):
                    i += 1
                if i > t_i:
                    # The span starts at the first token but has no B- tag: repair it.
                    outs[0] = (outs[0][0], outs[0][1], "B-" + outs[0][2].split("-")[1])
                else:
                    assert nes[s_i][t_i-i].startswith("B-"), "{} /// {}".format(str(nes), str(tokens))

        # One token per line: position, token, BIO tag; blank line between sentences.
        for o in outs:
            print("{}\t{}\t{}".format(o[0], o[1], o[2]))
        print("")

# Read one sentence per line from stdin and tag it in chunks of CHUNK_SIZE.
lines = []
for line in sys.stdin:
    lines.append(line.strip())
    if len(lines) >= CHUNK_SIZE:
        ner(lines)
        lines = []
if len(lines) > 0:
    ner(lines)
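The script reads one sentence per line from stdin and prints one token per line (tab-separated index, token, and BIO tag), with a blank line between sentences, so with hypothetical file names it can be run as: python ner_tag.py < transcripts.en > transcripts.en.bio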

Crabbit-F commented 11 months ago

😆 You did me a big favor. Thank you very much! I have submitted this question to DeepPavlov as well.

Crabbit-F commented 10 months ago

I re-downloaded the NER model and followed your script. However, the DeepPavlov authors have updated their code, and it seems difficult to reproduce the label example you gave. [screenshot: ner]

mgaido91 commented 10 months ago

Results may be a bit different (likely better), but apart from that, everything should be fine.

Crabbit-F commented 10 months ago

All right, let me reorganize my training process. How can I get dev_ep_netagged.tsv?

mgaido91 commented 10 months ago

The same way as the training set, using the dev set of Europarl-ST.

Crabbit-F commented 10 months ago

You mean the Europarl-ST dev TSV, tagged with NER like the Europarl-ST train TSV?

mgaido91 commented 10 months ago

As the dev set, we used the Europarl-ST dev set with NER tags.

Crabbit-F commented 10 months ago

Thanks. If I want to train only on MuST-C, how should I change the training script?

mgaido91 commented 10 months ago

You just need to put in --train-subset the name of the TSV for MuST-C. Similarly, for the --valid-subset parameter you can specify the TSV for the dev set of MuST-C (as well as that of any other dataset you might want to use); see the example below.
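Assuming the tagged MuST-C TSVs are named train_mustc_netagged.tsv and dev_mustc_netagged.tsv (hypothetical names, use whatever your files are actually called) and sit in the data directory passed as the first argument, the command above would just change to:

python train.py datasetdir \
    --train-subset train_mustc_netagged --valid-subset dev_mustc_netagged \
    ... (all the other flags unchanged)

Note that fairseq expects the subset names without the .tsv extension.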