How to test fine tuned model on parallel data?

MusfiqDehan commented 2 years ago

Hi, I have collected about 2.75 million Bengali-English parallel sentence pairs. I have already trained on 1,000 sentences as a trial run, but I do not understand how to test the trained model. How can I load my trained model and run the demo on it? It would be a great help if anyone could assist me with this issue. Thanks in advance for the help.

zdou0830 commented 2 years ago

Hi, you can use the following commands to generate aligned word pairs in $OUTPUT_WORD_FILE and check whether they make sense:

DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=/path/to/your/model
OUTPUT_FILE=/path/to/output/file
OUTPUT_WORD_FILE=/path/to/output/word/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --output_word_file=$OUTPUT_WORD_FILE \
    --batch_size 32

If you have the reference alignments (see an example in https://github.com/neulab/awesome-align/blob/master/examples/roen.gold), you can compute the alignment error rates using:

python tools/aer.py $GROUND_TRUTH_FILE $OUTPUT_FILE

Add --oneRef to the command if your references are one-indexed.
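For readers unfamiliar with the metric, here is a minimal sketch of how an alignment error rate can be computed for a single sentence. It is not the repository's tools/aer.py. It assumes the standard definition AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), and it assumes that the gold file marks sure links as i-j and possible links as ipj with the same indexing as the predictions; the function names are hypothetical.

```python
def parse_gold(line):
    """Split one gold line into sure and possible link sets.

    Assumption: 'i-j' marks a sure link and 'ipj' a possible link.
    """
    sure, possible = set(), set()
    for item in line.split():
        if "p" in item:
            i, j = item.split("p")
            possible.add((int(i), int(j)))
        else:
            i, j = item.split("-")
            sure.add((int(i), int(j)))
    # Every sure link also counts as a possible link.
    possible |= sure
    return sure, possible


def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    denominator = len(hypothesis) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing expected
    matched = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - matched / denominator
```

A lower value means the predicted links agree more closely with the references. For corpus-level figures, the counts are normally summed over all sentences before dividing, rather than averaging per-sentence scores.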

MusfiqDehan commented 2 years ago

Thank you so much for your response.

After training completed, I saw that a large file named pytorch_model.bin, containing all the trained weights, had been saved in the Outputs/ folder. I am guessing this is the final fine-tuned model. Now I want to load this model and test it on real-world parallel sentences to check how they are aligned, but I cannot figure out how to write the corresponding code. I badly need this piece of code, as I intend to perform a PoS tagging task using this aligner.

In simple words, I need to be able to input 2 parallel sentences and find the corresponding alignment.

zdou0830 commented 2 years ago

You can first prepare the parallel sentences in a file (e.g. a file named test.src-tgt) with the same format as https://github.com/neulab/awesome-align/blob/master/examples/enfr.src-tgt.
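As a rough sketch of preparing such a file from Python, assuming the linked example's format of one pair per line with whitespace-tokenized source and target sentences separated by ` ||| `. The helper names below are hypothetical and are not part of awesome-align.

```python
from pathlib import Path


def format_pair(src_tokens, tgt_tokens):
    """Join two token lists into one 'source ||| target' line."""
    return " ".join(src_tokens) + " ||| " + " ".join(tgt_tokens)


def write_pairs(pairs, path):
    """Write (src_tokens, tgt_tokens) pairs to a file, one pair per line."""
    lines = [format_pair(src, tgt) for src, tgt in pairs]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, write_pairs([(["আমি", "ভাত", "খাই"], ["I", "eat", "rice"])], "test.src-tgt") would produce a one-line test file. The sentences need to be tokenized beforehand, since the reported word indices refer to these whitespace-separated tokens.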

Then, if your pytorch_model.bin is saved in the directory Outputs, you can just set MODEL_NAME_OR_PATH in the previous command to Outputs.

Here's an example command:

DATA_FILE=test.src-tgt
MODEL_NAME_OR_PATH=Outputs
OUTPUT_FILE=output.src-tgt
OUTPUT_WORD_FILE=output.words.src-tgt

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --output_word_file=$OUTPUT_WORD_FILE \
    --batch_size 32

You can then see the output pairs in the i-j format in the file output.src-tgt. A pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word of the target sentence.

You can also see the aligned word pairs in the src_word<sep>tgt_word format in output.words.src-tgt.
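If you want to consume the alignments from your own code (for instance, to project tags from one side to the other), the i-j output can be mapped back to words with a few lines of Python. This is only a sketch: it assumes the input line uses the `source ||| target` format and that the corresponding output line holds space-separated zero-indexed i-j pairs, as described above. The function names are hypothetical.

```python
def parse_alignment(line):
    """Turn a line such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(sentence_line, alignment_line):
    """Pair up source and target words according to one alignment line."""
    src, tgt = sentence_line.split("|||")
    src_tokens, tgt_tokens = src.split(), tgt.split()
    return [
        (src_tokens[i], tgt_tokens[j])
        for i, j in parse_alignment(alignment_line)
    ]
```

Reading test.src-tgt and output.src-tgt line by line in parallel (for example with zip over the two open files) and passing each pair of lines to aligned_words should give the same word pairs that appear in output.words.src-tgt.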
