Closed MusfiqDehan closed 2 years ago
Hi, you can use the following commands to generate aligned word pairs in $OUTPUT_WORD_FILE and check whether they make sense:
DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=/path/to/your/model
OUTPUT_FILE=/path/to/output/file
OUTPUT_WORD_FILE=/path/to/output/word/file
CUDA_VISIBLE_DEVICES=0 awesome-align \
--output_file=$OUTPUT_FILE \
--model_name_or_path=$MODEL_NAME_OR_PATH \
--data_file=$DATA_FILE \
--extraction 'softmax' \
--output_word_file=$OUTPUT_WORD_FILE \
--batch_size 32
If you have the reference alignments (see an example in https://github.com/neulab/awesome-align/blob/master/examples/roen.gold), you can compute the alignment error rates using:
python tools/aer.py $GROUND_TRUTH_FILE $OUTPUT_FILE
Add --oneRef to the command if your references are one-indexed.
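For reference, the alignment error rate is conventionally defined (following Och and Ney) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure gold links, and P the possible gold links (taken to include S). The sketch below only evaluates that formula on already-parsed sets of (i, j) pairs. It is not the code in tools/aer.py, and the function name `alignment_error_rate` is made up for illustration.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Compute AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure, possible: iterables of (src_index, tgt_index) tuples.
    Sure links are always treated as possible links too.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible or ()) | sure

    # Assumption: with no predicted and no sure links there is nothing to get wrong.
    if not predicted and not sure:
        return 0.0

    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


if __name__ == "__main__":
    # One of two predicted links matches a sure gold link.
    print(alignment_error_rate({(0, 0), (1, 2)}, {(0, 0), (1, 1)}))
```

A lower value is better; 0.0 means every predicted link is at least a possible gold link and every sure gold link was predicted.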
Thank you so much for your response.
After training completed, I saw that a large file named pytorch_model.bin, containing all the trained weights, was saved in the Outputs/ folder. So I am guessing this is the final fine-tuned model. Now I want to load this model and test it on real-world parallel sentences to check how they are aligned. I cannot figure out how to write the code for this. I badly need this piece of code, as I intend to perform a PoS tagging task using this aligner.
In simple words, I need to be able to input 2 parallel sentences and find the corresponding alignment.
You can first prepare the parallel sentences in a file (e.g. a file named test.src-tgt) with the same format as https://github.com/neulab/awesome-align/blob/master/examples/enfr.src-tgt.
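As a rough sketch of preparing such a file from Python: the awesome-align README describes the input as one tokenized sentence pair per line, with source and target separated by ` ||| `. The helper names `format_pair` and `write_parallel_file` and the sample sentences are hypothetical, and whitespace splitting is only a stand-in for a real tokenizer.

```python
from pathlib import Path

# Separator between source and target, as described in the awesome-align README.
SEPARATOR = " ||| "


def format_pair(src, tgt):
    """Return one input line: whitespace-normalized source and target joined by the separator."""
    src_tokens = src.split()
    tgt_tokens = tgt.split()
    if not src_tokens or not tgt_tokens:
        raise ValueError("both sentences must contain at least one token")
    return " ".join(src_tokens) + SEPARATOR + " ".join(tgt_tokens)


def write_parallel_file(pairs, path):
    """Write (source, target) sentence pairs to `path`, one pair per line; return the line count."""
    lines = [format_pair(src, tgt) for src, tgt in pairs]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Illustrative sentences only; replace with your own tokenized data.
    write_parallel_file([("I like tea", "j' aime le thé")], "test.src-tgt")
```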
Then, if your pytorch_model.bin is saved in the directory Outputs, you can just set MODEL_NAME_OR_PATH in the previous command to Outputs.
Here's an example command:
DATA_FILE=test.src-tgt
MODEL_NAME_OR_PATH=Outputs
OUTPUT_FILE=output.src-tgt
OUTPUT_WORD_FILE=output.words.src-tgt
CUDA_VISIBLE_DEVICES=0 awesome-align \
--output_file=$OUTPUT_FILE \
--model_name_or_path=$MODEL_NAME_OR_PATH \
--data_file=$DATA_FILE \
--extraction 'softmax' \
--output_word_file=$OUTPUT_WORD_FILE \
--batch_size 32
You can then see the output pairs in the i-j format in the file output.src-tgt: a pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word of the target sentence.
You can also see the aligned word pairs in the src_word<sep>tgt_word format in output.words.src-tgt.
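To make the i-j format concrete, here is a minimal sketch that parses one line of the alignment output and maps the zero-indexed pairs back to words. The function names and the example sentences are hypothetical, and the sentences are assumed to be whitespace-tokenized exactly as they were in the input file.

```python
def parse_alignment_line(line):
    """Parse a line such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for item in line.split():
        src_index, tgt_index = item.split("-")
        pairs.append((int(src_index), int(tgt_index)))
    return pairs


def aligned_words(src_sentence, tgt_sentence, alignment_line):
    """Return (source_word, target_word) tuples for each i-j pair on the line."""
    src_tokens = src_sentence.split()
    tgt_tokens = tgt_sentence.split()
    return [(src_tokens[i], tgt_tokens[j]) for i, j in parse_alignment_line(alignment_line)]


if __name__ == "__main__":
    # Illustrative input only, not real aligner output.
    for pair in aligned_words("I like tea", "j' aime le thé", "0-0 1-1 2-3"):
        print(pair)
```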
Hi, I have collected about 2.75 million parallel Bengali-English sentences. I have already trained on 1,000 sentences for testing purposes, but I do not understand how to test my trained model. How can I run the demo on my trained model? How can I load my trained model? It would be a great help if anyone could help me with this issue. Thanks in advance.