moses-smt / mgiza

A word alignment tool based on famous GIZA++, extended to support multi-threading, resume training and incremental training.

Non-deterministic results when -ncpus != 1 (mgiza bin) #26

Open cgr71ii opened 2 years ago

cgr71ii commented 2 years ago

Hi!

I have been using mgiza and have noticed that the generated files do not contain the same information across executions, not even the same number of lines. This happens when -ncpus != 1. I tested with the same input files and -ncpus set to 1, 2 and 8. Only with -ncpus 1 did two executions produce exactly the same output files.

Command:

#!/usr/bin/env bash
ncpus="1" # deterministic
#ncpus="2" # non-deterministic
#ncpus="8" # non-deterministic

# Run the same training twice with identical inputs and options
for iteration in 1 2; do
  mgiza -ncpus "$ncpus" -CoocurrenceFile corpus.fr-en.cooc -c corpus.fr-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -mh 5 -m5 0 -model1dumpfrequency 1 -o "test${iteration}.ncpus${ncpus}.corpus.fr-en" -s corpus.en.vcb -t corpus.fr.vcb -emprobforempty 0.0 -probsmooth 1e-7
done

# Compare each output file of run 1 with its counterpart from run 2, ignoring line order
for f1 in test1.ncpus"${ncpus}".corpus.fr-en*; do
  f2="test2${f1#test1}"
  # Number of lines that appear in only one of the two sorted files
  c=$(comm -3 <(sort "$f1") <(sort "$f2") | wc -l)

  if [[ "$c" -ne 0 ]]; then
    echo "Not equal: $f1 - $f2"
  fi
done
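To show how large the differences are, and not just which files differ, the comparison above could be extended along these lines. This is only a sketch: `summarize_diff` is a hypothetical helper name, not something shipped with mgiza, and it assumes the `test1.*` / `test2.*` naming used in the command above.

```shell
# Hypothetical helper (not part of mgiza): summarize how two output files differ.
# Prints: <file1> <line count of file1> <line count of file2> <lines unique to either file>
# The body runs in a subshell, so its variables do not leak into the caller.
summarize_diff() (
  f1=$1
  f2=$2
  s1=$(mktemp)
  s2=$(mktemp)
  # Sort both files so the comparison ignores line order
  sort "$f1" > "$s1"
  sort "$f2" > "$s2"
  # $(( ... )) strips any padding that wc may print around the number
  n1=$(( $(wc -l < "$f1") ))
  n2=$(( $(wc -l < "$f2") ))
  # comm -3 keeps only the lines that are unique to one of the two files
  d=$(( $(comm -3 "$s1" "$s2" | wc -l) ))
  rm -f "$s1" "$s2"
  printf '%s %d %d %d\n' "$f1" "$n1" "$n2" "$d"
)

# Example usage, assuming the file naming from the command above:
#   for f1 in test1.ncpus"${ncpus}".corpus.fr-en*; do
#     summarize_diff "$f1" "test2${f1#test1}"
#   done
```

A last column of 0 means the two files contain the same lines, possibly in a different order. A mismatch between the two line counts corresponds to the "not even the same number of lines" case described above.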

The files have been generated using Bitextor 8.2, with data from this WARC. The files needed to reproduce the results are attached to this issue (for corpus.fr-en.cooc.1.zip and corpus.fr-en.cooc.2.zip you will need to decompress them and run cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc).

input_mgiza.zip corpus.fr-en.cooc.2.zip corpus.fr-en.cooc.1.zip

hieuhoang commented 2 years ago

I doubt anyone will look into it. Why is it a problem? In fact, I'm surprised cpu=1 is deterministic

cgr71ii commented 2 years ago

Well... since there is no proper documentation I could consult, I thought this was not the expected behavior. Since you are not surprised by it, am I wrong in thinking that non-determinism is the expected behavior?

hieuhoang commented 2 years ago

You're right that the results should be either deterministic or non-deterministic regardless of how many threads are used.

I don't know the code that well, so don't take my word for it. In my mind, it should be non-deterministic during training due to randomness in word clustering. However, you seem to find it non-deterministic even during inference. That could be an issue.

I'm not sure who can come to your rescue; mgiza is abandonware these days. Perhaps @edwardgao, the original author, has some time.

Btw, running the command with your data crashes for me. I'm not sure if that has anything to do with it

cgr71ii commented 2 years ago

I have run the commands again and they work for me. Have you run

cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc

? I had to split the file to be able to upload it to the issue.

If you share the log, perhaps I can find out whether something is wrong with my installation.