ma-sultan / monolingual-word-aligner


Redundant sourceWordsBeingConsidered #2

Open maali-mnasri opened 8 years ago

maali-mnasri commented 8 years ago

In aligner.py, lines 1267 and 1268, each source/target word index may be appended many times to the sourceWordsBeingConsidered/targetWordsBeingConsidered lists, which makes these lists very large due to redundant elements. I do not see the point of including word indices many times, as this makes the next loop (line 1285) very time-consuming. To speed up execution, I converted the sourceWordsBeingConsidered and targetWordsBeingConsidered lists to sets to remove duplicates. It is far faster now, and I get the same alignment in testalign.py; however, I want to be sure that this does not degrade alignment quality in other cases. Can you please confirm that removing the redundancy is safe?

ma-sultan commented 8 years ago

Thanks for catching this; what you have done is what was originally intended. The alignments should still be the same because of the two `continue` statements on lines 1293 and 1297. I will update the source soon.

maali-mnasri commented 8 years ago

Great! Thank you.

eoehri commented 7 years ago

Hi, I'm also running into performance issues. Could you please provide your adjusted code? Many thanks.

maali-mnasri commented 7 years ago

@eoehri Hi, I just added these two lines to the aligner.py file, between lines 1282 and 1285 (just before the loop): `sourceWordIndicesBeingConsidered = list(set(sourceWordIndicesBeingConsidered))` and `targetWordIndicesBeingConsidered = list(set(targetWordIndicesBeingConsidered))`. I hope this helps.
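
For anyone landing here later, a minimal standalone sketch of the deduplication fix described in this thread. The variable names mirror those in aligner.py, but the index values and the `dedupe_indices` helper are hypothetical; note that `list(set(...))` as used above works but leaves the element order unspecified, so `sorted(set(...))` is shown here to keep runs deterministic:

```python
def dedupe_indices(indices):
    """Remove duplicate word indices, keeping a deterministic (sorted) order.

    list(set(indices)) alone, as in the fix above, also removes duplicates,
    but its iteration order is unspecified across runs.
    """
    return sorted(set(indices))


# Hypothetical example: indices get appended repeatedly while candidates
# are collected, so the lists grow with redundant entries.
sourceWordIndicesBeingConsidered = [3, 7, 3, 3, 9, 7]
targetWordIndicesBeingConsidered = [1, 1, 4, 2, 4]

# Deduplicate once, just before the expensive loop over the candidates.
sourceWordIndicesBeingConsidered = dedupe_indices(sourceWordIndicesBeingConsidered)
targetWordIndicesBeingConsidered = dedupe_indices(targetWordIndicesBeingConsidered)

print(sourceWordIndicesBeingConsidered)  # [3, 7, 9]
print(targetWordIndicesBeingConsidered)  # [1, 2, 4]
```

As noted above, this is safe here because the later loop skips already-aligned word pairs via `continue`, so processing each index once yields the same alignments.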