Closed kadir-gunel closed 6 years ago
Hm, is this from examples/translate? That should work. What exactly is the output saying when it gives the help screen. The first line or so should contain an error message.
Yes, this happened from examples/translate. Unfortunately I deleted the screen. I will try to load another single model after my training finishes. I will inform you right after.
Is this maybe connected to #65 ?
I am not sure. I wrote that directly to terminal.
Have you actually specified any input for translation by using -i flag or < stdin >
../../build/amun -m en-de/model.npz --input-file INPUTFILE -s en-de/vocab.en.json -t en-de/vocab.de.json --mini-batch 50 --maxi-batch 1000 -b 12 -n --bpe en-de/ende.bpe
or
../../build/amun -m en-de/model.npz -s en-de/vocab.en.json -t en-de/vocab.de.json --mini-batch 50 --maxi-batch 1000 -b 12 -n --bpe en-de/ende.bpe < INPUTFILE > OUTPUTFILE
?
try to set --mini-batch=1
as workaround.
If there is a segfault, I would prefer that you make your model available so that we can look into it rather than try to work around it. The bug will only come back and bite us again
sure, here's my recent model to replicate reported bug. it was trained with marian/amun from 2b9d9f1aac107182b12ad3bf084d717c53a75b4c
commit.
Thanks! I will take a look on that.
since OP didn't provide detailed info on this and i managed to replicate the issue, i thought it would be great if i could provide some details.
i've posted build script and running log that replicate the error here: https://gist.github.com/geovedi/a35ac29fad382f81e8af9c259ee08570
I also have a fresh model that works with --mini-batch 1
and blows up spectacularly with anything higher than that. Packaging now.
@hieuhoang I'll send you the link off-github. Works fine with commit 0f50f93cfe7e9a49704500ef3585ff9c40bc67b8 , BTW. So this is related to the merge.
Btw, has the merge fixed something? Maybe is better to revert that.
debugging with geovedi's model, it still blows up in the pre-merged code with certain mini-batch settings, eg $amunmt/build/amun --config model/model.npz.dev.npz.amun.yml --gpu-threads 1 --cpu-threads 0 --mini-batch 8 --maxi-batch 8 -i data/dev.bpe.en >& temp.gpu It looks like an existing bug which is exposed in slightly different way. Reverting will just hide the bug for certain settings
Hello, I am the owner of the OP. Frankly, I deleted my log file.
@tomekd has a point. The merge may have fixed some things but it's create or exposed other bugs which I can't fix under time pressure. Will revert if no-one objects. Will look at it again when I have time now we have a few problem models we can test with
done. You should delete and git clone amun to make sure you get the reverted master branch
with 0f50f93cfe7e9a49704500ef3585ff9c40bc67b8, the problem with batches still exist on model trained with commit 0fa11878643ad79932fae68cb5d95f367795cb1d.
$ head -n 100 data/dev.bpe.en | /Users/geovedi/Documents/Others/github.com/amunmt/amunmt.new/build/amun -c test.yml --mini-batch 5 --maxi-batch 20
Segmentation fault: 11
still need to set mini-batch
to 1
. also cpu translation issue described in #67 reappear (as expected).
@geovedi If you're still interested in running this model with mini-batch > 1 on amun, I think the problems have been fixed in branch 3d6. Please give it a try & tell me how it goes
ok, will test it now.
ok for GPU, but not for CPU. but i guess that's expected?
here's the build and runtime log.
cool. CPU is someone else's problem
I don't think we have CPU batched translation. Actually, why not?
i think we've discussed this before. The point of batching on GPU is to use as many cores in parallel. Since CPU only has a few cores, multi-threading will do. It also takes a lot of time to do
I am never convinced by that :). When you merge with master, take a look at the regression tests.
Shall I try & merge now? It's getting quite difficult to merge as a lot of changes have been made to both sides. We should put in procedures so that we can push to master more regularly.
What regression test? I've created my own tests based on the failed models people have been sending us
On 14 June 2017 at 14:58, Marcin Junczys-Dowmunt notifications@github.com wrote:
I am never convinced by that :). When you merge with master, take a look at the regression tests.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/marian-nmt/marian/issues/63#issuecomment-308439553, or mute the thread https://github.com/notifications/unsubscribe-auth/AAqOFN_bTJz8NljUyVBAez50wM4qsGrnks5sD-bwgaJpZM4NE_Ls .
I think the segfaults have been fixed. Please git pull and try again. I'll close this issue if there are no further reports
When I try to translate with a single model (say model.iter3000) I use this script :
../../build/amun -m en-de/model.npz -s en-de/vocab.en.json -t en-de/vocab.de.json \ --mini-batch 50 --maxi-batch 1000 -b 12 -n --bpe en-de/ende.bpe
But it gives me all the help screen. And when I try to load single model with -c option (by giving -c model/model.npz.amun.yml) It gives error : Aborted core dump.Any help please ? B.R.