marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io
Other
1.23k stars 230 forks source link

Translating with a single model #63

Closed kadir-gunel closed 6 years ago

kadir-gunel commented 7 years ago

When I try to translate with a single model (say model.iter3000) I use this script : ../../build/amun -m en-de/model.npz -s en-de/vocab.en.json -t en-de/vocab.de.json \ --mini-batch 50 --maxi-batch 1000 -b 12 -n --bpe en-de/ende.bpe But it gives me all the help screen. And when I try to load single model with -c option (by giving -c model/model.npz.amun.yml) It gives error : Aborted core dump.

Any help please ? B.R.

emjotde commented 7 years ago

Hm, is this from examples/translate? That should work. What exactly is the output saying when it gives the help screen. The first line or so should contain an error message.

kadir-gunel commented 7 years ago

Yes, this happened from examples/translate. Unfortunately I deleted the screen. I will try to load another single model after my training finishes. I will inform you right after.

emjotde commented 7 years ago

Is this maybe connected to #65 ?

kadir-gunel commented 7 years ago

I am not sure. I wrote that directly to terminal.

mehmedes commented 7 years ago

Have you actually specified any input for translation by using -i flag or < stdin >

../../build/amun -m en-de/model.npz --input-file INPUTFILE -s en-de/vocab.en.json -t en-de/vocab.de.json --mini-batch 50 --maxi-batch 1000 -b 12 -n --bpe en-de/ende.bpe

or

../../build/amun -m en-de/model.npz -s en-de/vocab.en.json -t en-de/vocab.de.json --mini-batch 50 --maxi-batch 1000 -b 12 -n --bpe en-de/ende.bpe < INPUTFILE > OUTPUTFILE

?

geovedi commented 7 years ago

try to set --mini-batch=1 as workaround.

hieuhoang commented 7 years ago

If there is a segfault, I would prefer that you make your model available so that we can look into it rather than try to work around it. The bug will only come back and bite us again

geovedi commented 7 years ago

sure, here's my recent model to replicate reported bug. it was trained with marian/amun from 2b9d9f1aac107182b12ad3bf084d717c53a75b4c commit.

tomekd commented 7 years ago

Thanks! I will take a look on that.

geovedi commented 7 years ago

since OP didn't provide detailed info on this and i managed to replicate the issue, i thought it would be great if i could provide some details.

i've posted build script and running log that replicate the error here: https://gist.github.com/geovedi/a35ac29fad382f81e8af9c259ee08570

emjotde commented 7 years ago

I also have a fresh model that works with --mini-batch 1 and blows up spectacularly with anything higher than that. Packaging now.

emjotde commented 7 years ago

@hieuhoang I'll send you the link off-github. Works fine with commit 0f50f93cfe7e9a49704500ef3585ff9c40bc67b8 , BTW. So this is related to the merge.

tomekd commented 7 years ago

Btw, has the merge fixed something? Maybe is better to revert that.

hieuhoang commented 7 years ago

debugging with geovedi's model, it still blows up in the pre-merged code with certain mini-batch settings, eg $amunmt/build/amun --config model/model.npz.dev.npz.amun.yml --gpu-threads 1 --cpu-threads 0 --mini-batch 8 --maxi-batch 8 -i data/dev.bpe.en >& temp.gpu It looks like an existing bug which is exposed in slightly different way. Reverting will just hide the bug for certain settings

kadir-gunel commented 7 years ago

Hello, I am the owner of the OP. Frankly, I deleted my log file.

hieuhoang commented 7 years ago

@tomekd has a point. The merge may have fixed some things but it's create or exposed other bugs which I can't fix under time pressure. Will revert if no-one objects. Will look at it again when I have time now we have a few problem models we can test with

hieuhoang commented 7 years ago

done. You should delete and git clone amun to make sure you get the reverted master branch

geovedi commented 7 years ago

with 0f50f93cfe7e9a49704500ef3585ff9c40bc67b8, the problem with batches still exist on model trained with commit 0fa11878643ad79932fae68cb5d95f367795cb1d.

$ head -n 100 data/dev.bpe.en | /Users/geovedi/Documents/Others/github.com/amunmt/amunmt.new/build/amun -c test.yml --mini-batch 5 --maxi-batch 20
Segmentation fault: 11

still need to set mini-batch to 1. also cpu translation issue described in #67 reappear (as expected).

hieuhoang commented 7 years ago
  1. GPU mini-batch - my changes was supposed to fix that but it didn't so I've reverted. So for now, please set mini-batch=1
  2. CPU issue - @tomekd had fixed it in the merged code that I reverted. https://github.com/amunmt/amunmt/commit/0fa11878643ad79932fae68cb5d95f367795cb1d. So you can re-apply that commit to master, or wait for tomasz to do it
hieuhoang commented 7 years ago

@geovedi If you're still interested in running this model with mini-batch > 1 on amun, I think the problems have been fixed in branch 3d6. Please give it a try & tell me how it goes

geovedi commented 7 years ago

ok, will test it now.

geovedi commented 7 years ago

ok for GPU, but not for CPU. but i guess that's expected?

here's the build and runtime log.

hieuhoang commented 7 years ago

cool. CPU is someone else's problem

emjotde commented 7 years ago

I don't think we have CPU batched translation. Actually, why not?

hieuhoang commented 7 years ago

i think we've discussed this before. The point of batching on GPU is to use as many cores in parallel. Since CPU only has a few cores, multi-threading will do. It also takes a lot of time to do

emjotde commented 7 years ago

I am never convinced by that :). When you merge with master, take a look at the regression tests.

hieuhoang commented 7 years ago

Shall I try & merge now? It's getting quite difficult to merge as a lot of changes have been made to both sides. We should put in procedures so that we can push to master more regularly.

What regression test? I've created my own tests based on the failed models people have been sending us

On 14 June 2017 at 14:58, Marcin Junczys-Dowmunt notifications@github.com wrote:

I am never convinced by that :). When you merge with master, take a look at the regression tests.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/marian-nmt/marian/issues/63#issuecomment-308439553, or mute the thread https://github.com/notifications/unsubscribe-auth/AAqOFN_bTJz8NljUyVBAez50wM4qsGrnks5sD-bwgaJpZM4NE_Ls .

hieuhoang commented 7 years ago

I think the segfaults have been fixed. Please git pull and try again. I'll close this issue if there are no further reports