marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

transformer.h debugging: how to print out intermediate steps? #330

Closed sshleifer closed 4 years ago

sshleifer commented 4 years ago

I am a C++ noob trying to diagnose/debug why my Python port of a transformer produces different results than marian-decoder on the same tokens (the problem is that my Python model generates the same token over and over again).

I've been trying to set breakpoints in VS Code at https://github.com/marian-nmt/marian/blob/master/src/models/transformer.h#L530, but when I hit them and run -exec fr variable layer (to inspect the intermediate outputs), I get

(IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase> > >) layer = <variable not available>

Is there a better/suggested way to print out intermediate outputs for debugging purposes?

Thanks in advance and sorry if the question is dumb.

emjotde commented 4 years ago

What are you trying to view?

sshleifer commented 4 years ago

The first 10 values in the layer tensor.

emjotde commented 4 years ago

Inside the loop, you can add

debug(layer, prefix_ + "_l" + std::to_string(i) + "_self");

When you then run the code, you will see a numpy-like view of the tensor.

sshleifer commented 4 years ago

OK, that worked, thank you! For the other noobs: you need to run make -j4 again after adding debug statements.

emjotde commented 4 years ago

Yes, you need to rebuild.

sshleifer commented 4 years ago

Is there a simple way to make rebuilds faster, given that I'm only changing transformer.h?

emjotde commented 4 years ago

Add -DUSE_CCACHE=on. You need to have ccache installed for that. It's a tool that caches compiled objects, so files that haven't changed don't need to be recompiled. That should speed up builds after the project has been built once.

cmake .. -DUSE_CCACHE=on
sshleifer commented 4 years ago

I've isolated at least one issue, positional embeddings are different. Is there a way to see the signal variable in https://github.com/marian-nmt/marian/blob/master/src/models/transformer.h#L87 ?

emjotde commented 4 years ago

The same way, debug(signal, "PosEmbs"); in that function.

emjotde commented 4 years ago

I think I remember there is an off-by-one difference in the position embeddings between Marian and Fairseq; maybe it's the same issue.
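One way small mismatches like this arise is in how the sinusoidal signal is laid out across channels. The following standalone sketch (not Marian's or Fairseq's actual code, and the layouts assigned to each toolkit here are an assumption) contrasts the two common layouts: "half-split", where all sines occupy the first dimModel/2 channels and all cosines the second half, versus "interleaved", where sin and cos alternate. With dimModel = 512, the same cosine value sits 256 channels apart in the two layouts, which is one plausible source of an "off by 256" discrepancy.

```cpp
#include <cmath>
#include <vector>

// Half-split layout: sines in channels [0, d/2), cosines in [d/2, d).
std::vector<float> posEmbHalfSplit(int pos, int dimModel) {
  std::vector<float> signal(dimModel);
  int half = dimModel / 2;
  for (int i = 0; i < half; ++i) {
    float freq = std::pow(10000.0f, -2.0f * i / dimModel);
    signal[i]        = std::sin(pos * freq);
    signal[half + i] = std::cos(pos * freq);
  }
  return signal;
}

// Interleaved layout: sin and cos alternate channel by channel,
// as in the original "Attention Is All You Need" formulation.
std::vector<float> posEmbInterleaved(int pos, int dimModel) {
  std::vector<float> signal(dimModel);
  for (int i = 0; i < dimModel / 2; ++i) {
    float freq = std::pow(10000.0f, -2.0f * i / dimModel);
    signal[2 * i]     = std::sin(pos * freq);
    signal[2 * i + 1] = std::cos(pos * freq);
  }
  return signal;
}
```

Comparing the two side by side for a few positions makes the channel offset obvious: the first cosine lands in channel 1 in one layout and in channel 256 in the other.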

sshleifer commented 4 years ago

I got to the bottom of the positional embedding differences (it's off by 256 :)), and now the encoders return the same thing. For the decoder, I am trying to understand what the first tensor being sent to the decoder is, but the following debug statement fails:

debug(startStates, "startStates") // L648, reference to non-static member function must be called

Another command that fails that I'll probably need is

debug(logits, "logits");  // ~line 820 of transformer.h

Thanks again for your help!

emjotde commented 4 years ago

Neither of them is a tensor; those are classes on which you cannot call debug.

sshleifer commented 4 years ago

Sometimes my debug statements don't show up where I would expect them to. For example, in LayerAttention I added two debug statements:

    debug(output, "output before postprocess");
    auto opsPost = opt<std::string>("transformer-postprocess");
    output = postProcess(prefix + "_Wo", opsPost, output, input, dropProb);
    debug(output, "output after postprocess");

The first message is always written to standard error, but the second one appears only when the function is called from DecoderLayerSelfAttention or from the encoder, not for cross-attention. The program just keeps going and terminates successfully; there is no error.

Have you seen that sort of behavior?

emjotde commented 4 years ago

The debug statements mark nodes so that they print their values when the graph passes through them. If a node is traversed only once, you will see only one message, regardless of how many debug statements you insert for that node.

sshleifer commented 4 years ago

Can I make a copy of it to see the new value?

emjotde commented 4 years ago

No, it also means there is no new value, otherwise there would be a new node.

emjotde commented 4 years ago

Seems to me the call to postProcess is not really doing anything in your config and just returns the identical node?

emjotde commented 4 years ago

To summarize: computation in Marian is never in-place. If you expect a value to change, there has to be a new node. If there are two distinct nodes, the debug output will appear twice. If you call debug multiple times on the same node, the output will appear only once.

Sometimes nodes are collapsed already during construction. Say, for given a and b, you do

Expr x = a + b;
Expr y = a + b;
debug(x, "Message1");
debug(y, "Message2");

You will only see one message: the graph knows that x is the same as y, so y is never actually created and is replaced by x already at construction time. Moreover, the message gets overwritten, so you will only see Message2.
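The behavior described above can be sketched with a toy expression graph (this is an illustration, not Marian's implementation): nodes are memoized by (op, operands), so constructing the same expression twice returns the existing node, and a node carries a single debug message that each debug() call replaces.

```cpp
#include <map>
#include <memory>
#include <string>
#include <tuple>

// A toy graph node: an operation over two constant operands,
// plus the single debug message slot that later calls overwrite.
struct Node {
  std::string op;
  int a, b;
  std::string debugMessage;
};
using Expr = std::shared_ptr<Node>;

struct Graph {
  // Memoize nodes by (op, operands): building "a + b" a second time
  // returns the node created the first time, so y IS x.
  std::map<std::tuple<std::string, int, int>, Expr> memo;

  Expr add(int a, int b) {
    auto key = std::make_tuple(std::string("+"), a, b);
    auto it = memo.find(key);
    if (it != memo.end())
      return it->second;  // collapsed: reuse the existing node
    auto node = std::make_shared<Node>(Node{"+", a, b, ""});
    memo[key] = node;
    return node;
  }
};

// Marking a node for debugging just stores one message on it;
// a second call replaces the first message rather than adding another.
void debug(const Expr& e, const std::string& msg) { e->debugMessage = msg; }
```

In this sketch, debug(x, "Message1") followed by debug(y, "Message2") leaves the one shared node carrying only "Message2", mirroring what the thread observes.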

sshleifer commented 4 years ago

My config has "transformer-postprocess": "dan", and I get debug statements from the encoder's calls. Is post-processing somehow disabled in the decoder? Full Config Here

emjotde commented 4 years ago

No, that should do something. Check whether the pointer values of output before and after are different. Something like

std::cerr << (size_t)output << std::endl;

should work.

sshleifer commented 4 years ago

This doesn't compile:

transformer.h:335:18: error: cannot convert
      'IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase> > >' to
      'size_t' (aka 'unsigned long') without a conversion operator
emjotde commented 4 years ago

Sorry, try this

std::cerr << (size_t)output.get() << std::endl;
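The idea behind this check can be shown with a minimal standalone sketch (using a plain std::shared_ptr rather than Marian's IntrusivePtr, and a made-up postProcessSketch standing in for the real postProcess): the smart pointer itself has no conversion to size_t, which caused the compile error above, but the raw pointer returned by .get() casts fine, and comparing the address before and after tells you whether a new node was produced.

```cpp
#include <cstddef>
#include <memory>

// Stand-in for an expression-graph node handle.
using Expr = std::shared_ptr<int>;

// Equivalent of (size_t)output.get(): the raw address the handle holds.
size_t address(const Expr& e) { return (size_t)e.get(); }

// Hypothetical stand-in for postProcess: always builds a brand-new node,
// so the address it returns differs from the input's.
Expr postProcessSketch(const Expr& input) {
  return std::make_shared<int>(*input + 1);
}
```

If the address printed before the call differs from the one printed after, postProcess really did return a distinct node.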
sshleifer commented 4 years ago

postProcess gets called and the pointer changes, but debug still does not output the tensor.

Evidence

Logging statements (lines with a timestamp were generated inside postProcess; lines without a timestamp are from std::cerr << (size_t)output.get() << std::endl;):

[2020-04-22 10:54:48] decoder_l1_selfabout to call postprocess
4417515776
[2020-04-22 10:54:48] decoder_l1_self_Wo: postProcess ops: dan
4417516288
[2020-04-22 10:54:48] decoder_l1_selfdone postprocess
[2020-04-22 10:54:48] decoder_l1_contextabout to call postprocess
4417519360
[2020-04-22 10:54:48] decoder_l1_context_Wo: postProcess ops: dan
4417519872

However, even though the pointer changes, the debugging statement associated with the final value does not display: See the following output from debug statements:

[2020-04-22 10:54:48] Debug: decoder_l1_self output before postprocess op=affine
[2020-04-22 10:54:48] shape=1x1x1x512 size=512 type=float32 device=cpu0 ptr=4526488576 bytes=2048
min: -2.17710781 max: 6.77481270 l2-norm: 13.08215678
[[[[  -0.23997170   0.15916049   0.06509697  -0.12403495  -0.60352713 ...  -0.07692891  -0.16747165  -0.24933907  -0.19263674   0.01830016 ]]]]

[2020-04-22 10:54:48] Debug: decoder_l1_selfoutput after postprocess op=layer_normalization
[2020-04-22 10:54:48] shape=1x1x1x512 size=512 type=float32 device=cpu0 ptr=4526488576 bytes=2048
min: -2.98920894 max: 13.56639862 l2-norm: 25.95007426
[[[[  -0.71452057  -0.08474327  -0.73883373  -0.75963336  -1.36101186 ...   0.39763063   0.20678757   0.34616321   0.83726954   0.84396362 ]]]]

[2020-04-22 10:54:48] Debug: decoder_l1_context output before postprocess op=affine
[2020-04-22 10:54:48] shape=1x1x1x512 size=512 type=float32 device=cpu0 ptr=4526490624 bytes=2048
min: -1.78802609 max: 1.57692146 l2-norm: 6.58460730
[[[[  -0.44128269  -0.01378652   0.29056573  -0.67267472   0.37242836 ...   0.51235217  -0.13973767   0.17449422  -0.19592692  -0.11995535 ]]]]

// ALL GOOD UNTIL HERE, then the "after postprocess" statement is missing

[2020-04-22 10:54:48] Debug: OUTPUT:decoder_l op=layer_normalization
[2020-04-22 10:54:48] shape=1x1x1x512 size=512 type=float32 device=cpu0 ptr=4526488576 bytes=2048
min: -2.09566784 max: 7.83572817 l2-norm: 23.84594046
[[[[  -0.66445428   0.34865451  -0.46758631  -1.03975117  -0.78854591 ...   0.80517358  -0.05014769   0.57723463   1.16044199   0.96584964 ]]]]
emjotde commented 4 years ago

Nope, all as expected. Your very last debug is called on the same node as your last output, so you overwrote the debug message. As I said, you will only see one debug message per node. In the future I could keep a list of debug messages so that they get accumulated instead of overwritten.

emjotde commented 4 years ago

Updated the x = a + b comment above with the information that messages get overwritten by multiple debug calls.

sshleifer commented 4 years ago

The list of debug messages would be wonderful!

sshleifer commented 4 years ago

Fixed the bug: I had context_Wo_ln_scale and context_Wo_ln_bias flipped. Thank you so much for all the help!

emjotde commented 4 years ago

Closing then. Feel free to reopen if there are more issues.