rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

rr hangs inside docker (but works outside docker) #1784

Closed sidkshatriya closed 7 years ago

sidkshatriya commented 7 years ago

I'm running rr inside a vmware fusion Ubuntu 16.04 VM on a mac. Everything works beautifully.

Now I'm trying to run rr within a docker image (within the 16.04 VM).

I've tried running rr in the following docker images:

The problem is that rr simply hangs when running within docker. Even a simple make test (of rr) starts timing out from the first test onwards.

What's going on? This is a little puzzling to me because docker is not another VM, so rr should work without a hitch, right?

Additionally, I'm using standard docker images and I don't think anything exotic is going on as far as configurations go ...

rocallahan commented 7 years ago

Doesn't the standard docker image block ptrace?

Try running gdb --args rr record ls and hit ctrl-C after it hangs. Although gdb might not work either...

You may also be able to run the entire docker image under rr. That should work.
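One quick way to see whether a syscall filter could be involved is to look at the Seccomp field in /proc/self/status from inside the container. The sketch below only reports the mode (0 = disabled, 1 = strict, 2 = filter). It cannot tell whether ptrace specifically is blocked, and the helper name seccomp_mode is made up for illustration.

```python
from pathlib import Path

# Meaning of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" on newer kernels does not match this prefix.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown ({})".format(value))
    return "not reported"


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        print("seccomp:", seccomp_mode(status.read_text()))
```

If this prints "filter" inside the container and "disabled" on the host, the container's seccomp profile is at least a candidate for the hang.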

sidkshatriya commented 7 years ago

I've found a workaround to get rr working... and that is to run docker with the "--privileged" flag.

https://docs.docker.com/engine/reference/commandline/run/#/full-container-capabilities-privileged

Is there a way to avoid using this flag? Something a little more fine-grained?

rocallahan commented 7 years ago

I know nothing about Docker, sorry.

sidkshatriya commented 7 years ago

Interestingly these are the only perf events available in the non-privileged mode:

List of pre-defined events (to be used in -e):

  branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
  branch-misses OR cpu/branch-misses/                [Kernel PMU event]
  bus-cycles OR cpu/bus-cycles/                      [Kernel PMU event]
  cache-misses OR cpu/cache-misses/                  [Kernel PMU event]
  cache-references OR cpu/cache-references/          [Kernel PMU event]
  cpu-cycles OR cpu/cpu-cycles/                      [Kernel PMU event]
  cycles-ct OR cpu/cycles-ct/                        [Kernel PMU event]
  cycles-t OR cpu/cycles-t/                          [Kernel PMU event]
  el-abort OR cpu/el-abort/                          [Kernel PMU event]
  el-capacity OR cpu/el-capacity/                    [Kernel PMU event]
  el-commit OR cpu/el-commit/                        [Kernel PMU event]
  el-conflict OR cpu/el-conflict/                    [Kernel PMU event]
  el-start OR cpu/el-start/                          [Kernel PMU event]
  instructions OR cpu/instructions/                  [Kernel PMU event]
  mem-loads OR cpu/mem-loads/                        [Kernel PMU event]
  mem-stores OR cpu/mem-stores/                      [Kernel PMU event]
  msr/aperf/                                         [Kernel PMU event]
  msr/mperf/                                         [Kernel PMU event]
  msr/smi/                                           [Kernel PMU event]
  msr/tsc/                                           [Kernel PMU event]
  tx-abort OR cpu/tx-abort/                          [Kernel PMU event]
  tx-capacity OR cpu/tx-capacity/                    [Kernel PMU event]
  tx-commit OR cpu/tx-commit/                        [Kernel PMU event]
  tx-conflict OR cpu/tx-conflict/                    [Kernel PMU event]
  tx-start OR cpu/tx-start/                          [Kernel PMU event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]                          [Hardware breakpoint]

This is what I see for gdb --args rr record ls:

[...]
Type "apropos word" to search for commands related to "word"...
Reading symbols from rr...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/rr record ls
warning: Error disabling address space randomization: Operation not permitted
rr: Saving execution to trace directory `/root/.local/share/rr/ls-0'.
[HANGS]

Ctrl+C cannot interrupt the program...

sidkshatriya commented 7 years ago

Additional information -- I wonder if it might be useful.

These are the default capabilities in a docker container: https://github.com/docker/docker/blob/master/oci/defaults_linux.go#L64-L79. Also see http://man7.org/linux/man-pages/man7/capabilities.7.html. (Both links are from https://docs.docker.com/engine/security/security/ )
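To compare that default list with what a running container actually has, one option is to decode the CapEff bitmask from /proc/self/status. This is a rough sketch: the bit numbers follow capabilities(7), only the first 32 capabilities are named, and decode_caps and effective_caps are illustrative names rather than part of any tool mentioned here. The mask 0xa80425fb in the usage note is an assumption about what a default, non-privileged container typically shows. It was not taken from this thread.

```python
from pathlib import Path

# Capability names indexed by bit number (bits 0..31 of linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]


def decode_caps(mask):
    """Return the set of capability names whose bits are set in mask."""
    return {name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1}


def effective_caps(status_text):
    """Decode the CapEff line from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split(":", 1)[1].strip(), 16))
    return set()


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        caps = effective_caps(status.read_text())
        print("CAP_SYS_PTRACE present:", "CAP_SYS_PTRACE" in caps)
```

For example, decode_caps(0xa80425fb) yields 14 capabilities and does not include CAP_SYS_PTRACE or CAP_SYS_ADMIN.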

sidkshatriya commented 7 years ago

A small caveat for those who might want to use --privileged as a workaround for getting rr to work in docker. See, for example, http://obrown.io/2016/02/15/privileged-containers.html:

Don’t use privileged containers unless you treat them the same way you treat any other process running as root.

(You can read more on the internet on --privileged ... just google it)

In summary: avoid running rr with this "workaround" unless you're comfortable with the implications of --privileged

khuey commented 7 years ago

FWIW rr runs inside an Ubuntu 16.10 based docker image on my machine without doing anything special.

sidkshatriya commented 7 years ago

Very interesting and useful finding. Just so I understand your setup:

khuey commented 7 years ago

Bare metal. Docker 1.12.1.

sidkshatriya commented 7 years ago

That probably explains it. I'm running docker within an Ubuntu 16.04 VM. The VM itself is running on VMware Fusion.

Manouchehri commented 6 years ago

No luck here.

root@7b77d892f281:/rr/obj# rr record whoami
rr: Saving execution to trace directory `/root/.local/share/rr/whoami-1'.
You have a Ryzen CPU. The Ryzen retired-conditional-branches hardware
performance counter is not accurate enough; rr will be unreliable.
See https://github.com/mozilla/rr/issues/2034.
[FATAL ../src/PerfCounters.cc:262:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.
=== Start rr backtrace:
rr(_ZN2rr15notifying_abortEv+0x41)[0x535ca1]
rr(_ZN2rr12FatalOstreamD1Ev+0x6a)[0x486b1a]
rr[0x4985db]
rr[0x49706f]
rr(_ZN2rr12PerfCountersC1Ei+0x23)[0x497c93]
rr(_ZN2rr4TaskC2ERNS_7SessionEiijNS_13SupportedArchE+0x45)[0x5136b5]
rr(_ZN2rr10RecordTaskC1ERNS_13RecordSessionEijNS_13SupportedArchE+0x29)[0x4cfb69]
rr(_ZN2rr13RecordSession8new_taskEiijNS_13SupportedArchE+0x35)[0x4a5fa5]
rr(_ZN2rr4Task5spawnERNS_7SessionERKNS_8ScopedFdERKNS_11TraceStreamERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISE_SaISE_EESL_i+0x68c)[0x51eb9c]
rr(_ZN2rr13RecordSessionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorIS6_SaIS6_EESD_NS0_16SyscallBufferingENS_7BindCPUE+0x120)[0x4a5000]
rr(_ZN2rr13RecordSession6createERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EESB_NS0_16SyscallBufferingENS_7BindCPUE+0x101c)[0x4a49dc]
rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0xa16)[0x49ca36]
rr(main+0x2ee)[0x53c0be]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7efd5a949830]
rr(_start+0x29)[0x43c689]
=== End rr backtrace
Aborted (core dumped)
root@7b77d892f281:/rr/obj# perf list sw

List of pre-defined events (to be used in -e):

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

rocallahan commented 6 years ago

You're on Ryzen so that isn't going to work. Also it looks like HW performance events aren't enabled for you. Could be docker, could be the enclosing VM.
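A small helper can make it easier to spot this in perf list output by grouping events by the tag in square brackets. This is a sketch that assumes the line layout shown in the listings above. The function name events_by_kind is made up, and treating a missing "Hardware event" group as a sign that hardware counters are unavailable is a heuristic rather than a guarantee.

```python
import re
from collections import defaultdict

# Matches lines such as "  cpu-clock        [Software event]".
EVENT_LINE = re.compile(r"^\s+(?P<names>\S.*?)\s+\[(?P<kind>[^\]]+)\]\s*$")


def events_by_kind(perf_list_output):
    """Group event names from `perf list` text by their bracketed kind."""
    kinds = defaultdict(list)
    for line in perf_list_output.splitlines():
        match = EVENT_LINE.match(line)
        if match:
            # Keep only the first name when aliases are joined with " OR ".
            primary = match.group("names").split(" OR ")[0]
            kinds[match.group("kind")].append(primary)
    return dict(kinds)
```

Feeding it the output of perf list from inside the container and checking whether "Hardware event" is among the keys gives a quick hint before trying rr itself.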

khuey commented 6 years ago

When I said things ran under docker without doing anything special I was wrong. Perhaps I had accidentally set --privileged somewhere. I documented a more minimal set of additional permissions required at https://github.com/mozilla/rr/wiki/Docker
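As a rough illustration of what a narrower alternative to --privileged can look like, the sketch below assembles a docker run command line that adds a single capability and relaxes the seccomp profile. The choice of SYS_PTRACE plus seccomp=unconfined is an assumption based on what ptrace-based tools commonly need, so the wiki page above remains the authoritative list. The image name my-rr-image is hypothetical.

```python
import shlex


def docker_run_argv(image, command, cap_add=("SYS_PTRACE",), seccomp="unconfined"):
    """Build an argv list for `docker run` with extra capabilities.

    The defaults are an assumed starting point, not a verified minimum.
    """
    argv = ["docker", "run", "--rm", "-it"]
    for cap in cap_add:
        argv.append("--cap-add={}".format(cap))
    if seccomp is not None:
        argv += ["--security-opt", "seccomp={}".format(seccomp)]
    argv.append(image)
    argv += list(command)
    return argv


if __name__ == "__main__":
    # "my-rr-image" is a placeholder for an image with rr installed.
    print(shlex.join(docker_run_argv("my-rr-image", ["rr", "record", "ls"])))
```

Passing the list to subprocess.run would launch the container. Printing it first makes it easy to review exactly which permissions are being granted.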