recommended Linux distribution and docker

google / schedviz

A tool for gathering and visualizing kernel scheduling traces on Linux machines

Apache License 2.0

recommended Linux distribution and docker #15

Closed alaingautherot closed 4 years ago

alaingautherot commented 4 years ago

Hi, I was trying to run schedviz on a plain vanilla centos7 kernel but am having compile errors (like GLIBC_xxx not being found). Has anybody managed to run schedviz natively on centos7?

Alternatively, has anybody been able to use docker and could provide the Dockerfile?

Thanks in advance, Alain

alaingautherot commented 4 years ago

Adding to my own thread. Here's the Dockerfile that I came up with:

FROM docker.io/ubuntu:18.04
MAINTAINER aygauthero@edicogenome.com

RUN apt-get update && apt-get install -y \
  curl \
  locales \
  build-essential \
  time \
  sudo \
  vim-common \
  bc \
  openssl \
  lsb-release \
  moreutils \
  gawk \
  python-minimal \
  unzip \
  zip \
  git \
  libxml2

RUN mkdir  -p /local

# copy from https://github.com/nodesource/distributions/blob/master/deb/setup_12.x
COPY deb_setup_12.x .
RUN bash deb_setup_12.x

# from https://legacy.yarnpkg.com/en/docs/install/#debian-stable
RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
RUN echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list

RUN apt update && apt install -y yarn

RUN cd /local && git clone https://github.com/google/schedviz.git
RUN cd /local/schedviz && yarn install

#WORKDIR /local/schedviz
#CMD [ "yarn", "bazel", "run", "server", "--", "--", "-storage_path=/staging/TRACES" ]
#RUN cd /local/schedviz && yarn bazel run server -- -- -storage_path=/staging/TRACES

I ran this schedviz container on centos7.6 (3.10.0-1062.1.2.el7.x86_64):

sudo docker run --network=host -it -v /staging:/staging -v /sys/kernel/debug:/sys/kernel/debug  --cap-add SYS_ADMIN  XXYYZZ

cd /local/schedviz && yarn bazel run server -- -- -storage_path=/staging/TRACES

<CTRL-Z>

bg

./util/trace.sh -out /staging/TRACES -capture_seconds 20

The trace seems to be captured, but when I open localhost:7402/collections, a protobuf file, trace.tar.gz.binproto, seems to be missing:

root@ussd-tst-drgn05:/local/schedviz# util/trace.sh -out /staging/TRACES -capture_seconds 20
Trace date 2020-01-30--23:46: capture for 20 seconds, send output to /staging/TRACES
Trace capture started at Thu Jan 30 23:46:49 UTC 2020
Waiting 20 seconds
Copying /sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw to /staging/alaing/TRACES/tmp/traces/cpu0
...
Copying /sys/kernel/debug/tracing/per_cpu/cpu99/trace_pipe_raw to /staging/alaing/TRACES/tmp/traces/cpu99
Waiting 5 seconds for copies to complete
Creating tar file
Trace capture finished at Thu Jan 30 23:47:15 UTC 2020
root@ussd-tst-drgn05:/local/schedviz# E0130 23:47:33.001921      48 server.go:126] Internal Server Error:
**Failed to list collection metadata: open /staging/TRACES/trace.tar.gz.binproto: no such file or directory**

Request:
POST /list_collection_metadata?request= HTTP/1.1
Host: localhost:7402
Accept: application/json, text/plain, */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 0
Content-Type: text/plain
Origin: http://localhost:7402
Referer: http://localhost:7402/collections
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36

Is that a known issue, or is my setup (Docker running on CentOS 7) simply not valid?

ilhamster commented 4 years ago

Hi Alain,

It looks like you've got two different paths here for the SV traces, /staging/alaing/TRACES and /staging/TRACES. That might be part of the problem.

Second, does whichever path you use exist? I'm not sure that we make it if it doesn't.

Lee

ilhamster commented 4 years ago

Also, you probably know this, but you don't have to have the SV server running while you trace. You can fire it up afterwards, just point it to the trace dir.

alaingautherot commented 4 years ago

Hi Lee, I changed the traces location and pasted output from different runs, hence the discrepancies. I updated my post to clear up the confusion. Here's what I see in /staging/TRACES:

XXX:/local/schedviz# ls -l /staging/TRACES
total 48
-rw-rw-rw- 1 root 59022 45512 Jan 31 00:21 trace.tar.gz

So the capture seems to work, but the server seems to be looking for a file that was never created. Maybe schedviz requires Linux >= 3.15, since this is when eBPF appeared?

ilhamster commented 4 years ago

It looks like you're running the SV server in a docker. You don't need to.

Moreover, the tracing is full-system; I'm not sure there's an advantage to collection inside a docker, either.

SV doesn't, by default, do any eBPF collection. The trace script you're running works with /sys/kernel/debug/tracing to manage tracepoint collections.

From the looks of your trace, I'd say now just fire up SV, outside the docker, pointing it to the directory with that tar.gz.
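
Since the trace script works through the files under /sys/kernel/debug/tracing rather than eBPF, one quick sanity check on an older kernel is whether the scheduler tracepoints are visible there. The following is a minimal sketch, not part of SchedViz: the function name `check_sched_tracepoints` is hypothetical, and the event list is an assumption about what a scheduling trace typically needs, not a copy of what trace.sh enables.

```shell
# Report which scheduler tracepoints exist under a tracefs root.
# Usage: check_sched_tracepoints [TRACING_ROOT]
# The event list below is an illustrative assumption, not taken from trace.sh.
check_sched_tracepoints() {
  root="${1:-/sys/kernel/debug/tracing}"
  missing=0
  for ev in sched_switch sched_wakeup sched_migrate_task; do
    if [ -d "$root/events/sched/$ev" ]; then
      echo "found: $ev"
    else
      echo "missing: $ev"
      missing=1
    fi
  done
  # Non-zero exit status when at least one event directory is absent.
  return "$missing"
}
```

Run as root (debugfs is usually not readable otherwise), `check_sched_tracepoints` with no argument inspects the default location; passing another directory lets you point it at a differently mounted tracefs.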

sabarabc commented 4 years ago

Are you copying the trace file into the folder or uploading it with the upload trace button in the UI?

The folder you pass as an argument to SchedViz must be empty and is completely managed by it. To add a trace to SchedViz you must use the button on the collections page.
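
One way to avoid mixing captured traces with the server's managed storage is to keep two separate directories and check the storage one before the first start. This is only a sketch under that assumption; `is_empty_dir` and `prepare_storage_dir` are hypothetical helper names, and the check is meant for a first-time setup, since the directory will no longer be empty once SchedViz has stored collections in it.

```shell
# Succeed only if DIR exists and has no entries (dotfiles included).
is_empty_dir() {
  [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Create the storage directory if needed, then refuse to continue
# when it already contains files such as a hand-copied trace.tar.gz.
prepare_storage_dir() {
  mkdir -p "$1" || return 1
  if ! is_empty_dir "$1"; then
    echo "storage path $1 is not empty; capture elsewhere and upload via the UI" >&2
    return 1
  fi
}
```

For example, traces could be captured into /staging/TRACES while the server is started with a different, initially empty `-storage_path`; the trace.tar.gz is then added through the upload button on the collections page.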

alaingautherot commented 4 years ago

I've tried to build schedviz from the host (centos7) but always get a link error from clang9 which is built as part of the process. Using docker was an easy way to work around that issue.

ERROR: /home/aygauthero/.cache/bazel/_bazel_aygauthero/c94fc734936e4aff1c1b50399cc40aec/external/net_zlib/BUILD.bazel:32:1: C++ compilation of rule '@net_zlib//:zlib' failed (Exit 1) clang failed: error executing command external/llvm_toolchain/bin/clang -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 26 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
external/llvm_toolchain/bin/clang: /lib64/libtinfo.so.5: no version information available (required by external/llvm_toolchain/bin/clang)
external/llvm_toolchain/bin/clang: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by external/llvm_toolchain/bin/clang)
external/llvm_toolchain/bin/clang: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by external/llvm_toolchain/bin/clang)
Target //server:server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 1586.821s, Critical Path: 25.66s
INFO: 13 processes: 7 processwrapper-sandbox, 6 worker.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
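
The GLIBCXX lines in this log say that the clang binary fetched by the build expects newer libstdc++ symbol versions than /lib64/libstdc++.so.6 provides. A rough way to confirm which version tags a given libstdc++ carries is sketched below. The helper names `list_glibcxx` and `has_glibcxx` are made up for illustration, and the sketch assumes a grep that supports `-a` and `-o` (GNU grep does).

```shell
# Print the GLIBCXX_* version tags embedded in a libstdc++ shared object.
list_glibcxx() {
  grep -a -o 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Succeed if the library provides exactly the given tag, e.g. GLIBCXX_3.4.21.
has_glibcxx() {
  list_glibcxx "$1" | grep -q -x -F "$2"
}
```

On the failing host, `has_glibcxx /lib64/libstdc++.so.6 GLIBCXX_3.4.21 || echo "too old"` would be expected to print "too old", matching the error above.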

alaingautherot commented 4 years ago

Replying to @sabarabc: I had been copying the file into the folder. I just tried uploading the trace.tar.gz that I captured earlier through the upload button on the collections page instead, and I get this error:

Failed to upload trace file trace.tar.gz
Reason:
 Internal Server Error:
Failed to upload trace file: rpc error: code = InvalidArgument desc = inference error (CPU) between '[Event 666745] CPU policies: [Fail, Fail] State policies: [Fail, Fail] @80057184515        PID   75586 Command: [48->48] Priority: [120->120] CPU: [CPU   9->CPU   9] State: [Running->Sleeping]' and '[Event 666781] CPU policies: [Fail, Fail] State policies: [Fail, Fail] @80057598255        PID   75586 Command: [48->48] Priority: [120->120] CPU: [CPU   1->CPU   1] State: [Unknown->Running]'

Request:
POST /upload HTTP/1.1
Host: localhost:7402
Accept: application/json, text/plain, */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 17910638
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary6WXxrQW5nFnshBHp
Origin: http://localhost:7402
Referer: http://localhost:7402/collections
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36

Does it mean the trace is not 'clean'?

tobyriddell commented 4 years ago

@alaingautherot I was very pleased to find your Dockerfile as I've been looking into running the schedviz server from Docker.

Is there any way to get the server to start without downloading a ton of files from the Internet? I have not used yarn or bazel before, and I am hoping there is a way to tell it to download all the files it needs during the image-building stage and then run the server without further Internet access.

ilhamster commented 4 years ago

It might not. That error means that the trace's events were not entirely compatible with one another; for example, a thread switching in on CPU 1 after it has already been observed to migrate from CPU 1 to CPU 2.

This could be due to a few things:

Misaligned per-CPU clocks (generally rdtsc) such that events reported by different CPUs have different time bases
Some somewhat-unavoidable-and-rare kernel race conditions
Buffer overflow, which happens if you collect for too long with too small a buffer.

I describe a way to try to work around that in this message: https://github.com/google/schedviz/issues/14#issuecomment-572296868. Give that change a try, restart your server, and see if you can import that file. Also, if you've set your buffer really small, consider increasing it, or if your trace duration is too long, reduce it. Generally internally we use 8MiB per-CPU buffers for O(5 second) traces.

Lee
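
For reference, the per-CPU ring buffer size is exposed by the kernel as the `buffer_size_kb` file in the tracing directory, with the value given in KiB. The sketch below sets it from a size in MiB. The function names are hypothetical, and whether util/trace.sh applies its own buffer size afterwards is not something this thread confirms, so check the script's options before relying on this.

```shell
# Convert MiB to the KiB value that tracefs' buffer_size_kb expects.
mib_to_kib() {
  echo $(( $1 * 1024 ))
}

# Write a per-CPU buffer size, given in MiB, to TRACING_ROOT/buffer_size_kb.
# Usage: set_trace_buffer_mib SIZE_MIB [TRACING_ROOT]
set_trace_buffer_mib() {
  mib="$1"
  root="${2:-/sys/kernel/debug/tracing}"
  mib_to_kib "$mib" > "$root/buffer_size_kb"
}
```

With the numbers mentioned above, `set_trace_buffer_mib 8` (as root) requests 8 MiB per CPU, to be paired with a capture of roughly 5 seconds.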

sabarabc commented 4 years ago

@tobyriddell We have a prebuilt docker image as well (experimental though, haven't really tested it). It can be built with bazel build server:server_image.tar. The image will be located at bazel-bin/server/server_image.tar and can be loaded with docker load --input bazel-bin/server/server_image.tar. It can be run with docker run -p 8402:8080 -v `pwd`/data:/data bazel/server:server_image, where `pwd`/data is the location of the storage path folder and 8402 is the port. For this example, SchedViz can be accessed at http://localhost:8402.
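
The three steps above can be strung together as in the sketch below. It assumes bazel and docker are on PATH and that it is run from the root of the schedviz checkout. The function names and the `DRY_RUN` switch are invented for illustration; the image name, the container port 8080, and the /data mount come from the comment above.

```shell
# Print the docker run command for the SchedViz server image.
# Usage: schedviz_run_cmd [HOST_PORT] [DATA_DIR]
schedviz_run_cmd() {
  host_port="${1:-8402}"
  data_dir="${2:-$(pwd)/data}"
  echo "docker run -p ${host_port}:8080 -v ${data_dir}:/data bazel/server:server_image"
}

# Build, load, and run the image. With DRY_RUN=1 the commands are only printed.
# Note: the data directory must not contain spaces, because the run command
# is word-split when it is executed.
schedviz_image_up() {
  run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  $run bazel build server:server_image.tar || return 1
  $run docker load --input bazel-bin/server/server_image.tar || return 1
  $run $(schedviz_run_cmd "$@")
}
```

Calling `schedviz_image_up 8402 "$(pwd)/data"` should then make SchedViz reachable at http://localhost:8402, as described above.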

alaingautherot commented 4 years ago

@sabarabc I just tried what you suggested (bigger buffer, shorter capture duration) and was able to get the trace loaded in the GUI. It doesn't show much, so the capture from my docker container may not be correct. I'll try the docker image you mentioned in one of your posts. Thanks for the tips!

alaingautherot commented 4 years ago

@tobyriddell Regarding starting the server without downloading files from the Internet: this is my next step. I wanted to first capture something and show it in the GUI. Luckily, @sabarabc potentially has a better docker image, so I will give it a try.

tobyriddell commented 4 years ago

Works great! Thank you!

ilhamster commented 4 years ago

Glad the buffer/profile duration changes worked, you're welcome. We need to add logic to clip events in cases like that. One day...

alaingautherot commented 4 years ago

@sabarabc @tobyriddell FYI, bazel 2.0 doesn't work; bazel 1.2.1 seems better. Here's the error I get with bazel 2.0:

WARNING: Output base '/home/XXX/.cache/bazel/_bazel_aygauthero/210281ab89b697872702d47e2090eeed' is on NFS. This may lead to surprising failures and undetermined behavior.
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=245
INFO: Reading rc options for 'build' from /illumina/scratch/DRAGEN/users/aygauthero/GIT/schedviz/.bazelrc:
  'build' options: --host_force_python=PY2 --crosstool_top=@llvm_toolchain//:toolchain --incompatible_strict_action_env --incompatible_new_actions_api=false --experimental_allow_incremental_repository_updates --incompatible_depset_is_not_iterable=false
ERROR: Unrecognized option: --incompatible_depset_is_not_iterable=false
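
A small guard can catch this before a long build starts. The sketch below parses a version string of the form `bazel 1.2.1` and warns on major version 2 or newer. It assumes that `bazel --version` prints that form, and the function names are hypothetical. The cutoff reflects only what was observed in this thread, not a documented compatibility range.

```shell
# Extract the major version from a string such as "bazel 1.2.1".
bazel_major() {
  echo "$1" | sed -n 's/^bazel \([0-9][0-9]*\)\..*/\1/p'
}

# Return 0 for major version 1 or lower, 1 for 2 or newer, 2 if unparsable.
check_bazel_version() {
  major="$(bazel_major "$1")"
  if [ -z "$major" ]; then
    echo "could not parse bazel version from: $1" >&2
    return 2
  fi
  if [ "$major" -ge 2 ]; then
    echo "bazel $major.x may reject flags in this repo's .bazelrc; 1.2.1 was reported to work" >&2
    return 1
  fi
}
```

Typical use would be `check_bazel_version "$(bazel --version)" || exit 1` at the top of a build script.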

alaingautherot commented 4 years ago

@sabarabc @tobyriddell No luck for me with the experimental docker. I get that link error (below), no matter what I do.

May I ask what linux distribution and version you're using?

$ LD_LIBRARY_PATH=/usr/local/lib:/usr/lib:/usr/local/lib64:/usr/lib64 bazel build server:server_image.tar
WARNING: Output base '/home/XXX/.cache/bazel/_bazel_XXX/210281ab89b697872702d47e2090eeed' is on NFS. This may lead to surprising failures and undetermined behavior.
INFO: Analyzed target //server:server_image.tar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /home/XXX/.cache/bazel/_bazel_XXX/210281ab89b697872702d47e2090eeed/external/net_zlib/BUILD.bazel:32:1: C++ compilation of rule '@net_zlib//:zlib' failed (Exit 1) clang failed: error executing command external/llvm_toolchain/bin/clang -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 26 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
external/llvm_toolchain/bin/clang: /lib64/libtinfo.so.5: no version information available (required by external/llvm_toolchain/bin/clang)
external/llvm_toolchain/bin/clang: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by external/llvm_toolchain/bin/clang)
external/llvm_toolchain/bin/clang: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by external/llvm_toolchain/bin/clang)
Target //server:server_image.tar failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /XXX/GIT/schedviz/server/BUILD.bazel:31:1 C++ compilation of rule '@net_zlib//:zlib' failed (Exit 1) clang failed: error executing command external/llvm_toolchain/bin/clang -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 26 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
INFO: Elapsed time: 23.544s, Critical Path: 15.61s
INFO: 2 processes: 2 processwrapper-sandbox.
FAILED: Build did NOT complete successfully

alaingautherot commented 4 years ago

ok, I eventually managed to retrieve and use the docker image on centos7. I had to hack the system a bit: I took the libstdc++.so.6 library from gcc 9.2 (libstdc++.so.6.0.27), copied it to /lib64, and made /lib64/libstdc++.so.6 point to it, like so:

$ ls -l /lib64/libstdc++.so.6*                     
lrwxrwxrwx 1 root root      19 Jan 31 17:09 /lib64/libstdc++.so.6 -> libstdc++.so.6.0.27    
-rwxr-xr-x 1 root root  991616 Aug  6 09:52 /lib64/libstdc++.so.6.0.19                      
-rwxr-xr-x 1 root root 1946976 Jan 31 17:09 /lib64/libstdc++.so.6.0.27                      
lrwxrwxrwx 1 root root      19 Oct 10 10:31 /lib64/libstdc++.so.6.bak -> libstdc++.so.6.0.19

There may be a better, less hacky way to do it, but I'll live with it. Closing the issue now.
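
To check whether a given libstdc++ carries the symbol versions clang complained about, something like this can help. It is a rough sketch: the function name is made up, it assumes GNU grep (for -a and -o), and it only looks for the version tag as a string inside the file, which is the same idea as running strings on the library and grepping for GLIBCXX.

```shell
# Succeed if FILE contains the version tag VERSION (e.g. GLIBCXX_3.4.21).
has_glibcxx() {
  hg_lib="$1"
  hg_want="$2"
  # -a reads the binary as text; -o prints each matched tag on its own line.
  # The second grep requires an exact, whole-line match of the wanted tag.
  grep -a -o 'GLIBCXX_[0-9.]*' "$hg_lib" | grep -x -F "$hg_want" >/dev/null
}

# Usage:
#   has_glibcxx /lib64/libstdc++.so.6 GLIBCXX_3.4.21 || echo "too old for this clang" >&2
```

Going by the errors above, the stock centos7 library (6.0.19) should fail this check for GLIBCXX_3.4.20 and GLIBCXX_3.4.21, and the 6.0.27 one should pass.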

tobyriddell commented 4 years ago

@alaingautherot Apologies for not replying sooner. I am using VirtualBox with an Ubuntu VM downloaded from osboxes.org (Ubuntu 19.10).

Once the VM is running, here are the commands to build the tar file:

sudo apt-get update
sudo apt-get install -y git build-essential unzip curl libtinfo5
curl -sL https://deb.nodesource.com/setup_13.x | sudo -E bash -
sudo apt-get install -y nodejs
git clone https://github.com/google/schedviz.git
cd schedviz
yarn bazel build server:server_image.tar