open-telemetry / opentelemetry-erlang

OpenTelemetry Erlang SDK
https://opentelemetry.io
Apache License 2.0

opentelemetry seems to not be starting in released code #458

Closed. smashedtoatoms closed this issue 2 years ago.

smashedtoatoms commented 2 years ago

I am positive that this is user error on my side at this point. I am desperately trying to send some data to honeycomb. It's working locally. It doesn't work in prod. I have posted a bunch of details here. If you have any suggestions, I am all ears. I really have no idea why nothing is working in prod, and at this point, I am not even sure who I should be asking.

kw7oe commented 2 years ago

Assuming the code you use is as mentioned in the fly community forum, perhaps try appending the port number 443 to the end of your otlp_endpoint? I vaguely remember that the exporter uses a default OTLP port that might not work with Honeycomb. (I might be wrong; I haven't really checked through the code yet.)

For more detail, you can refer to this example, specifically the config/runtime.exs file.

Updates:

I checked both the opentelemetry-exporter README and the document you followed based on your fly community post; both code examples have :443 appended to the end of the Honeycomb URL:

# runtime.exs
config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "https://api.honeycomb.io:443",
  otlp_headers: [
    {"x-honeycomb-team", System.get_env("HONEYCOMB_API_KEY")},
    {"x-honeycomb-dataset", System.get_env("FLY_APP_NAME")}
  ]

It might be worth trying this out to see if it works for you.

tsloughter commented 2 years ago

To be clear from your fly.io forum post, you aren't seeing any spans on fly.io even when running with the stdout exporter?

Are you deploying a release? Can you show your deps?

tsloughter commented 2 years ago

I'd also suggest setting the Elixir logger level to :debug in prod to see if there are any logs from otel there.
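
For example, a minimal sketch of that (in a release this would go in config/runtime.exs):

# config/runtime.exs
config :logger, level: :debug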

smashedtoatoms commented 2 years ago

I have tried it with and without port 443. Even if that were the issue, I suspect the stdout exporter doing nothing means the problem occurs before the endpoint would have a chance to fail. I did set the logger level to :debug and it logged the node syncing messages, but not much else. I also built a prod release, packaged it into the docker container that runs in fly, ran the docker container locally, and it works fine and logs opentelemetry data. Here are my deps:

  defp deps do
    [
      {:credo, "~> 1.6", only: [:dev, :test], runtime: false},
      {:phoenix, "~> 1.6.10"},
      {:phoenix_html, "~> 3.0"},
      {:phoenix_live_reload, "~> 1.2", only: :dev},
      {:phoenix_live_view, "~> 0.17.5"},
      {:floki, ">= 0.30.0", only: :test},
      {:phoenix_live_dashboard, "~> 0.6"},
      {:esbuild, "~> 0.4", runtime: Mix.env() == :dev},
      {:swoosh, "~> 1.3"},
      {:telemetry_metrics, "~> 0.6"},
      {:telemetry_poller, "~> 1.0"},
      {:gettext, "~> 0.18"},
      {:jason, "~> 1.2"},
      {:plug_cowboy, "~> 2.5"},
      {:opentelemetry_exporter, "~> 1.0"},
      {:opentelemetry, "~> 1.0"},
      {:opentelemetry_api, "~> 1.0"},
      {:opentelemetry_liveview, "~> 1.0.0-rc.4"},
      {:opentelemetry_phoenix, "~> 1.0"}
    ]
  end

I spun this into the default phoenix app scaffold and can duplicate the behavior. It can be found here. Please excuse the profanity. I was so excited to use elixir on fly, and this whole experience has completely taken the wind out of my sails. Thanks for even looking at this. I suspect it has nothing to do with anything on your end.

tsloughter commented 2 years ago

@smashedtoatoms see what happens if you set opentelemetry: :permanent in your release configuration.

Maybe it is crashing and not showing the reason because of this being set to temporary.

I'll need to update the docs to make clear the problems of using the temporary mode.
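
For reference, a minimal sketch of where that setting lives, assuming an app named :my_app (the release's applications list in mix.exs):

# mix.exs (sketch)
def project do
  [
    app: :my_app,
    version: "0.1.0",
    elixir: "~> 1.13",
    deps: deps(),
    releases: [
      my_app: [
        applications: [
          # :permanent means a crash in the SDK brings the node down,
          # surfacing the crash reason instead of failing silently
          opentelemetry_exporter: :permanent,
          opentelemetry: :permanent
        ]
      ]
    ]
  ]
end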

tsloughter commented 2 years ago

I'd also make sure you are getting the latest versions: 1.1.0 of the API, 1.1.1 of opentelemetry (the SDK), and 1.2.0 of the exporter hex packages.
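
In deps terms, a sketch of requirements that would pull those versions in:

# mix.exs deps (sketch)
{:opentelemetry_api, "~> 1.1"},
{:opentelemetry, "~> 1.1"},
{:opentelemetry_exporter, "~> 1.2"}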

tsloughter commented 2 years ago

Oh, nm, I see in your example that it is locked to those versions. So I guess it's best to try the permanent thing to hopefully see where it is crashing, if it is.

smashedtoatoms commented 2 years ago

Yea, I updated it just to make sure. I set that to permanent last night as well, but I tried it again just to be sure. It doesn't seem to make a difference. Here are the debug logs on startup:

2022-09-11T19:55:31Z runner[68b8cf47] dfw [info]Starting instance
2022-09-11T19:55:31Z runner[68b8cf47] dfw [info]Configuring virtual machine
2022-09-11T19:55:31Z runner[68b8cf47] dfw [info]Pulling container image
2022-09-11T19:55:32Z runner[68b8cf47] dfw [info]Unpacking image
2022-09-11T19:55:33Z runner[68b8cf47] dfw [info]Preparing kernel init
2022-09-11T19:55:33Z runner[68b8cf47] dfw [info]Configuring firecracker
2022-09-11T19:55:33Z runner[68b8cf47] dfw [info]Starting virtual machine
2022-09-11T19:55:33Z app[68b8cf47] dfw [info]Starting init (commit: 249766e)...
2022-09-11T19:55:33Z app[68b8cf47] dfw [info]Preparing to run: `/app/bin/server` as nobody
2022-09-11T19:55:33Z app[68b8cf47] dfw [info]2022/09/11 19:55:33 listening on [fdaa:0:869c:a7b:cf99:68b8:cf47:2]:22 (DNS: [fdaa::3]:53)
2022-09-11T19:55:35Z app[68b8cf47] dfw [info]19:55:35.930 [info] Running UghWeb.Endpoint with cowboy 2.9.0 at :::8080 (http)
2022-09-11T19:55:35Z app[68b8cf47] dfw [info]19:55:35.932 [info] Access UghWeb.Endpoint at https://crimson-smoke-6983.fly.dev
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]19:56:21.266 request_id=FxPmjSITXxx5z5EAAACx [info] GET /
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]19:56:21.267 request_id=FxPmjSITXxx5z5EAAACx [debug] Processing with UghWeb.PageController.index/2
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]  Parameters: %{}
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]  Pipelines: [:browser]
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]19:56:21.271 request_id=FxPmjSITXxx5z5EAAACx [info] Sent 200 in 4ms

tsloughter commented 2 years ago

I guess I'll set up fly because this baffles me :)

smashedtoatoms commented 2 years ago

Hahaha, what if that's their plan all along?

tsloughter commented 2 years ago

Hah!

Well, my only success is ending up with a question: why do I not have to set a SECRET_KEY_BASE when deploying to fly, but I do when running the same Dockerfile image locally?

That made me worry they aren't actually running the same thing somehow. Or does fly know to create that environment variable?

smashedtoatoms commented 2 years ago

If you run fly secrets list, you'll see it there. It is added automatically when you run fly launch and it detects that it is a phoenix project. The magic of fly.io is typically REALLY underwhelming when you find out what they're doing, and I mean that in the best way possible. It's part of what is making this whole thing so frustrating for me.

smashedtoatoms commented 2 years ago

So I just got a VPN hooked to my private network on Fly.io and opened up Observer against the instance of my app running in Fly... opentelemetry is running. Have I mentioned I have no idea what is going on?

[Screenshot: Observer showing the opentelemetry application running on the Fly.io instance]

tsloughter commented 2 years ago

@smashedtoatoms can you check the ets tables? There will be tables with spans inserted into them if the batch processor is working.

smashedtoatoms commented 2 years ago

iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)1> :ets.all()
[:logger, :ac_tab, #Reference<0.1312889380.1588723713.262071>, :code_names,
 :inet_db, :inet_cache, :inet_hosts_byname, :inet_hosts_byaddr,
 :inet_hosts_file_byname, :inet_hosts_file_byaddr, :inet_sockets,
 #Reference<0.1312889380.1588723713.262094>, :global_locks, :global_names,
 :global_names_ext, :global_pid_names, :global_pid_ids,
 :global_lost_connections, #Reference<0.1312889380.1588723713.262103>,
 :sys_dist, :file_io_servers, :elixir_config, :elixir_modules, :ssl_pem_cache,
 #Reference<0.1312889380.1588985857.99>,
 #Reference<0.1312889380.1588985857.100>,
 #Reference<0.1312889380.1588985857.101>,
 #Reference<0.1312889380.1588985857.102>,
 #Reference<0.1312889380.1588985857.103>,
 #Reference<0.1312889380.1588985857.104>, :dtls_listener_sup, :gproc,
 :gproc_monitor, :channels_table, :httpc_manager__session_db,
 :httpc_manager__handler_db, #Reference<0.1312889380.1588985857.135>,
 :tls_certificate_check_shared_state, :otel_export_table1, :otel_export_table2,
 :otel_span_table, :telemetry_handler_table, Plug.Crypto.Keys, Plug.Keys,
 Plug.Upload.Dir, Plug.Upload.Path, Phoenix.PubSub, :ranch_server,
 :cowboy_clock, Ugh.PubSub, ...]
iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)5> :ets.tab2list(:otel_span_table)   
[]
iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)6> :ets.tab2list(:otel_export_table1)
[]
iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)7> :ets.tab2list(:otel_export_table2)
[]

smashedtoatoms commented 2 years ago

So if everything was working, would those empty lists have something in them?

tsloughter commented 2 years ago

@smashedtoatoms after creating a span there should be a span in otel_span_table for a few seconds, because the default batch timeout is 5 seconds -- though it won't always be there, since it could be inserted 4.99 seconds into the interval and be gone .01 seconds later.

tsloughter commented 2 years ago

Sorry, a little confusing there.

When a span is started it is in otel_span_table. After it ends it is added to either otel_export_table1 or otel_export_table2 where it stays until it goes through the exporter (which happens by default every 5 seconds).
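
A quick way to watch all three tables at once from a remote iex shell (a sketch, using the table names from the :ets.all() output above):

# poll the otel table sizes; run repeatedly while generating requests
for table <- [:otel_span_table, :otel_export_table1, :otel_export_table2] do
  {table, :ets.info(table, :size)}
end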

smashedtoatoms commented 2 years ago

Ok, so I just set up a shell script to call the API twice a second, kicked it off, and then spammed those tables. Would your expectation be that I'd see something in there?

iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)1> :ets.tab2list(:otel_span_table)
[]
iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)2> :ets.tab2list(:otel_span_table)
[]
... (same call repeated through prompt 19, always returning [])
iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)20> :ets.tab2list(:otel_export_table1)
[]
iex(crimson-smoke-6983@fdaa:0:869c:a7b:cf99:f553:20aa:2)21> :ets.tab2list(:otel_export_table1)
[]
... (same call repeated through prompt 36, always returning [])

smashedtoatoms commented 2 years ago

Ok, I tweaked the setup a little more to make it as easy as possible to reproduce the problem. If you pull the latest from here, you will get a version that builds a docker image with a release in it. It can be run locally or deployed to fly with no modifications. If you make a request against the local one, you will get a span printed to stdout. If you make a request against the one running on fly.io, no span is printed to stdout. If you're on Linux, you can use fly deploy --local-only to guarantee that your local build and the one running in fly are identical. It works locally; it logs no spans on Fly.

Docker Commands

docker build -t please .
docker run -e SECRET_KEY_BASE='8+0M2NIsyUvHK+8z1NhtE++YfUAKFPJxoYxPXbmIR0WHb0vDxumlM5nCf39rVHNX' -e FLY_APP_NAME='local-fly-app' -it -p 4000:4000 please /app/bin/server

Fly.io Commands

Note: to run it on fly you'll need to change the app name on line 3 of the fly.toml file.

fly auth login
fly deploy 
fly deploy --local-only # if you're on linux and want to deploy the exact image that you're running locally

akasprzok commented 2 years ago

I replied to you on the fly forum, but here's easier -

fwiw I had traces to Honeycomb working without issue through fly, though I'm currently sending them to Grafana Cloud.

What I don't see in your setup is the x-honeycomb-dataset header that I thought was required by Honeycomb.

akasprzok commented 2 years ago

Maybe the config for temporaryhack.com would be useful?

smashedtoatoms commented 2 years ago

It isn't getting that far. I actually removed the entire exporter and am just writing the spans to stdout at this point. I wanted to remove networking as a possible thing to troubleshoot.

akasprzok commented 2 years ago

Could just be silently dropping the response from HC.

IIRC I saw some silent errors from setting that header to a non-existent dataset when the API key didn't have permission to create said dataset.

Have you tried the pollinators slack? That's probably the best place to find help for your problem.

akasprzok commented 2 years ago

NVM, I can't read good. You're saying it's not even spitting out the stdout traces when deployed to fly. That's odd.

smashedtoatoms commented 2 years ago

HAHAHA, yea, it's super weird. I have it set up at this point where I can run it locally and see the spans log to stdout, deploy to fly, and get nothing. It's really weird. It's still very likely to me that something is misconfigured, I just can't figure out what.

dvic commented 2 years ago

Have you tried running locally as a release? Can you reproduce the problem like that?

tsloughter commented 2 years ago

@dvic the problem does not occur even when running the release in the same docker image that runs on fly.io.

I built the docker image locally and ran it and everything worked fine, but when I deploy it to fly.io I see no spans, neither in the console nor in honeycomb (both console and honeycomb work locally).

dvic commented 2 years ago

> @dvic the problem does not occur even when running the release in the same docker image that runs on fly.io.
>
> I built the docker image locally and ran it and everything worked fine, but when I deploy it to fly.io I see no spans, neither in the console nor in honeycomb (both console and honeycomb work locally).

Ah, I see. Just to be sure (this has happened to me more than once): are you 100% positive the http call is arriving on fly? (correct app, etc.)

smashedtoatoms commented 2 years ago

Hey @dvic, thanks for taking a look. It's consistently failing. If you use these instructions you can see it for yourself. I haven't seen it drop any messages except when I was messing with scaling.

smashedtoatoms commented 2 years ago

So Jon Anderson in the Honeycomb Pollinators Slack channel suggested setting the opentelemetry config to this:

config :opentelemetry,
  sampler: :always_on,
  span_processor: :batch,
  traces_exporter: {:otel_exporter_stdout, []}

I did that, and it works both locally and in fly. The magic is the sampler: :always_on. Do you have any idea why the default might not work in the container on fly but works in the container locally?

smashedtoatoms commented 2 years ago

Alright, I think I know what is going on. As of a month ago, fly-proxy injects the following headers into requests to apps running on fly: traceparent, tracestate

These are standard headers. It looks like the http lib specifically looks for them, extracts them, and sets the current otel context to the “remote otel context” it just extracted from the headers.

Since the default setting is “parent-based sampling”, and the fly default sample rate is <0.02%, the spans get dropped.

Basically, it's working as designed. It sounds like Fly is going to make those headers opt-in. Sorry for wasting so much of everyone's time.
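
(For context: a W3C traceparent header ends in a trace-flags field, where 01 means the parent was sampled and 00 means it was not. A sketch of the kind of unsampled header fly-proxy would inject, using the example IDs from the W3C spec:)

# format: version-traceid-parentid-traceflags
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-00
# trace-flags 00 = parent not sampled, so a parent-based sampler drops the child span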

tsloughter commented 2 years ago

@smashedtoatoms WHAT!? That is wild, haha, just completely wild...

tsloughter commented 2 years ago

And yea, always_on works because the default is parent_based+always_on. always_on by itself ignores the traceparent and creates a span even though the parent is not sampled.
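
Spelled out in config, a sketch of both behaviors (option keys per the otel_sampler_parent_based module; worth double-checking against your SDK version):

# the implicit default is roughly equivalent to:
config :opentelemetry,
  sampler: {:parent_based, %{root: :always_on}}

# keeping parent_based but overriding the decision for unsampled remote
# parents (which defaults to :always_off) should also restore the spans:
config :opentelemetry,
  sampler: {:parent_based, %{root: :always_on, remote_parent_not_sampled: :always_on}}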

smashedtoatoms commented 2 years ago

They changed the injected headers to an internal header, so all is well. Thanks so much everyone!