pixiebrix / pixiebrix-extension

PixieBrix browser extension
https://www.pixiebrix.com
GNU Affero General Public License v3.0
83 stars 22 forks source link

[Spike] render_nunjucks error spike #9150

Open Pashaminkovsky opened 2 months ago

Pashaminkovsky commented 2 months ago

Spike details: https://www.notion.so/pixiebrix/Nunjucks-Error-Spike-0283abe59c6a48ad96f5f84715516ec3?showMoveTo=true&saveParent=true

Datadog dashboards:

test edit

mnholtz commented 2 months ago

Spent the afternoon looking into this and collected my observations on the notion document: https://www.notion.so/pixiebrix/Nunjucks-Error-Spike-0283abe59c6a48ad96f5f84715516ec3?showMoveTo=true&saveParent=true

I have not been able to identify root cause, but I propose creating a PR tomorrow that

  1. Increases timeout threshold for postMessage, e.g. from 3s to 5s
  2. Throw more specific errors/add additional logging to increase confidence that these are timeout issues, and not issues with the sandbox getting injected onto the page (I don't think the latter is the case, but it would be nice to have to rule-out that case for sure)

More involved solution/future work would be to come up with a different solution for rendering nunjucks templates, an approach I have not researched and have little context on atm

Pashaminkovsky commented 2 months ago

@mnholtz that sounds good I appreciate you looking into it. Re logging – is it possible to have visibility into the success rate of the mod run when the error is thrown?

mnholtz commented 1 month ago

that sounds good I appreciate you looking into it. Re logging – is it possible to have visibility into the success rate of the mod run when the error is thrown?

Discussed over zoom - the answer is that how much of the mod is able to be run/will run on a simple page refresh/button click/etc. is determined on a per-mod basis based on the mod definition. Generally speaking this error is very disruptive to mod execution and would usually result in the user modifying their workflow in some way to overcome the issue

mnholtz commented 1 month ago

Tentatively putting this back into the backlog until we can get more logging visibility/information from https://github.com/pixiebrix/pixiebrix-extension/pull/9168

mnholtz commented 1 month ago

It looks like increasing the timeout to 5s did not make any impact on the errors: https://app.datadoghq.com/logs?query=issue.id%3A6f1c3b50-4bb1-11ef-a033-da7ad0900002%20%2A%3A%22%22%20&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice%2C%40error.causes%5B1%5D&fromUser=true&messageDisplay=inline&refresh_mode=sliding&storage=hot&stream_sort=service%2Casc&viz=stream&from_ts=1726583812591&to_ts=1727188612591&live=true

And the cause is still mixed; doesn't seem to be an issue with the iframe getting injected onto the page (not a single one of these errors was caused by the new IframeInjectionError), but loading.

mnholtz commented 1 month ago

After further investigation with @grahamlangford, we've identified the following additional hypotheses (both of which are documented on the original notion document):

  1. Large payloads sent to the sandbox to handle are causing users to hit the timeout limit for both the messages with the large payloads, and other in-flight messages in the queue
  2. iframes trigger load events even in the case where they've errored on initialization. It's possible that the iframe is erroring for some reason, and we are trying to message a bad iframe instance without knowing

Following this, we have identified the following next steps for a follow-up PR:

  1. Keep a tally of the current in-flight messages and their payload size to include in the error message, with the goal of validating hypothesis no. 1
  2. Retry iframe injection altogether rather than retrying on just messaging the sandbox iframe
mnholtz commented 1 month ago

Adding re-injection in https://github.com/pixiebrix/pixiebrix-extension/pull/9208 did not seem to have an effect.

Additional information added to error telemetry did not come through to datadog due to the way datadog selects error properties to report (oversight on my part). Will be fixing this in a follow-up commit.

fungairino commented 3 weeks ago

One avenue for us to pursue is to move the sandbox to the offscreen page to avoid issues related to injecting the iframe during brick execution runtime.

POC: https://github.com/pixiebrix/pixiebrix-extension/pull/9352

fungairino commented 3 weeks ago

Another avenue that requires less of an architecture shift is to pre-emptively inject the sandbox onto the page upon content script initialization rather than waiting until the point where a brick requiring it is executing

fungairino commented 2 weeks ago

I merged in some pre-work in the above PR which will make it easier to extend the offscreen page to inject sandbox execution.

Pausing on this for now until we see how the current release does with render_nunjucks errors

fregante commented 1 week ago

Before you moved to the offscreen document, have you seen any initialization errors in the sidebar as well? During the chrome.sidePanel migration we attributed those errors to extension reloads but I don't know if they also happened on production.

grahamlangford commented 1 week ago

Before you moved to the offscreen document, have you seen any initialization errors in the sidebar as well? During the chrome.sidePanel migration we attributed those errors to extension reloads but I don't know if they also happened on production.

I haven't seen anything in our logging, and we haven't had any issues reported in a long time