Template errors result in long development/production response times

agjohnson commented 10 months ago

When a template that was included via {% include %} raises an exception, the request response time explodes. I originally noticed this in local development, but I believe that in https://github.com/readthedocs/readthedocs.org/issues/11061 we might have confirmed that this is a bug in production as well (so far maybe only affecting the beta instance).

In local development, I noticed:

Postgres container eats up 100% CPU doing what looks like repeated selects on the project table
The browser takes 60s+ to respond, or times out
The Docker logs for web container even take about 30-60s to output the stack trace

There is nothing special about my set up though, and my database is 27M.

In debugging https://github.com/readthedocs/readthedocs.org/issues/11061 I noticed:

The database query takes 20 times longer to execute
The response time is over 10s, but is not all DB time
Profiling shows most of the time spent in logging and related modules
I was not able to confirm where code time was spent in production transactions

To reproduce:

Open a template that is included in another template via {% include %}
Throw a template exception from the included template
- Remove load i18n from the template but add/retain a translated string
- Access a missing attribute, ie project.foo.bar
Load the view that would trigger the template
Notice long response time and long query time in django debug toolbar

agjohnson commented 10 months ago

@humitos Have you noticed this at all?

humitos commented 10 months ago

Yes, I've experienced this as well. My guess is because how we are formatting the exceptions in the console using structlog. I've seen the console outputting a lot of text due that all the traceback is formatted with nice colors, all the context variables and more. I found myself this pretty useful, but sometimes pretty distracting and annoying to understand the error... There is a trade-off there.

humitos commented 10 months ago

@agjohnson are you able to reproduce the issue after commenting this line? https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/settings/base.py#L866

agjohnson commented 10 months ago

Okay, well at least I'm not the only one :sweat_smile:

I found myself this pretty useful, but sometimes pretty distracting and annoying to understand the error

I really like the formatting and especially the additional context/locals information, but the stack trace is way to long to be useful usually, at least when working with templates. It's hard to find something useful in the stack trace when the exception starts in the template rendering (which was also always sort of the case with Django though).

are you able to reproduce the issue after commenting this line?

I am, though without the structlog rendering of the stack trace, the default stack trace isn't as delayed. But it still takes a long time for a response and trace, and the database is still getting hit hard with queries.

Edit: actually, the structlog change does seem to help some too, I get 504 responses very frequently instead without that change. I think the culprit is probably the DB CPU ultimately though.

agjohnson commented 10 months ago

The database seems to be hit a project query repeatedly:

If I fix the template error and reload, this query is executed, but only 2 times and response time is ~1ms

agjohnson commented 10 months ago

Ah I learned a new thing about debug toolbar! You can inspect/see the code and stack around the query in the debug toolbar SQL pane. This is a clue. When working properly, the query shows the template code that calls the query. When there is an exception raised, this is the stack around the query being executed repeatedly:

I disabled cacheops and the same thing happens, so it's not that layer. So seems next candidate is Django debug output?

Edit: disabling DEBUG does not fix this either. Now I'm really confused.

agjohnson commented 10 months ago

Yet another clue:

If I edit a top level template (projects/project_dashboard.html) and remove load i18n, the response is very quick
If I edit an include from this top level file (projects/partials/project_list.html) and remove load i18n, I notice the bug and the database is hit with looped queries.

agjohnson commented 10 months ago

I found this discussion, and hoped it was related, but so far nothing there seems to help:

https://forum.djangoproject.com/t/changing-queryset-repr-to-avoid-unexpected-queries/21272/22?page=2

I've disabled debug toolbar, logging, and debug mode entirely. This must be something related to include or local vars there.

agjohnson commented 8 months ago

I believe https://github.com/readthedocs/readthedocs.org/issues/11061 is replicating this behavior in production, but waiting to confirm. If confirmed, we should probably get a fix for this issue on the roadmap.

agjohnson commented 8 months ago

Noted here:

https://github.com/readthedocs/readthedocs.org/issues/11061#issuecomment-1918352645

The PR fixing 11061 confirms (to me) that this bug is affecting production as well. At very least the beta instance is affected by the bug. I haven't tested the current dashboard in production yet.

I'll update the description here, but the bug is that a template that is evaluated via {% include %} and throws a template exception will cause an abnormal response time, and will time out in production.

I've escalated this to a bug on our roadmap. Priority is medium -- the effect is really bad but it's only triggered when templates trigger exceptions.

humitos commented 8 months ago

I'm not sure if it's related or not, but I think it's worth to note that we are using cached.Loader in production but not on development: https://github.com/readthedocs/readthedocs.org/blob/505ee4c3030f00bac16068ddec6b817adc8e7a54/readthedocs/settings/base.py#L397-L411

humitos commented 7 months ago

To reproduce:

Open a template that is included in another template via {% include %}

Throw a template exception from the included template

Remove load i18n from the template but add/retain a translated string

Access a missing attribute, ie project.foo.bar

Load the view that would trigger the template

Notice long response time and long query time in django debug toolbar

I tried this by removing the load i18n from projects/partials/header.html (ext-theme) and I got a ~2.5s response: Elapsed time 2684.479 msec. It raised TemplateSyntaxError: Invalid filter: 'language_name_local'. Note that I'm --webpack --ext-theme.

Without using --webpack --ext-theme and removing the load i18n from core/project_bar.html I got a ~2.5s response as well: Elapsed time 2364.131 msec. It raised TemplateSyntaxError: Invalid block tag on line 6: 'trans', expected 'elif', 'else' or 'endif'. Did you forget to register or load this tag?

It seems I'm not able to reproduce this issue. Maybe there is something else going on here. I'm using just origin/main without any modification.

agjohnson commented 7 months ago

Your results sounds about right for reproducing locally. What does the page response time look like without errors?

In production, as the database is larger, response times from this are closer to 30-60s. Locally it's more like 2-4s, for me.

I noted a template to test here I doubt it will give different response times but probably worth testing the same thing.

Also, this is less just about the response time. Noted above, but the the underlying issue seems to be a loop in Django logic and in SQL queries. You can see this on DDT:

The SQL query seems to execute over and over, I see this locally and in production

humitos commented 7 months ago

I performed the test you mentioned and I have Elapsed time 4808.825 msec and default 1477.74 ms (217 queries including 212 similar and 212 duplicates )

However, without a template raising an exception: Elapsed time 742.940 msec and default 86.98 ms (101 queries including 94 similar and 62 duplicates )

humitos commented 7 months ago

I found the issue. It's what I guessed it could be in https://github.com/readthedocs/readthedocs.org/issues/10954#issuecomment-1856227684

Disabling the "nice logging output", I get a similar CPU times and only 12 SQL queries more when it returns 500 (I suppose these are performed to show the traceback in the Django 500 debug page).

I pushed this changes to https://github.com/readthedocs/readthedocs.org/commit/592c108af4cf7678421c00b02692081bc06a020c to show how to disable this nice logger.

agjohnson commented 7 months ago

Nice! I had tried a similar fix that you suggested (here though the permalink looks off), but it didn't help much. Maybe your fix above affects things at a deeper level.

We can test this rather easily. On web-extra we can replicate #11061 or something similar quite easily. With your fix, I would at least expect the response time to be better.

humitos commented 7 months ago

On web-extra we can replicate #11061 or something similar quite easily

Production uses a completely different set of loggers, tho. Mainly syslog, newrelic and sentry (see https://github.com/readthedocs/readthedocs-ops/blob/4872584858e229cdb33c57dd7381f9e1e1563746/salt/base/readthedocs/settings/base.py#L522-L632) --none of those are used on development.

However, they may be being affected by the same issue. Also note that this could be a structlog or django-structlog bug. We are not using the latest versions of them: https://github.com/readthedocs/readthedocs.org/blob/8b66ce9bb035c973c1e59d1c62bd91da826b6f71/requirements/pip.in#L164-L174

We may want to read their changelogs and give it another try to upgrade those packages before going deeper and deeper on debugging this. It may have fixed already.

humitos commented 7 months ago

We may want to read their changelogs and give it another try to upgrade those packages before going deeper and deeper on debugging this. It may have fixed already.

I created an initial PR for this change at #11166 -- we can use it to do some local QA and see if it works better.

humitos commented 7 months ago

Production uses a completely different set of loggers, tho. Mainly syslog, newrelic and sentry (see readthedocs/readthedocs-ops@4872584/salt/base/readthedocs/settings/base.py#L522-L632) --none of those are used on development.

Thinking a little more about this, I think all those SQL queries are executed by the sentry logger propbably, since it shows the exact same traceback with context variables (Stack trace: https://read-the-docs.sentry.io/issues/5019056329/?project=148442&query=is%3Aunresolved+%21message%3A%22%21message%3A%22%21message%3A%22SystemExit%22+%21message%3A%22frame-ancestors%22&referrer=issue-stream&statsPeriod=14d&stream_index=21#exception) that the Django 500 debug page. So, I don't think we will be able to remove those queries from here unless we remove that feature from Sentry logging.

agjohnson commented 7 months ago

This can be tested further on web-extra with a template change and dropping the settings for sentry et al

humitos commented 7 months ago

I tried removing sentry and forcing the load i18n issue on the template, but I had problems testing this because the ext-theme is installed using pip+git and it's not installing in an editable way. I attempt to uninstall it and re-installing it, but I broke something and the instance started using the old dashboard 🙃 .

I re-deployed this instance and gave it another try and I failed again. I may leave this for you @agjohnson in case you know how to test this in production.

I edited this file lib/python3.10/site-packages/readthedocsext/theme/templates/projects/partials/project_list.html on util01.org
I removed the sentry logging from readthedocs/settings/managed.py
supervisorctl restart all

I'm getting

Bad gateway Error code 502
Visit cloudflare.com for more information

Reverting the changes in the template, the instance works again.

agjohnson commented 7 months ago

I attempt to uninstall it and re-installing it, but I broke something and the instance started using the old dashboard

Yeah, I've encountered this, the package is not installed as editable. All that should be required is:

% pip uninstall readthedocsext-theme
% pip install -e ~/checkouts/ext-theme/
% # edit templates
% supervisorctl restart all

I think that should be enough to unblock you, so I'll leave you with assignment. But let me know if I can point you in the right direction if you're stuck.

humitos commented 7 months ago

Yeah, that's what I did and it didn't work.

agjohnson commented 7 months ago

I just tested this and it worked. :shrug:

I was able to edit the base template and the changes showed in the beta instance.

humitos commented 7 months ago

I may not be sure what I'm looking for here, then. With "it didn't work" I meant that I still got Bad gateway Error code 502, but maybe that's expected since there was an error in the application?

So, let's rewind a little bit here and assume that the steps from https://github.com/readthedocs/readthedocs.org/issues/10954#issuecomment-1976941744 are correct, settings and templates changes are picked correctly. Then, what I should do? or what I should be looking at?

agjohnson commented 7 months ago

I may not be sure what I'm looking for here, then. With "it didn't work" I meant that I still got Bad gateway Error code 502

I think I'm just confused by your reply. I understood "it didn't work" to mean that the application was still not starting or the new dashboard was not used. Do you have a working instance with the new dashboard when you make the change I posted above?

As long as the 502 response is a template error in the application logs, then it sounds like you have replicated the issue, no? I would also expect a very long response time before eventual timeout and Cloudflare returning the 502.

Have you had look into the issue I noted above?

11061

That issue has the same response you are describing. The underlying cause should be an excessive response time.

I edited this file lib/python3.10/site-packages/readthedocsext/theme/templates/projects/partials/project_list.html on util01.org

Why util01.org? I assume this is incorrect and you're testing this on ext or ext-theme, correct?

I removed the sentry logging from readthedocs/settings/managed.py

I'm not quite following your updates here, but it seems to me that you have replicated the issue I've described here and removing sentry just did not fix the issue -- the template exception is still resulting in a long response time and timeout at Cloudflare.

Does this not match what you are noticing?

humitos commented 7 months ago

I understood "it didn't work" to mean that the application was still not starting or the new dashboard was not used.

Oh, I see. The instance was starting and using the new dashboard properly. It was just giving 502, tho.

Do you have a working instance with the new dashboard when you make the change I posted above?

No, I was doing these changes on the only instance we have in the web-ext-theme ASG.

Why util01.org? I assume this is incorrect and you're testing this on ext or ext-theme, correct?

Yes, this was a "brain typo" 😄 . I'm testing on ext-theme.

humitos commented 7 months ago

Settings diff I'm using:

docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/readthedocs.org$ diff -u readthedocs/settings/managed.py.orig readthedocs/settings/managed.py
--- readthedocs/settings/managed.py.orig    2024-03-05 11:12:44.246182303 +0000
+++ readthedocs/settings/managed.py 2024-03-05 11:13:21.966164143 +0000
@@ -443,33 +443,6 @@
             'formatter': 'newrelic',
         }

-        # Sentry logging
-        logging['formatters']['sentry'] ={
-            "()": structlog.stdlib.ProcessorFormatter,
-            "processors": [
-                structlog.stdlib.ProcessorFormatter.remove_processors_meta,
-                SentryProcessor(
-                    event_level=ERROR,
-                    as_context=False,
-                    tag_keys="__all__",
-                    ignore_loggers=[
-                        'django_structlog.celery.receivers',
-                        'django_structlog.middlewares.request',
-                        'django.security.DisallowedHost',
-                        'celery.worker.strategy',
-                        'celery.worker.consumer.mingle',
-                        'celery.worker.consumer.connection',
-                    ],
-                ),
-            ],
-            "foreign_pre_chain": shared_processors,
-         }
-        logging['handlers']['sentry'] = {
-            'level': 'ERROR',
-            'class': 'logging.handlers.SysLogHandler',
-            'formatter': 'sentry',
-        }
-
         logging['handlers']['syslog'] = {
             'level': 'INFO',
             'class': 'logging.handlers.SysLogHandler',
@@ -481,7 +454,7 @@
         }
         logging['loggers'] = {
             '': {
-                'handlers': ['console', 'syslog', 'sentry', 'newrelic'],
+                'handlers': ['console', 'syslog', 'newrelic'],
                 'level': 'DEBUG',
             },
             'django_structlog.celery.receivers': {
docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/readthedocs.org$

Theme diff:

docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/ext-theme$ git diff
WARNING: terminal is not fully functional
Press RETURN to continue 
diff --git a/readthedocsext/theme/templates/projects/partials/project_list.html b/readthedocsext/theme/templates/projects/partials/project_list.html
index 86a22f3..0e3d716 100644
--- a/readthedocsext/theme/templates/projects/partials/project_list.html
+++ b/readthedocsext/theme/templates/projects/partials/project_list.html
@@ -1,6 +1,5 @@
 {% extends "includes/crud/table_list.html" %}

-{% load i18n %}
 {% load humanize %}

 {% load privacy_tags %}
docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/ext-theme$

With this configuration and running supervisorctl restart all I'm getting a 502. However, I'm not seeing anything in the logs under /var/log/readthedocs/*. Note that I'm able to access https://beta.readthedocs.org/support/ for example, which is a page that doesn't depend on the project_list.html file --so, the instance is up and running 👍🏼

humitos commented 7 months ago

I left only console logging handler in production and I'm still receiving the 502. So, this may not be related to logging as I originally suggested/thought 🤷🏼 . I disabled logging completely with https://docs.djangoproject.com/en/5.0/ref/settings/#std-setting-LOGGING_CONFIG and still getting 502.

I ran out of ideas here 😕 . Something else may be happening here.

agjohnson commented 7 months ago

Just to clarify, when you say 502, are you noticing a long response time too?

I think I debugged the build list issue with pyinstrument in production, or if not, it might be worth a shot. It's easy to plug in to Django, though I forget the exact steps.

humitos commented 7 months ago

Just to clarify, when you say 502, are you noticing a long response time too?

Yes, I'm getting the Cloudflare 502 error page and it takes some time to display that page. However, I can't say for sure that's more time that just loading the dashboard when it works, since it takes ~6s from my side to get the initial HTML page.

agjohnson commented 7 months ago

Hah yeah, that page response is pretty bad already. Is there a difference in load time if you don't apply your fix versus if you do apply your fix?

ericholscher commented 6 months ago

Waiting on a meeting to discuss possible issues.

agjohnson commented 6 months ago

We have a theory now for why this is happening, but it's still proving really hard to verify.

@humitos noticed that many of the model __str__ methods include relation lookups -- for example the Project model returns self.users.username. On the template error page, and perhaps also in Sentry or New Relic, these relations would be evaluated as part of fetching the local context in the stack trace.

I am assuming that if this is the error that we should be able to produce the timeout behavior in the normal dashboard as well, on web-extra using similar template exceptions.

What I'm a little confused by is that I can no longer reproduce the error this PR solved: https://github.com/readthedocs/ext-theme/pull/272/files -- accessing a build.version.verbose_name inside an include when the build is build.version = None doesn't produce an exception anymore. Perhaps some template handling has changed recently?

agjohnson commented 6 months ago

Ah, nevermind. I was not testing the fix exactly and default is a special case and does throw an exception on AttributeError.

So, speaking to my comment above, I am not seeing a huge difference on the legacy dashboard between a view with an exception and a view without an exception. The extra queries seem very likely to be linked to our __str__ methods, but it's not clear if that alone is causing the timeout.

agjohnson commented 6 months ago

And finally, with the above clues and some temporary patches to various __str__ methods, I would expect the beta dashboard to reduce the number of queries when hitting an exception in templates, but this is not the case.

Without an exception

With an exception

With the default:build.version.verbose_name bug reimplemented

The SQL queries are still deeply duplicated:

And theses queries still come from __repr__/repr()/pprint and seemingly just from the local context evaluation:

So it seems that both the current __str__ methods seem like a bug, and something else is happening on beta.

agjohnson commented 6 months ago

So! This conversation looked very related:

https://groups.google.com/g/django-developers/c/WzRZ0IfBipc

It mentions much of the same issues, and got me thinking that we didn't try patching any of these QuerySet/Manager repr methods yet. So, I have two patches live in beta currently:

diff --git a/readthedocsext/theme/templates/includes/elements/chips/build.html b/readthedocsext/theme/templates/includes/elements/chips/build.html
index 0d48789..7d90511 100644
--- a/readthedocsext/theme/templates/includes/elements/chips/build.html
+++ b/readthedocsext/theme/templates/includes/elements/chips/build.html
@@ -30,7 +30,12 @@
   {% comment %}
     This block is the same as the chip_text block above.
   {% endcomment %}
-  {% firstof text build.get_version_name %} #{{ build.id }}
+  {% if request.user and request.user.is_staff %}
+    {# This is an intentional exception #}
+    {{ text|default:build.version.verbose_name }}
+  {% else %}
+    {% firstof text build.get_version_name %}
+  {% endif %}
 {% endblock popupcard_header %}

 {% block popupcard_right %}

diff --git a/readthedocs/projects/querysets.py b/readthedocs/projects/querysets.py
index cd612797d..92a21befd 100644
--- a/readthedocs/projects/querysets.py
+++ b/readthedocs/projects/querysets.py
@@ -14,6 +14,9 @@ class ProjectQuerySetBase(models.QuerySet):

     use_for_related_fields = True

+    def __repr__(self):
+        return "ProjectQuerySetBase"
+
     def _add_user_projects(self, queryset, user, admin=False, member=False):
         """Add projects from where `user` is an `admin` or a `member`."""
         projects = AdminPermission.projects(

The response times and query count from above were viewing the project listing where one of the project's latest build has build.version = None. I did this by picking a project/build and updating it then saving.

With the patches above in on the beta instance, I get the expected 500 error page from our application, with a response time of 800ms!

Locally, the CLI logging still eats up ~8s though, but the queries were reduced from 800 to 18:

I'm guessing that this is all due to the beta templates using includes and extends more, especially around listing views.

Overall, I think that the fix that should come out of this debugging is that:

Our model repr methods should not use foreign keys ever
Our queryset repr methods should not include query data in production at very least
Our repr methods should probably all be minimal, they are mostly used for CLI use and obviously affect production.

@humitos what do you feel about the above change? Is there a better way to fix this without most likely removing some amount of project queryset debugging information in Sentry?

humitos commented 6 months ago

@humitos what do you feel about the above change?

I'm 👍🏼 on this plan.

I'd like to have a comment on the code clearly explaining why we are overriding these __str__/__repr__ method and maybe linking to this issue for more context. I'm sure we will forget about this in the near future.

Also, I think we could have a setting (enabled on local development and disabled on production) to decide whether or not override these methods. This way we can still have good debugging information locally but simplified on production.

Also also, in production, we could detect if we are running under a Django shell and enable the debugging option so we can see nice rendered objects in the terminal 👍🏼

Is there a better way to fix this without most likely removing some amount of project queryset debugging information in Sentry?

I'm not sure. Ideally, Sentry should collect all this data in an async way after returning a 500 response to the user --but I don't think that's possible. I saw that Sentry has a queue to send the messages after collecting the data, tho.

At least, with the setting/configuration approach that I mentioned above, we can still have debugging information while debugging 😄

agjohnson commented 6 months ago

Yeah, I figured we'd want some conditional sort of output on these. All of that seems reasonable, though it might also make sense to start minimal to fix the bug and see what Sentry/CLI/etc do and then jump back after to add the fancy conditional output after.

With debugging complete here, I'll hand this back for backend fixing, I won't jump into any further fixes here.

readthedocs / readthedocs.org

Template errors result in long development/production response times #10954

11061

Without an exception

With an exception