raft-tech / TANF-app

Repo for development of a new TANF Data Reporting System
Other
17 stars 4 forks source link

[Spike] Investigate latency when clicking into the parsing errors column on DAC data files page #3075

Closed victoriaatraft closed 1 month ago

victoriaatraft commented 4 months ago

Description: Currently when clicking into "Parser Errors" for a given row of the DAC Data Files page there is significant latency before the system returns results.

Delivers

Supporting Documentation:

image

robgendron commented 3 months ago

3137 may be related to this, probably not blocking.

ADPennington commented 3 months ago

one question about this: I noticed that clicking the errors is a lot faster in the lower environments than in production, even when a particular file has 200K+ errors. Would deleting/archiving the parser_errors records after a certain period, improve this experience?

ADPennington commented 1 month ago

FYSA -- poking around DAC in prod since sprint 109 release and am noticing less latency here. I'm curious if this has anything to do with having a bigger RDS? cc: @raftmsohani @elipe17

raftmsohani commented 1 month ago

FYSA -- poking around DAC in prod since sprint 109 release and am noticing less latency here. I'm curious if this has anything to do with having a bigger RDS? cc: @raftmsohani @elipe17

The more data we have in the, the query will take longer, but I have done some improvement on that admin page. There are some other improvements we can do. And of course, newer postgres have improvement over fields like json, and how queries are done in general

raftmsohani commented 1 month ago

On Oct 22nd, @ADPennington mentioned:

we are on site today, so tested 3 different ways to access list of 300K parser errors. 1) connected to hhs network 2) disconnected from hhs network but connected to vpn and 3) disconnected from network and disconnect from vpn and connected to internet via hotspot. 1) after 20-30 seconds, 404 page 2) same result as 1. 3) after about 8 seconds, page loads successfully

Following is the log from backend app:

09:34:58.559: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:34:58 +0000] "GET /admin/data_files/datafile/ HTTP/1.0" 200 190717 "https://tanfdata.acf.hhs.gov/admin/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:35:37.615: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:35:37 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:37:22.939: [APP/PROC/WEB.0] [2024-10-22 13:37:22 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:564)
09:37:23.963: [APP/PROC/WEB.0] [2024-10-22 13:37:23 +0000] [7] [WARNING] Worker with pid 564 was terminated due to signal 9
09:37:23.966: [APP/PROC/WEB.0] [2024-10-22 13:37:23 +0000] [1158] [INFO] Booting worker with pid: 1158
09:38:20.016: [APP/PROC/WEB.0] [2024-10-22 13:38:20 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:565)
09:38:21.041: [APP/PROC/WEB.0] [2024-10-22 13:38:21 +0000] [7] [WARNING] Worker with pid 565 was terminated due to signal 9
09:38:21.044: [APP/PROC/WEB.0] [2024-10-22 13:38:21 +0000] [1159] [INFO] Booting worker with pid: 1159
09:39:03.037: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:39:03 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:40:50.246: [APP/PROC/WEB.0] [2024-10-22 13:40:50 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:1158)
09:40:51.256: [APP/PROC/WEB.0] [2024-10-22 13:40:51 +0000] [7] [WARNING] Worker with pid 1158 was terminated due to signal 9
09:40:51.258: [APP/PROC/WEB.0] [2024-10-22 13:40:51 +0000] [1160] [INFO] Booting worker with pid: 1160
09:41:33.786: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:33 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:41:45.659: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:45 +0000] "GET /admin/parsers/parsererror/?file=4887 HTTP/1.0" 200 120367 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:41:45.880: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:45 +0000] "GET /admin/jsi18n/ HTTP/1.0" 200 3195 "https://tanfdata.acf.hhs.gov/admin/parsers/parsererror/?file=4887" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:41:59.357: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:59 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:17.007: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:16 +0000] "GET /admin/parsers/parsererror/?file=4887 HTTP/1.0" 200 120367 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:17.199: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:17 +0000] "GET /admin/jsi18n/ HTTP/1.0" 200 3195 "https://tanfdata.acf.hhs.gov/admin/parsers/parsererror/?file=4887" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:31.836: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:31 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:46.292: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:46 +0000] "GET /admin/parsers/parsererror/?file=4893 HTTP/1.0" 200 122349 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
raftmsohani commented 1 month ago

Hypothesis: One possible issue based on above logs is that the query is causing load on memory and consequently causing the app to hit memory limit threshold and crash.

Test: I tested this on develop, by browsing 200k parser errors, which took around 3 seconds, and no obvious memory increase and the query is limited to pagination number of records:

SELECT parser_error.id, parser_error.row_number, parser_error.field_name, parser_error.error_message,
   parser_error.error_type, ....
FROM parser_error
WHERE parser_error.file_id = %s
ORDER BY parser_error.id DESC
LIMIT 100
raftmsohani commented 1 month ago

One solution is to limit the number of records to 50 instead of 100 using list_per_page = 50

raftmsohani commented 1 month ago

Looking at the logs from frontend (shown below), we see that the request from frontend to backend is being dropped, and then frontend responds with a 404 code because the backend is not available:

Screenshot 2024-10-24 at 9 57 49 AM

so we will need to find out why the backend dropped the request. If we look at the same time (9:38 am Oct 22), we see that a worker is terminated with signal 9, which means the memory was not available. This means backend was momentarily unavailable while the worker was using the memory and was being restarted.

Screenshot 2024-10-24 at 10 05 12 AM

Although the admin page list is limited to 100 queries and that should not put load on the memory, we can still reduce the memory consumption by the admin list page by limiting the number records

lfrohlich commented 1 month ago

@vlasse86 @raftmsohani for posterity, can you note we closed cc: @ADPennington