Closed: victoriaatraft closed this issue 1 month ago
One question about this: I noticed that clicking into the errors is a lot faster in the lower environments than in production, even when a particular file has 200K+ errors. Would deleting/archiving the parser_errors records after a certain period improve this experience?
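For reference, a retention job along those lines could be a small Django management command. This is only a sketch of the idea being asked about, not an agreed change: the import path and the `created_at` field name are assumptions about the codebase, and the retention window is illustrative.

```python
# Hypothetical cleanup command; model path, field name, and default window are assumptions.
from datetime import timedelta

from django.core.management.base import BaseCommand
from django.utils import timezone

from tdpservice.parsers.models import ParserError  # assumed import path


class Command(BaseCommand):
    help = "Delete parser errors older than a retention window."

    def add_arguments(self, parser):
        parser.add_argument("--days", type=int, default=730, help="Retention window in days.")

    def handle(self, *args, **options):
        cutoff = timezone.now() - timedelta(days=options["days"])
        # Bulk-delete everything older than the cutoff; .delete() returns (total, per-model counts).
        deleted, _ = ParserError.objects.filter(created_at__lt=cutoff).delete()
        self.stdout.write(f"Deleted {deleted} parser error rows older than {cutoff:%Y-%m-%d}.")
```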
FYSA -- poking around DAC in prod since sprint 109 release and am noticing less latency here. I'm curious if this has anything to do with having a bigger RDS? cc: @raftmsohani @elipe17
The more data we have in the table, the longer the query will take, but I have done some improvements on that admin page. There are other improvements we can make as well. And of course, newer Postgres versions have improvements for fields like JSON and for how queries are executed in general.
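As one concrete example of the kind of query-side improvement being discussed (a sketch only, not a change confirmed to exist in the codebase): the admin changelist query shown later in this thread filters on `file_id` and orders by `id` descending, so a composite index matching that shape would let Postgres read the newest errors for a file directly instead of filtering and sorting. App, model, and migration names below are assumed.

```python
# Sketch of a migration adding a composite index matching the admin changelist
# query (WHERE file_id = %s ORDER BY id DESC LIMIT 100). Names are assumptions.
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("parsers", "XXXX_previous_migration"),  # placeholder for the latest migration
    ]

    operations = [
        migrations.AddIndex(
            model_name="parsererror",
            index=models.Index(fields=["file", "-id"], name="parsererror_file_iddesc_idx"),
        ),
    ]
```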
On Oct 22nd, @ADPennington mentioned:
We are on site today, so we tested 3 different ways to access a list of 300K parser errors:
1. Connected to the HHS network: after 20-30 seconds, a 404 page.
2. Disconnected from the HHS network but connected to VPN: same result as 1.
3. Disconnected from the network and VPN, connected to the internet via hotspot: after about 8 seconds, the page loads successfully.
The following is the log from the backend app:
09:34:58.559: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:34:58 +0000] "GET /admin/data_files/datafile/ HTTP/1.0" 200 190717 "https://tanfdata.acf.hhs.gov/admin/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:35:37.615: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:35:37 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:37:22.939: [APP/PROC/WEB.0] [2024-10-22 13:37:22 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:564)
09:37:23.963: [APP/PROC/WEB.0] [2024-10-22 13:37:23 +0000] [7] [WARNING] Worker with pid 564 was terminated due to signal 9
09:37:23.966: [APP/PROC/WEB.0] [2024-10-22 13:37:23 +0000] [1158] [INFO] Booting worker with pid: 1158
09:38:20.016: [APP/PROC/WEB.0] [2024-10-22 13:38:20 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:565)
09:38:21.041: [APP/PROC/WEB.0] [2024-10-22 13:38:21 +0000] [7] [WARNING] Worker with pid 565 was terminated due to signal 9
09:38:21.044: [APP/PROC/WEB.0] [2024-10-22 13:38:21 +0000] [1159] [INFO] Booting worker with pid: 1159
09:39:03.037: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:39:03 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:40:50.246: [APP/PROC/WEB.0] [2024-10-22 13:40:50 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:1158)
09:40:51.256: [APP/PROC/WEB.0] [2024-10-22 13:40:51 +0000] [7] [WARNING] Worker with pid 1158 was terminated due to signal 9
09:40:51.258: [APP/PROC/WEB.0] [2024-10-22 13:40:51 +0000] [1160] [INFO] Booting worker with pid: 1160
09:41:33.786: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:33 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:41:45.659: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:45 +0000] "GET /admin/parsers/parsererror/?file=4887 HTTP/1.0" 200 120367 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:41:45.880: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:45 +0000] "GET /admin/jsi18n/ HTTP/1.0" 200 3195 "https://tanfdata.acf.hhs.gov/admin/parsers/parsererror/?file=4887" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:41:59.357: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:41:59 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:17.007: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:16 +0000] "GET /admin/parsers/parsererror/?file=4887 HTTP/1.0" 200 120367 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:17.199: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:17 +0000] "GET /admin/jsi18n/ HTTP/1.0" 200 3195 "https://tanfdata.acf.hhs.gov/admin/parsers/parsererror/?file=4887" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:31.836: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:31 +0000] "GET /admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48 HTTP/1.0" 200 168655 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
09:42:46.292: [APP/PROC/WEB.0] 10.255.194.170 - - [22/Oct/2024:13:42:46 +0000] "GET /admin/parsers/parsererror/?file=4893 HTTP/1.0" 200 122349 "https://tanfdata.acf.hhs.gov/admin/data_files/datafile/?stt__id__exact=99&stt__id__exact=48" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
Hypothesis: one possible issue, based on the above logs, is that the query is putting load on memory and consequently causing the app to hit its memory limit and crash.
Test: I tested this on develop by browsing 200K parser errors. The page loaded in around 3 seconds, with no obvious memory increase, and the query is limited to the paginated number of records:

```sql
SELECT parser_error.id, parser_error.row_number, parser_error.field_name, parser_error.error_message,
parser_error.error_type, ....
FROM parser_error
WHERE parser_error.file_id = %s
ORDER BY parser_error.id DESC
LIMIT 100
```

One solution is to limit the number of records per page to 50 instead of 100 by setting list_per_page = 50.
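A minimal sketch of that change, assuming the ParserError admin lives in the parsers app (class and import names here are illustrative; the list columns are taken from the query above). list_per_page is a standard ModelAdmin option and Django's default is 100.

```python
# Sketch of the proposed page-size change (names assumed).
from django.contrib import admin

from tdpservice.parsers.models import ParserError  # assumed import path


@admin.register(ParserError)
class ParserErrorAdmin(admin.ModelAdmin):
    list_display = ["row_number", "field_name", "error_message", "error_type"]
    list_per_page = 50  # halve the Django default of 100
```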
Looking at the logs from the frontend, we see that the request from the frontend to the backend is being dropped, and the frontend then responds with a 404 because the backend is not available. So we need to find out why the backend dropped the request. If we look at the same time in the backend log (9:38 am, Oct 22), we see that a worker was terminated with signal 9, which suggests memory was not available. In other words, the backend was momentarily unavailable while that worker was killed and restarted.
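For context on the log pattern above: the CRITICAL WORKER TIMEOUT, "terminated due to signal 9", and "Booting worker" lines are gunicorn's master process killing an unresponsive worker and starting a replacement; an out-of-memory kill by the platform produces the same signal, so either explanation fits. The sketch below shows the standard gunicorn settings that govern this behavior; the values are illustrative, not the app's actual configuration.

```python
# gunicorn.conf.py (illustrative values only; not the app's real config)

# A worker that does not respond within `timeout` seconds is killed by the
# master process and rebooted, producing the WORKER TIMEOUT / signal 9 /
# "Booting worker" sequence seen in the log above.
timeout = 120

# Fewer workers and periodic recycling keep per-process memory bounded, which
# matters if the crashes turn out to be memory-related.
workers = 2
max_requests = 1000        # recycle a worker after this many requests
max_requests_jitter = 50   # stagger recycling so workers do not restart together
```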
Although the admin list page is limited to 100 records, which by itself should not put much load on memory, we can still reduce the memory consumed by the admin list page by limiting the number of records it loads.
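Beyond loading fewer rows, each row could also carry less data. Extending the hypothetical sketch above (column names again taken from the query earlier in the thread), the changelist queryset can defer columns it does not render:

```python
# Hypothetical extension of the admin sketch above: fetch only the columns the
# changelist actually renders, so large deferred columns (e.g. JSON payloads)
# are not pulled into memory for every listed row.
from django.contrib import admin


class ParserErrorAdmin(admin.ModelAdmin):
    list_display = ["row_number", "field_name", "error_message", "error_type"]
    list_per_page = 50

    def get_queryset(self, request):
        qs = super().get_queryset(request)
        # Every field the changelist touches (including anything used by
        # __str__) must be listed here, or Django will re-query per row.
        return qs.only("id", "row_number", "field_name", "error_message", "error_type")
```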
@vlasse86 @raftmsohani for posterity, can you note here why we closed this? cc: @ADPennington
Description: Currently, when clicking into "Parser Errors" for a given row of the DAC Data Files page, there is significant latency before the system returns results.
Delivers
Supporting Documentation: