We write log chunks to cloud storage every ~10 seconds, but we SKIP writing a log file if no log lines were produced in that window. When we request logs from the dashboard, we request them with a limit on the number of lines (2000). We traverse the files in storage until we reach 2000 lines, so long as there are more files available for traversal. If your logging is dense, you get to 2k lines quickly. If your logging is sparse in the sense that messages only show up every few minutes but arrive several lines at a time, you still get to 2k lines after traversing relatively few files, because the quiet periods produce no files at all and each file that does exist holds multiple lines. The worst case for how many remote log files must be traversed is logging that produces a small number of lines every several seconds (the absolute worst case being one line every 10 seconds, i.e. every chunk holds a single line and ~2000 files have to be read).
This PR adds a time limit: even if 2000 lines haven't been found yet and there are more files to check, we stop after 10 seconds of searching and return whatever we have. However, to make sure the cursor makes some progress with each call and the user gets something whenever possible, we do not return early if NO lines have been found yet.
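The real traversal code isn't reproduced here, but a minimal sketch of the behavior described above looks like this (the function name, the chunk-listing/reading helpers, and the exact constants are placeholders, not the actual implementation):

```python
import time

LINE_LIMIT = 2000          # the dashboard's per-request line limit
SEARCH_BUDGET_SECS = 10    # new in this PR: wall-clock cap on the traversal

def collect_log_lines(chunk_paths, read_lines):
    """Walk remote log chunks until we hit the line limit, run out of
    chunks, or exceed the time budget while already holding >= 1 line."""
    lines = []
    files_read = 0
    deadline = time.monotonic() + SEARCH_BUDGET_SECS
    for path in chunk_paths:
        lines.extend(read_lines(path))
        files_read += 1
        if len(lines) >= LINE_LIMIT:
            break
        # Only stop early on timeout if we already have something to show;
        # if nothing has been found yet, keep traversing so the cursor still
        # advances and the user gets at least one line whenever possible.
        if time.monotonic() > deadline and lines:
            break
    return lines[:LINE_LIMIT], files_read
```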
Testing
Made a run where the "log flush" interval was only 1 second, with a single log line produced every second. This reproduces the kind of "worst case" sparsity this PR is targeting. Run is here.
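The run itself isn't reproduced here, but the workload is roughly the following (the 1-second flush interval is a storage-side setting configured elsewhere; the logger name and message format are arbitrary):

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("sparse-worst-case")

# One log line per second; with a ~1 s flush interval, nearly every
# remote chunk ends up holding a single line.
for i in range(3600):  # an hour of maximally sparse logging
    log.info("heartbeat %d", i)
    time.sleep(1)
```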
Tried the following with that run:
- while the run had been going for a while, visit the log tab; ensure lines are shown within a few seconds
- try loading the next log lines; ensure more log lines are shown
- try jumping to the end; ensure log lines are shown
try "loading next" after jumping to the end and ensure more log lines are shown
try adding filters with varying degrees of rareness (ex: 58. to catch ~1 line for every 100s of execution, 857. to catch roughly every 1000s of execution). Ensured that the search lasted until at least one line was shown OR the search request timed out.
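For a rough sense of how rare those filters are against a run that logs an incrementing counter once per second, you can count matches offline (this is only an illustration against the counter text; the real filter matching happens in the log service and the actual line format may differ):

```python
import re

# Simulate ~10,000 seconds of one-line-per-second output (counter text only,
# ignoring timestamps) and count how many lines each filter would match.
lines = [f"heartbeat {i}" for i in range(10_000)]
for pattern in ("58.", "857."):
    hits = sum(1 for line in lines if re.search(pattern, line))
    print(f"{pattern!r} matched {hits} of {len(lines)} lines")
```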
Note that it is still possible for the search request to time out if you filter for something that is incredibly rare (or nonexistent) and traversing all the existing log files takes longer than the request is allowed. However, this is probably an acceptable failure mode (at any rate, solving it would likely require building a search index over the logs or some other similarly complex solution). If you don't use any filtering, you should always get back at least one line or be told (truthfully) that there are no lines yet.