sassoftware / sas-airflow-provider

Apache Airflow Provider for creating tasks in Airflow to execute SAS Studio Flows and Jobs.
Apache License 2.0
18 stars 15 forks source link

Logging Not Full Downloaded When Log is Large #36

Open engmtcdrm opened 3 months ago

engmtcdrm commented 3 months ago

If you have a piece of code that takes less time to run than to download the log, the SASStudioOperator will not finish downloading the log. This is due to how the method _run_job_and_wait works. It will continue to loop until the status of the job changes out of running or pending (let's ignore the unknown logic for a moment). While it is looping it will call the method stream_log which will download the log for Airflow. However, if the job completes before the log can be downloaded, the log will be incomplete.

The SAS code to run to replicate this is rather simple, it spits out a ton of NOTEs to the log intentionally trying to see how far the log can be pushed before breaking. If the uncommented 500000 does not work to see the issue, just change it to a higher number and try again.

%macro test_log_lines();
    /* %do i = 1 %to 250000; */
    %do i = 1 %to 500000;
    /* %do i = 1 %to 1000000; */
    /* %do i = 1 %to 2000000; */
    /* %do i=1 %to 40000000; */
    /* %do i=1 %to 80000000; */
        %put NOTE: hi mom &i.!;

        /* data _null_;
            sleep(1);
        run; */
    %end;
%mend test_log_lines;

%test_log_lines;

The solution to this is to have a second check after the while loop on line 347 to check if the num_log_lines < job['logStatistics']['lineCount']. This logic should be a while loop replace the logic on line 379 because that will only grab the last 99999 lines of the log, not ALL the rest.

Lastly, here is a screen shot of the log from the code I ran above in Airflow UI. You can see how the log abruptly stops. image

rizhansas commented 3 months ago

Polling the job status in the operator is suboptimal. Implementing defer operator is a better choice. Because it frees up the precious CPU task slot and let the Triggerer takes care of polling.

We have experienced a similar log collection delay in a customized operator. A second check or a delay collection are two possible solution.