Closed: LachlanStuart closed this issue 4 years ago
@JosepSampe @omerb01 do you have any input, please?
Added IBM COS request retrying in https://github.com/pywren/pywren-ibm-cloud/pull/249. It should prevent this issue, as in https://github.com/metaspace2020/pywren-annotation-pipeline/pull/52.
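For illustration, the retrying is along these lines; this is a hypothetical sketch (the helper name, backoff parameters, and exception choices are mine, assuming `ibm_botocore` mirrors botocore's exception classes), not the actual code in that PR:

```python
import time

from ibm_botocore.exceptions import ClientError, ReadTimeoutError


def get_object_with_retries(cos_client, bucket, key, max_attempts=5, base_delay=1.0):
    """Hypothetical sketch of COS request retrying with exponential backoff.

    See pywren/pywren-ibm-cloud#249 for the real change.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return cos_client.get_object(Bucket=bucket, Key=key)['Body'].read()
        except (ClientError, ReadTimeoutError):
            if attempt == max_attempts:
                raise
            # Back off 1s, 2s, 4s, ... before retrying the request
            time.sleep(base_delay * 2 ** (attempt - 1))
```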
@LachlanStuart was this issue resolved?
@omerb01 Yes
I don't believe this is necessarily a PyWren issue, so I'm documenting it here instead. The issue seems to be that the annotate step is starting too many parallel I/O operations, and some of PyWren's internal COS calls are failing because they have to compete with loading 100+ GB of dataset segments.
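For context, here is a rough sketch of what bounding that parallelism could look like, so segment downloads don't starve PyWren's own COS requests; the function names and worker cap are hypothetical, not code from the annotation pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on concurrent segment downloads, so PyWren's internal COS
# requests (function/module download, status keys) still get bandwidth.
MAX_PARALLEL_DOWNLOADS = 8


def download_segment(cos_client, bucket, key):
    """Download one dataset segment from COS (illustrative only)."""
    return cos_client.get_object(Bucket=bucket, Key=key)['Body'].read()


def download_segments(cos_client, bucket, keys):
    """Download many segments with bounded parallelism instead of all at once."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_DOWNLOADS) as pool:
        futures = [pool.submit(download_segment, cos_client, bucket, k) for k in keys]
        return [f.result() for f in futures]
```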
Here's the bug I'm seeing:
Under high load with IBM COS, PyWren's JobRunner fails during the initial download of the function and modules. This has been raised before: https://github.com/pywren/pywren-ibm-cloud/issues/217. However, unlike that issue, in my case it hasn't been solved by switching COS to the Standard tier.
Logs from the host: https://gist.github.com/LachlanStuart/adde7da2e19b6abddb4a30f9271da775
Logs from the failing invocation: https://gist.github.com/LachlanStuart/c44df4b841f7fadb40615999b93687df
The function being called is `process_centr_segment`. This timeout only seems to occur when the data being read by `read_ds_segments` inside the function is too large - I didn't see this issue with the `huge` dataset, but I see it often with the `huge2` dataset. With smaller DBs it seems to succeed sometimes, but with larger DBs it consistently fails.

Note that I'm running with the code from https://github.com/metaspace2020/pywren-annotation-pipeline/pull/52, because if I don't apply that fix I just get unexplained OUTATIME errors.
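While debugging, one workaround that can help with these large reads is giving the COS client a longer read timeout and more retries. A minimal sketch, assuming `ibm_boto3`/`ibm_botocore` expose the same `Config` interface as boto3/botocore (the endpoint, credentials, and values are placeholders, not tuned recommendations):

```python
import ibm_boto3
from ibm_botocore.client import Config

# Longer read timeout and more retries for large segment reads under load.
cos_client = ibm_boto3.client(
    's3',
    endpoint_url='https://s3.eu-de.cloud-object-storage.appdomain.cloud',  # placeholder endpoint
    aws_access_key_id='<HMAC access key>',         # placeholder credentials
    aws_secret_access_key='<HMAC secret key>',
    config=Config(connect_timeout=30, read_timeout=300,
                  retries={'max_attempts': 10}),
)
```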