Closed: LachlanStuart closed this issue 4 years ago
@JosepSampe @omerb01 do you have any input, please?
Added IBM COS request retrying in https://github.com/pywren/pywren-ibm-cloud/pull/249. It should prevent this issue, as in https://github.com/metaspace2020/pywren-annotation-pipeline/pull/52.
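For illustration, the retrying is along these lines; this is a hypothetical sketch (the helper name, backoff parameters, and exception choices are mine, assuming `ibm_botocore` mirrors botocore's exception classes), not the actual code in that PR:

```python
import time

from ibm_botocore.exceptions import ClientError, ReadTimeoutError


def get_object_with_retries(cos_client, bucket, key, max_attempts=5, base_delay=1.0):
    """Hypothetical sketch of COS request retrying with exponential backoff.

    See pywren/pywren-ibm-cloud#249 for the real change.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return cos_client.get_object(Bucket=bucket, Key=key)['Body'].read()
        except (ClientError, ReadTimeoutError):
            if attempt == max_attempts:
                raise
            # Back off 1s, 2s, 4s, ... before retrying the request
            time.sleep(base_delay * 2 ** (attempt - 1))
```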
@LachlanStuart was this issue resolved?
@omerb01 Yes
I don't believe this is necessarily a PyWren issue, so I'm documenting it here instead. The issue seems to be that the annotate step is starting too many parallel I/O operations, and some of PyWren's internal COS calls are failing because they have to compete with loading 100+ GB of dataset segments.
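For context, here is a rough sketch of what bounding that parallelism could look like, so segment downloads don't starve PyWren's own COS requests; the function names and worker cap are hypothetical, not code from the annotation pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on concurrent segment downloads, so PyWren's internal COS
# requests (function/module download, status keys) still get bandwidth.
MAX_PARALLEL_DOWNLOADS = 8


def download_segment(cos_client, bucket, key):
    """Download one dataset segment from COS (illustrative only)."""
    return cos_client.get_object(Bucket=bucket, Key=key)['Body'].read()


def download_segments(cos_client, bucket, keys):
    """Download many segments with bounded parallelism instead of all at once."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_DOWNLOADS) as pool:
        futures = [pool.submit(download_segment, cos_client, bucket, k) for k in keys]
        return [f.result() for f in futures]
```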
Here's the bug I'm seeing:
Under high load with IBM COS, PyWren's JobRunner fails during the initial download of the function and modules. This has been raised before: https://github.com/pywren/pywren-ibm-cloud/issues/217. However, unlike that issue, in my case it hasn't been solved by switching COS to the Standard tier.
Logs from the host: https://gist.github.com/LachlanStuart/adde7da2e19b6abddb4a30f9271da775
Logs from the failing invocation: https://gist.github.com/LachlanStuart/c44df4b841f7fadb40615999b93687df
The function being called is `process_centr_segment`. This timeout only seems to occur when the data being read by `read_ds_segments` inside the function is too large - I didn't see this issue with the `huge` dataset, but I see it often with the `huge2` dataset. With smaller DBs it seems to succeed sometimes, but with larger DBs it consistently fails.

Note that I'm running with the code from https://github.com/metaspace2020/pywren-annotation-pipeline/pull/52, because if I don't apply that fix I just get unexplained OUTATIME errors.
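While debugging, one workaround that can help with these large reads is giving the COS client a longer read timeout and more retries. A minimal sketch, assuming `ibm_boto3`/`ibm_botocore` expose the same `Config` interface as boto3/botocore (the endpoint, credentials, and values are placeholders, not tuned recommendations):

```python
import ibm_boto3
from ibm_botocore.client import Config

# Longer read timeout and more retries for large segment reads under load.
cos_client = ibm_boto3.client(
    's3',
    endpoint_url='https://s3.eu-de.cloud-object-storage.appdomain.cloud',  # placeholder endpoint
    aws_access_key_id='<HMAC access key>',         # placeholder credentials
    aws_secret_access_key='<HMAC secret key>',
    config=Config(connect_timeout=30, read_timeout=300,
                  retries={'max_attempts': 10}),
)
```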