vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

Change the destination of where failed script is written to #1530

Closed · gaow closed this issue 5 months ago

gaow commented 7 months ago

This is typically what we see when a job fails:

ERROR: [susie_twas_1]: [0]:
Failed to execute Rscript /home/aw3600/.sos/97de343d7da3f0ce/susie_twas_1_0_ff982a12.R
exitcode=1, workdir=/home/mambauser/data
---------------------------------------------------------------------------
[susie_twas]: Exits with 1 pending step (susie_twas_2)

The line Rscript /home/aw3600/.sos/97de343d7da3f0ce/susie_twas_1_0_ff982a12.R is how we track and try to reproduce the error for debugging. However, long story short, we are working with cloud computing where /home/aw3600/.sos/ is a path inside a VM that gets destroyed after a command ends. Although it is possible to copy the entire .sos folder to a permanent AWS S3 bucket before the VM dies, it is non-trivial to sync the entire folder ... all we care about is this one file, susie_twas_1_0_ff982a12.R.
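In principle we could hack around this ourselves: parse the failed script path out of the SoS log and copy just that one file to a pre-mounted folder before the VM is torn down. A rough sketch of the idea (the log name and the destination folder are made up):

import re
import shutil
from pathlib import Path

def salvage_failed_scripts(log_file: Path, dest: Path) -> None:
    # Copy every script mentioned in a "Failed to execute ..." line to a
    # persistent folder that later gets synced to S3.
    dest.mkdir(parents=True, exist_ok=True)
    pattern = re.compile(r"Failed to execute \S+ (\S+\.(?:R|py|sh))")
    for line in log_file.read_text().splitlines():
        match = pattern.search(line)
        if match and Path(match.group(1)).exists():
            shutil.copy2(match.group(1), dest)

salvage_failed_scripts(Path("sos_run.log"), Path("/data/output/failed_scripts"))

But that is fragile, so a built-in option would be much nicer.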

I think this conversation was once brought up, but I don't remember whether we have an option for it yet -- can we specify something on the sos run interface to make these temporary scripts get saved to a given folder? I like the behavior of:

R: ... , stdout =, stderr = 

which writes stderr and stdout to where I want them. I wonder if we can add something like:

R: ..., stdout=, stderr=, debug_script="/path/to/debug/folder"

and only keep the scripts in /path/to/debug/folder when there is an issue -- and change the Failed to execute prompt to point to this script?

BoPeng commented 7 months ago

Can't you use -v outside_sos:/home/user/.sos to save the content outside of the VM?

gaow commented 7 months ago

The suggestion we got from the vendor team is to avoid any direct communication between the VM and S3, especially for frequent file exchanges. That is why their setup only writes data to specific pre-mounted folders, which later get automatically copied to S3. I guess we can try to mount it as you suggest and see whether it is feasible for large embarrassingly parallel workflows involving many signature checks. I will report back here if it seems too slow.

BoPeng commented 7 months ago

Let me know if it works for you. It is pretty easy to add this setting from the command line, but the problem is that ~/.sos/config.yml is the default configuration path, and moving ~/.sos away would force users to use this option all the time. It is possible to keep ~/.sos/ but move the command log to somewhere else though.

gaow commented 7 months ago

@BoPeng thank you! Currently we cannot test it properly because our vendor set it up so that the S3 bucket is mounted read-only. The automatic process runs the sync separately from our SoS run, so we don't really write there in real time. I am asking them to reconfigure it and am waiting for a response.

It is possible to keep ~/.sos/ but move the command log to somewhere else though.

I think all we would need is to move the failed script (the "command log"?) elsewhere. That's what we are interested in. If it were written to a folder that we sync between the VM and S3, that would be best. We should be able to leverage that and test it out.

gaow commented 5 months ago

It is possible to keep ~/.sos/ but move the command log to somewhere else though.

@BoPeng I'm sorry, it turns out we do need this feature -- to keep ~/.sos where it is but have the failed script written somewhere else, perhaps the output folder? The problem is that we did try to mount the S3 bucket to ~/.sos as the cache. However, the I/O is an issue and we encounter SQLite failures frequently for large jobs, to the extent that we cannot get our analysis done this way ... We will need to keep ~/.sos local.

Can we change the default for SoS anyway to write the command log to the same folder where people set stderr to be, and if they did not set it, still write to ~/.sos? Thank you in advance for helping with this emergency fix!

BoPeng commented 5 months ago

The problem is that the file under .sos/... is the complete script that can be executed directly by Rscript etc., but stderr is a file with other content and cannot be directly executed. Would it not be enough if we also write the content of the script to stderr?

BoPeng commented 5 months ago

Let me see if #1533 works.

gaow commented 5 months ago

Thank you @BoPeng, that is a very helpful pointer to where to modify it. I have changed it in:

https://github.com/vatlab/sos/pull/1533/files

It seems to work. For example, for this script:

[1]
R: 
  print(hello)

it says:

Failed to execute Rscript /home/gw/.sos/aee75cfb40461b96/1_0_a0afcf75.R
exitcode=1, workdir=/home/gw/tmp/04-Feb-2024

but when I set the stderr file explicitly:

[1]
R: stderr="file.txt"
  print(hello)

the temporary script gets written to the current folder properly:

Failed to execute Rscript /home/gw/tmp/04-Feb-2024/1_0_9b3f78e8.R
exitcode=1, workdir=/home/gw/tmp/04-Feb-2024, stderr=file.txt
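In other words, the destination logic I am after is essentially: if the step redirects stderr to a file, keep the failed script next to that file (i.e. in the workdir); otherwise fall back to ~/.sos as before. A rough sketch of that decision, not the actual patch code:

import os
from typing import Optional

def failed_script_dir(stderr: Optional[str], workdir: str, sos_dir: str) -> str:
    # If stderr is redirected to a file, keep the failed script next to it so
    # both land in a folder we control; otherwise keep the old ~/.sos behavior.
    if stderr:
        return os.path.dirname(os.path.abspath(os.path.join(workdir, stderr)))
    return sos_dir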

Do you see any obvious issues with this patch? Please improve it as you see fit. I wonder if we can also release a new version to conda so we can pull the changes and apply them to our jobs on AWS. Thanks!

BoPeng commented 5 months ago

How about a combination of both patches? I think having the script directly in stderr can be useful, especially when stderr is the default (sys.stderr).

BoPeng commented 5 months ago

If we do that, for consistency and convenience, we should also write the script to

https://github.com/vatlab/sos/blob/23bd5e95f977260036b23e19a7cd782548bc8ce3/src/sos/task_executor.py#L311

This is because it is sometimes not entirely clear what went wrong with a task when it fails due to a variable replacement problem.
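Conceptually something like this at that point (names are illustrative only, not the actual variables in task_executor.py):

def record_failed_script(stderr_path, script_path):
    # When a task fails, append the fully expanded script to its recorded
    # stderr so the user can see exactly what was executed after variable
    # substitution, even if the temporary file is gone.
    with open(script_path) as script, open(stderr_path, "a") as err:
        err.write("\n--- script that failed ---\n")
        err.write(script.read())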

gaow commented 5 months ago

@BoPeng I brought the other patch back via https://github.com/vatlab/sos/commit/ffeac9cb300a3e247c48598649c5118014a8a4a2, instead of writing the entire script to stderr, because the script can be very long in many applications.

I think that by placing the error message into stderr, it should also be reflected in the task status, so we don't have to modify task_executor.py?

The patch does not work as is, however. The error message is:

TypeError: a bytes-like object is required, not 'str'

I think this is because you opened the stderr file with the b option, so we need to se.write() a bytes object, not a str? I'm not sure how to do that. I tried to wrap it with encode_msg, which bypasses the problem, but then the output has non-ASCII characters in it -- perhaps you know a quick fix :)
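To be concrete, a standalone reproduction of what I mean (nothing SoS-specific here):

with open("err.log", "wb") as se:
    try:
        se.write("some message\n")  # TypeError: a bytes-like object is required, not 'str'
    except TypeError as e:
        print(e)
    se.write("some message\n".encode())  # a plain .encode() is accepted

So maybe a plain .encode() is all that is needed there, rather than encode_msg?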

BoPeng commented 5 months ago

OK, the patch is updated. It should work with or without the stderr option and with or without task:. Please let me know if it works as expected.

gaow commented 5 months ago

Thanks @BoPeng, the patch works well in a couple of tests I tried. But regarding the conda release -- I see that there are some failed tests on #1533; should they be ignored?

BoPeng commented 5 months ago

I will clean up the code (pylint) and make a release.

BoPeng commented 5 months ago

sos 0.24.5 is released.

gaow commented 5 months ago

Thank you @BoPeng. It's not here yet: https://anaconda.org/conda-forge/sos but I guess it will show up soon?

BoPeng commented 5 months ago

Yes, it should be there in a few hours after the pypi release.

gaow commented 5 months ago

I am not sure that will be the case ... according to the release history, conda should already have version 0.24.4: https://pypi.org/project/sos/#history

However, it is still at 0.24.3: https://anaconda.org/conda-forge/sos

Perhaps there are some failing checks that prevent it from getting onto conda-forge?

gaow commented 5 months ago

It was posted after I merged the PR.