Closed aryarm closed 2 years ago
@aryam7 I've reproduced your error! One thing we could do is check whether the prefix is already applied in the workflow function, e.g., in self.workflow.apply_default_remote(value). But the worry would be if someone had a filepath that coincidentally matched the prefix name (unlikely, but it could happen!). What did you have in mind for using incomplete? For example, in this case incomplete is True. I'm not familiar yet with how checkpoints work, so I don't have insight into whether this extends beyond the Life Sciences executor.
Great! Yes, that would be a quick solution! I'm also worried about the case where the filepath matches the prefix name, which is why I've been exploring other options.
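To make that worry concrete, here's a minimal, hypothetical sketch (not Snakemake's actual code; the helper name is made up) of an idempotent prefix check and the false-positive case it would admit:

```python
# Hypothetical sketch: make prefix application idempotent by checking
# whether the prefix is already present on the path.
def apply_default_remote_once(path: str, prefix: str) -> str:
    # Skip paths that already start with the prefix.
    if path.startswith(prefix):
        return path
    return prefix + path

prefix = "my.s3.bucket.boop/artifacts/"

# Normal case: the prefix is added exactly once, even if called twice.
once = apply_default_remote_once("mapped_reads/A.bam", prefix)
twice = apply_default_remote_once(once, prefix)
assert once == twice == "my.s3.bucket.boop/artifacts/mapped_reads/A.bam"

# The worry: a local path that coincidentally starts with the prefix
# string would never get the prefix applied at all.
unlucky = apply_default_remote_once(prefix + "weird_local_dir/x.txt", prefix)
# unlucky is returned unchanged, which may be wrong for that user.
```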
My idea for using incomplete is that if rules.apply_default_remote() is being called more than once for checkpoints, then incomplete might be False the first time but True on the second. I haven't tested this yet, but if that's the case, then the problem may extend beyond the Life Sciences executor (to any situation in which the default_remote prefix is applied?).
Update: the exception occurs before incomplete is set to False on the second run, so that can't be it.
Any updates on this? I'm running into a similar issue in my workflow with checkpoints when I try to use the --kubernetes executor. I may dig into this example tomorrow to try to understand what the underlying issue is...
@johanneskoester thoughts on this?
Ok, I may have found somewhat of a fix? I just deleted lines 722 and 723 from rules.py, and everything started working again.
I don't really understand the underlying reason, but I'm guessing something about checkpoints causes apply_default_remote() in rules.py to be run more times than usual for a rule. So my approach was to try to figure out where that code is, and lines 722 and 723 seemed like good candidates.
I'm planning on spending some more time on it tomorrow. Once I feel like I understand things better, I can try to explain here and maybe write up a PR!
@aryarm I tried applying that fix but am still getting an error when I try testing it on minikube.
The following runs the tutorial successfully:
start minikube
mkdir ~/snakemake-checkpoint-issue
cd ~/snakemake-checkpoint-issue
#
# Build latest snakemake
#
git clone https://github.com/snakemake/snakemake
cd ~/snakemake-checkpoint-issue/snakemake
# Checkout v5.25.0
DEV_COMMIT_SHA=a9d155ec
git checkout $DEV_COMMIT_SHA
DEV_IMAGE=snakemake:dev0_${DEV_COMMIT_SHA}
docker build . -t $DEV_IMAGE
#
# Create local runner
#
rm -f local.Dockerfile
cat <<EOF > local.Dockerfile
FROM $DEV_IMAGE
RUN /bin/bash -c "pip install kubernetes"
EOF
LOCAL_IMAGE=snakemake_local:dev0_${DEV_COMMIT_SHA}
docker build -f local.Dockerfile -t $LOCAL_IMAGE .
cd ~/snakemake-checkpoint-issue
#
# Setup tutorial data
#
git clone https://github.com/snakemake/snakemake-tutorial-data
cd ~/snakemake-checkpoint-issue/snakemake-tutorial-data
# Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for our s3 bucket
$(aws-okta env dev-admin-ml-okta)
export AWS_DEFAULT_PROFILE=$AWS_OKTA_PROFILE
# Set S3_REMOTE_PREFIX to our remote prefix
export S3_REMOTE_PREFIX=my.s3.bucket.boop/minikube_snakemake_artifacts/
aws s3 sync data/ s3://$S3_REMOTE_PREFIX
# Un-gitify the directory so that snakemake doesn't try to upload data/
mv .git .git.bak
# Add missing plot-quals.py
cat <<EOF > plot-quals.py
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
EOF
# Run workflow
docker run -it --rm \
-v $HOME/.kube:/root/.kube \
-v $HOME/.minikube:$HOME/.minikube \
-v $PWD:/data \
--workdir /data \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_SECURITY_TOKEN \
-e AWS_SESSION_TOKEN \
-e AWS_DEFAULT_REGION \
$LOCAL_IMAGE \
snakemake \
--verbose \
--debug \
--use-conda \
--kubernetes \
--default-remote-provider S3 \
--default-remote-prefix $S3_REMOTE_PREFIX \
--envvars \
AWS_ACCESS_KEY_ID \
AWS_SECRET_ACCESS_KEY \
AWS_SECURITY_TOKEN \
AWS_SESSION_TOKEN \
AWS_DEFAULT_REGION
This completes successfully.
Next, I try to reproduce the original issue:
#
# Convert tutorial to checkpoint
#
mv .git.bak .git
rm -f 1.patch
# Apply patch
cat <<EOF > 1.patch
index 68974d1..d519d4f 100644
--- Snakefile
+++ Snakefile
@@ -4,7 +4,7 @@ rule all:
input:
"plots/quals.svg"
-rule bwa_map:
+checkpoint bwa_map:
input:
fastq="samples/{sample}.fastq",
idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
@@ -19,7 +19,7 @@ rule bwa_map:
rule samtools_sort:
input:
- "mapped_reads/{sample}.bam"
+ lambda wildcards: checkpoints.bwa_map.get(sample=wildcards.sample).output[0]
output:
"sorted_reads/{sample}.bam"
conda:
EOF
git apply 1.patch
mv .git .git.bak
#
# Run with tutorial as checkpoint
#
# Clear out previous results and resync data
aws s3 rm s3://$S3_REMOTE_PREFIX --recursive --include "*"
aws s3 sync data/ s3://$S3_REMOTE_PREFIX
# Run workflow
docker run -it --rm \
-v $HOME/.kube:/root/.kube \
-v $HOME/.minikube:$HOME/.minikube \
-v $PWD:/data \
--workdir /data \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_SECURITY_TOKEN \
-e AWS_SESSION_TOKEN \
-e AWS_DEFAULT_REGION \
$LOCAL_IMAGE \
snakemake \
--verbose \
--debug \
--use-conda \
--kubernetes \
--default-remote-provider S3 \
--default-remote-prefix $S3_REMOTE_PREFIX \
--envvars \
AWS_ACCESS_KEY_ID \
AWS_SECRET_ACCESS_KEY \
AWS_SECURITY_TOKEN \
AWS_SESSION_TOKEN \
AWS_DEFAULT_REGION
This gives me the original error:
Building DAG of jobs...
Full Traceback (most recent call last):
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 658, in snakemake
success = workflow.execute(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/workflow.py", line 684, in execute
dag.init()
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 173, in init
job = self.update([job], progress=progress)
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 848, in update_
raise MissingInputException(job.rule, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule samtools_sort:
my.s3.bucket.boop/minikube_snakemake_artifacts/my.s3.bucket.boop/minikube_snakemake_artifacts/mapped_reads/A.bam
MissingInputException in line 20 of /data/Snakefile:
Missing input files for rule samtools_sort:
my.s3.bucket.boop/minikube_snakemake_artifacts/my.s3.bucket.boop/minikube_snakemake_artifacts/mapped_reads/A.bam
unlocking
removed all locks
Next, I try to apply your recommended fix:
#
# Try fix removing snakemake/rules.py:L722-723
#
cd ~/snakemake-checkpoint-issue/snakemake
cat <<EOF > fix1.patch
index 913019e5..3a260386 100644
--- a/snakemake/rules.py
+++ b/snakemake/rules.py
@@ -719,8 +719,6 @@ class Rule:
is_unpack=is_unpack,
**aux_params
)
- if apply_default_remote:
- item = self.apply_default_remote(item)
if is_unpack and not incomplete:
if not allow_unpack:
EOF
git apply fix1.patch
DEV_IMAGE_FIX_1=snakemake:dev1_${DEV_COMMIT_SHA}
LOCAL_IMAGE_FIX_1=snakemake_local:dev1_${DEV_COMMIT_SHA}
# Build local runner
rm -f local.Dockerfile
cat <<EOF > local.Dockerfile
FROM $DEV_IMAGE_FIX_1
RUN /bin/bash -c "pip install kubernetes"
EOF
docker build . -t $DEV_IMAGE_FIX_1
docker build -f local.Dockerfile -t $LOCAL_IMAGE_FIX_1 .
#
# Run with "fix removing snakemake/rules.py:L722-723"
#
cd ~/snakemake-checkpoint-issue/snakemake-tutorial-data
# Run workflow
docker run -it --rm \
-v $HOME/.kube:/root/.kube \
-v $HOME/.minikube:$HOME/.minikube \
-v $PWD:/data \
--workdir /data \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_SECURITY_TOKEN \
-e AWS_SESSION_TOKEN \
-e AWS_DEFAULT_REGION \
$LOCAL_IMAGE_FIX_1 \
snakemake \
--verbose \
--debug \
--use-conda \
--kubernetes \
--default-remote-provider S3 \
--default-remote-prefix $S3_REMOTE_PREFIX \
--envvars \
AWS_ACCESS_KEY_ID \
AWS_SECRET_ACCESS_KEY \
AWS_SECURITY_TOKEN \
AWS_SESSION_TOKEN \
AWS_DEFAULT_REGION
but I get this error :disappointed:
Building DAG of jobs...
Full Traceback (most recent call last):
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 658, in snakemake
success = workflow.execute(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/workflow.py", line 684, in execute
dag.init()
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 173, in init
job = self.update([job], progress=progress)
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 819, in update_
raise ex
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 803, in update_
selected_job = self.update(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 753, in update
raise exceptions[0]
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 711, in update
self.update_(
File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 848, in update_
raise MissingInputException(job.rule, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule bwa_map:
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.ann
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.sa
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.amb
my.s3.bucket.boop/minikube_snakemake_artifacts/samples/A.fastq
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.pac
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.bwt
MissingInputException in line 7 of /data/Snakefile:
Missing input files for rule bwa_map:
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.ann
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.sa
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.amb
my.s3.bucket.boop/minikube_snakemake_artifacts/samples/A.fastq
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.pac
my.s3.bucket.boop/minikube_snakemake_artifacts/genome.fa.bwt
unlocking
removed all locks
Interestingly, if I revert the Snakefile from using a checkpoint back to using a rule (but keep the deletion of lines 722-723), the run succeeds:
mv .git.bak .git
git stash
mv .git .git.bak
docker run -it --rm \
-v $HOME/.kube:/root/.kube \
-v $HOME/.minikube:$HOME/.minikube \
-v $PWD:/data \
--workdir /data \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_SECURITY_TOKEN \
-e AWS_SESSION_TOKEN \
-e AWS_DEFAULT_REGION \
$LOCAL_IMAGE_FIX_1 \
snakemake \
--verbose \
--debug \
--use-conda \
--kubernetes \
--default-remote-provider S3 \
--default-remote-prefix $S3_REMOTE_PREFIX \
--envvars \
AWS_ACCESS_KEY_ID \
AWS_SECRET_ACCESS_KEY \
AWS_SECURITY_TOKEN \
AWS_SESSION_TOKEN \
AWS_DEFAULT_REGION
One detail to keep in mind if you are testing with a container orchestrator (e.g., Kubernetes or GLS) is that the version of Snakemake you run on your host is not necessarily the same as the version the remote builder uses (or the version in your cluster); you have to set the container image flag to control this: https://github.com/snakemake/snakemake/blob/be2d72caac18cd025480521f16a3a7202b482e27/snakemake/executors/__init__.py#L1447. With a release of Snakemake, the current release Docker image would typically be used, but in practice during development I've had to do a git fetch to retrieve all tags and derive an updated version. If you've built a custom local image, it's likely using the tag from the last released version and not your changed code. To have the worker use your changes, you would want to push the container with your changes somewhere and then specify it with the container image flag.
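In case it helps, the suggestion above might look roughly like this (a sketch, not tested here; REGISTRY and the image tag are placeholders for your own registry and naming):

```shell
# Push the patched image to a registry the cluster can pull from
# (REGISTRY is a placeholder for your own registry).
REGISTRY=docker.io/myuser
CUSTOM_IMAGE=$REGISTRY/snakemake:dev1_fix
docker tag "$LOCAL_IMAGE_FIX_1" "$CUSTOM_IMAGE"
docker push "$CUSTOM_IMAGE"

# Pass the image explicitly so remote jobs run the patched code rather
# than the last released snakemake/snakemake tag.
snakemake \
  --kubernetes \
  --container-image "$CUSTOM_IMAGE" \
  --default-remote-provider S3 \
  --default-remote-prefix "$S3_REMOTE_PREFIX"
```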
That's a good point, @vsoch. Thank you! At the moment, I've just been trying to fix these exceptions for when the DAG gets evaluated locally (i.e., before any Snakemake Docker images are used on the cloud). Now I realize that this bug also causes exceptions to arise within jobs that get submitted to the cloud, too. I'll try creating a custom container image, like you suggested. But how would I get the CI test to work in the PR?
@austinkeller, it sounds like the fix isn't working for you because of the issue that @vsoch mentioned. We won't be able to properly test whether the fix works unless we use a custom container image.
Also, as promised, here's an explanation of what I think is happening.
When Snakemake parses a Snakefile, one of its very first steps is to set the input and output of every rule based on the user-provided input/output directives. While this is happening, it also calls rules.apply_default_remote() on every input and output string in the Snakefile. So after it's done parsing the Snakefile, each rule object will have prefixed input and output strings.
As we know, though, checkpoints trigger the DAG to get reevaluated. During that process, the wildcards for every input are injected into the input and output of each rule via _apply_wildcards(). But unfortunately, one of the steps in _apply_wildcards() is to call rules.apply_default_remote() again.
I'm not sure why rules.apply_default_remote() is called from _apply_wildcards() if every rule will already have the prefix from earlier. So instead of deleting the lines that call rules.apply_default_remote() in _apply_wildcards(), I think it might be best to just check whether incomplete is true before calling rules.apply_default_remote(), since incomplete will be true only if the input is the output of a checkpoint.
So here's my proposed solution:
- if apply_default_remote:
+ # incomplete items will already have the default remote prefix
+ if apply_default_remote and not incomplete:
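A toy simulation (not Snakemake's real internals; apply_default_remote is modeled here as a plain string prepend) shows why the unguarded second call doubles the prefix, while gating on incomplete would not:

```python
# Toy model of the bug: apply_default_remote modeled as an
# unconditional string prepend.
PREFIX = "snakemake-testing-data/"

def apply_default_remote(path):
    return PREFIX + path

# Pass 1: Snakefile parsing prefixes every input/output once.
path = apply_default_remote("mapped_reads/A.bam")
assert path == "snakemake-testing-data/mapped_reads/A.bam"

# Pass 2: checkpoint re-evaluation calls it again via _apply_wildcards(),
# doubling the prefix -- the MissingInputException path seen above.
buggy = apply_default_remote(path)
assert buggy == "snakemake-testing-data/snakemake-testing-data/mapped_reads/A.bam"

# Proposed guard: skip the second application when incomplete is True,
# i.e. when the input is a not-yet-evaluated checkpoint output.
incomplete = True
guarded = path if incomplete else apply_default_remote(path)
assert guarded == "snakemake-testing-data/mapped_reads/A.bam"
```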
I created this test to minimally reproduce the issue. The fix that I'm recommending seems to work for that test, at least until it tries to use the older code on the cloud because of the debugging problem @vsoch mentioned.
@aryarm I think we are working on this in #691, it's just been a bit slow because the GLS testing is currently down. But we did see an issue there that led to the double parsing of the bucket name.
Ah, ok! Thank you for pinging me here to let me know, @vsoch! If you find yourself in need of a test to reproduce the double parsing, here's a commit that might be helpful.
I haven't been able to work on this issue much recently because of other commitments, but I may have more time later if you need me.
Thanks @aryarm I really appreciate that! I'm hoping we will get our testing running again soon, and I'll take a look at the commit to see if it can help the current test.
hey @aryarm ! We just merged a PR that at least can handle the redundant name for standard runs, would you care to test, and also check for the checkpoint case? If you find the issue for checkpoint, please provide a simple example to reproduce with the current master.
yay! thank you so much, @vsoch! I'll check it out once a new release with the fix is published
Being able to test the GLS code via CI will also be really helpful for me! If this issue persists, I should be able to finally test that idea for a fix that I had before.
I agree! @johanneskoester was on fire this morning :) :fire:
Issue still exists using tibanna (AWS) with checkpoints. The default-remote-prefix is prepended twice.
@klai5, so the fix that I tried in aryarm@7e6c2d5 didn't help on tibanna?
@aryarm, I tested 5.22.1 with your solution and also the 6.0.5 version (latest). I receive the following error with your fix:
snakemake/executors/__init__.py", line 1827, in handle_remote
if isinstance(target, _IOFile) and target.remote_object.provider.is_default:
AttributeError: 'NoneType' object has no attribute 'provider'
Hi folks, I'd like to work on a fix for this issue. @aryarm as requested here, could you provide first a PR with your testcase?
could you provide first a PR with your testcase?
done!
Evidently, the issue does still occur in the latest version of Snakemake (v6.9.0). It reared back up when I ran nosetests locally.
Thanks in advance for resolving this, @johanneskoester 😄
Any update on this bug? It seems to still be present on 6.10.0.
@cademirch, we're working on reproducing it in #1199. It's been difficult to verify that the test case actually fails properly because the CI needs to be instructed to run the test case on GCP. At the moment, the challenge seems to be some sort of bug with passing secrets in the CI.
Once that's done, we can finally try the original solution I proposed or come up with another if that doesn't work. Unfortunately, we can't properly test any solution until we can reproduce the issue in the CI because of the way that Snakemake submits jobs to GCP through the GLS: even if you implement a solution on your local computer, Snakemake will use a different version of its own code within the GCP instance. See this comment for details. We would need to upload a custom container and instruct Snakemake to use that, instead.
@aryarm Okay, I see - thanks for the update. Unfortunately I'm not very familiar with the Snakemake code, but I've taken a quick look and it seems the logic for applying the default remote prefix has moved to here https://github.com/snakemake/snakemake/blob/01d6102795c96ce695d6d7201f7e4655a1d5cac8/snakemake/path_modifier.py#L14 I'm not sure how much I can help with this, but I'll take a deeper look sometime this week as I'm also trying to figure out what is causing #1260.
Snakemake version
5.22.1
Describe the bug
Using the --default-remote-prefix parameter will cause MissingInputException errors to arise for the outputs of checkpoint rules.
Steps to Reproduce
Follow the Google Life Sciences Executor Tutorial but convert the bwa_map rule into a checkpoint like this, and then you should see the following when running the pipeline. Notice how snakemake-testing-data appears prepended twice?
Bug Hypothesis
I think rules.apply_default_remote() is being applied to the output more than once. It might help to check whether incomplete is true on this line before executing rules.apply_default_remote()? It might be possible that this is a bug that extends beyond the Life Sciences executor (to other cloud environments), but I haven't tested that yet.