langmead-lab / monorail-external

examples to run monorail externally
MIT License
13 stars 5 forks source link

Error in Unifier rule "split_final_rejoined_exons" #4

Closed RamseyKamar closed 3 years ago

RamseyKamar commented 3 years ago

Dear @ChristopherWilks

I am running the Unifier using the recount-unify_1.0.4.sif image and I'm getting either of two errors in the split_final_rejoined_exons rule depending on how I set up sample_metadata.tsv. I think I'm setting up the outputs from the Pump correctly according to the instructions in the README, so I believe there is a bug in the snakemake pipeline (unless I'm missing something!).

I have created a test set of samples consisting of four samples from two separate studies which I analyzed with the Pump. I moved the Pump outputs to <Monorail repo root>/to_be_unified/debug_test_2/ as shown below.

My directory structure setup for running the Unifier is as follows:

|-- <Monorail repo root>/ <-- assigned to REF_DIR_HOST in run_recount_unify.sh |------- hg38_unify/ |------- to_be_unified/ |------------- debug_test_2/ <-- assigned to INPUT_DIR_HOST |------------------- AAC865-201001_A00723-GX_IL_LIB_20_I340_AAC865_VR_001_att0/ |------------------- AAC865-201001_A00723-GX_IL_LIB_20_I341_AAC865_VR_002_att0/ |------------------- DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC11_GGCTAC_att0/ |------------------- DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG_att0/ |------- manifests/ |------------- debug_test_2/ |------------------- sample_metadata.tsv <-- assigned to SAMPLE_ID_MANIFEST_HOST |------- unifier_output/ |------------- debug_test_2/ <-- assigned to WORKING_DIR_HOST

As an example, DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG_att0/ contains the following: image

This is the command line call using the above setup:

./singularity/run_recount_unify.sh <Monorail repo root>/recount-unify_1.0.4.sif hg38 <Monorail repo root> <Monorail repo root>/unifier_output/debug_test_2 <Monorail repo root>/to_be_unified/debug_test_2 <Monorail repo root>/manifests/debug_test_2/sample_metadata.tsv 20 nvstest:102

It was unclear whether I was supposed to include the "_att0" in the sample names within the manifest, so I tried both.

In the first case (removing the "_att0"), sample_metadata.tsv contains:

study_id        sample_id       fake_metadata
TRT_study       DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC11_GGCTAC  brain
TRT_study       DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG   skin
DLBCL_relapse   AAC865-201001_A00723-GX_IL_LIB_20_I340_AAC865_VR_001    brain
DLBCL_relapse   AAC865-201001_A00723-GX_IL_LIB_20_I341_AAC865_VR_002    skin

(obviously I'm naming the two studies "TRT_study" and "DLBCL_relapse")

and I get the following error:

[Fri Apr 30 17:31:01 2021]
rule compress_final_rejoined_genes:
    input: SIRV.gene_sums.tsv
    output: SIRV.gene_sums.tsv.gz
    jobid: 12
    wildcards: annotation=SIRV
    threads: 8

                cat SIRV.gene_sums.tsv | pigz --fast -p 8 > SIRV.gene_sums.tsv.gz

[Fri Apr 30 17:31:01 2021]
Finished job 12.
36 of 45 steps (80%) done
building annotation set done, disjoint2annot map size: 1860460, original annotation map size: 1709834
[Fri Apr 30 17:31:19 2021]
Finished job 5.
37 of 45 steps (82%) done

[Fri Apr 30 17:31:19 2021]
rule split_final_rejoined_exons:
    input: all.exon_counts.rejoined.tsv.gz, all.exon_counts.rejoined.tsv.gz.accession_header
    output: exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G026.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G029.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.R109.gz, ex
on_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.F006.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.ERCC.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.SIRV.gz
    jobid: 19
    threads: 20

                /bin/bash /recount-unify/rejoin/split_out_exon_sums_by_study.sh nvstest G026,G029,R109,F006,ERCC,SIRV 1709834 /container-mounts/ref/exon_bitmasks.tsv /container-mounts/ref/exon_bitmask_coords.tsv all.exon_counts.rejoined
.tsv.gz 20
        rm -rf exons_split_by_study_temp exon_annotation_split_runs

Waiting at most 5 seconds for missing files.
MissingOutputException in line 318 of /recount-unify/Snakefile:
Missing files after 5 seconds:
exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G026.gz
exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G029.gz
exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.R109.gz
exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.F006.gz
exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.ERCC.gz
exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.SIRV.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /container-mounts/working/.snakemake/log/2021-04-30T173032.405840.snakemake.log

... and this is what <Monorail repo root>/unifier_output/debug_test_2/ ends up with: image

It's clear that the pipeline makes partial progress since, for example: zcat all.exon_bw_count.pasted.gz | head -n 4

gene    start   end     name    score   strand  837666  837668  835718  835982
ERCC-00002      0       1061    .       0       +       838614  623434  0       0
ERCC-00003      0       1023    .       0       +       32575   31209   0       0
ERCC-00004      0       523     .       0       +       111972  81209   0       0

In the second case (keeping the "_att0"), sample_metadata.tsv contains:

study_id        sample_id       fake_metadata
TRT_study       DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC11_GGCTAC_att0     brain
TRT_study       DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG_att0      skin
DLBCL_relapse   AAC865-201001_A00723-GX_IL_LIB_20_I340_AAC865_VR_001_att0       brain
DLBCL_relapse   AAC865-201001_A00723-GX_IL_LIB_20_I341_AAC865_VR_002_att0       skin

This time I get this error:

[Fri Apr 30 21:08:56 2021]
rule compress_final_rejoined_genes:
    input: ERCC.gene_sums.tsv
    output: ERCC.gene_sums.tsv.gz
    jobid: 11
    wildcards: annotation=ERCC
    threads: 8

                cat ERCC.gene_sums.tsv | pigz --fast -p 8 > ERCC.gene_sums.tsv.gz

[Fri Apr 30 21:08:56 2021]
Finished job 11.
35 of 45 steps (78%) done

[Fri Apr 30 21:08:56 2021]
rule compress_final_rejoined_genes:
    input: SIRV.gene_sums.tsv
    output: SIRV.gene_sums.tsv.gz
    jobid: 12
    wildcards: annotation=SIRV
    threads: 8

                cat SIRV.gene_sums.tsv | pigz --fast -p 8 > SIRV.gene_sums.tsv.gz

[Fri Apr 30 21:08:56 2021]
Finished job 12.
36 of 45 steps (80%) done
[Fri Apr 30 21:09:12 2021]
Finished job 5.
37 of 45 steps (82%) done

[Fri Apr 30 21:09:12 2021]
rule split_final_rejoined_exons:
    input: all.exon_counts.rejoined.tsv.gz, all.exon_counts.rejoined.tsv.gz.accession_header
    output: exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G026.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G029.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.R109.gz, ex
on_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.F006.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.ERCC.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.SIRV.gz
    jobid: 19
    threads: 20

                /bin/bash /recount-unify/rejoin/split_out_exon_sums_by_study.sh nvstest G026,G029,R109,F006,ERCC,SIRV 1709834 /container-mounts/ref/exon_bitmasks.tsv /container-mounts/ref/exon_bitmask_coords.tsv all.exon_counts.rejoined
.tsv.gz 20
        rm -rf exons_split_by_study_temp exon_annotation_split_runs

TRT_study       no_runs
DLBCL_relapse   no_runs
ls: cannot access 'exons_split_by_study_temp/*.gz': No such file or directory
[Fri Apr 30 21:09:13 2021]
Error in rule split_final_rejoined_exons:
    jobid: 19
    output: exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G026.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.G029.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.R109.gz, ex
on_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.F006.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.ERCC.gz, exon_sums_per_study/DY/LOCAL_STUDY/nvstest.exon_sums.LOCAL_STUDY.SIRV.gz

RuleException:
CalledProcessError in line 333 of /recount-unify/Snakefile:
Command ' set -euo pipefail;  
                /bin/bash /recount-unify/rejoin/split_out_exon_sums_by_study.sh nvstest G026,G029,R109,F006,ERCC,SIRV 1709834 /container-mounts/ref/exon_bitmasks.tsv /container-mounts/ref/exon_bitmask_coords.tsv all.exon_counts.rejoined
.tsv.gz 20
        rm -rf exons_split_by_study_temp exon_annotation_split_runs ' returned non-zero exit status 2.
  File "/recount-unify/Snakefile", line 333, in __rule_split_final_rejoined_exons
  File "/opt/conda/envs/recount-unify/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /container-mounts/working/.snakemake/log/2021-04-30T210827.769649.snakemake.log

and the contents of the working directory is: image

I suspect that the first way (removing the "_att0" in the sample ids in the manifest) is the right way to do it since it seems to get further into the snakemake rule.

I started from scratch for both versions of the manifest and I can't see what I'm doing wrong in the setup. Would it be possible for you to reproduce the error with some test pump outputs on your end (like maybe just use some dummy SRA pump outputs as Unifier inputs)?

Thank you for your assistance!

Kind regards,

Ramsey

daria-dc commented 3 years ago

Just wanted to mention that I am getting the same error:

rule split_final_rejoined_exons:
    input: all.exon_counts.rejoined.tsv.gz, all.exon_counts.rejoined.tsv.gz.accession_header
    output: exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G026.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G029.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.R109.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.F006.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.ERCC.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.SIRV.gz
    jobid: 19
    threads: 40

                /bin/bash /recount-unify/rejoin/split_out_exon_sums_by_study.sh bs G026,G029,R109,F006,ERCC,SIRV 1709834 /container-mounts/ref/exon_bitmasks.tsv /container-mounts/ref/exon_bitmask_coords.tsv all.exon_counts.rejoined.tsv.gz 40
        rm -rf exons_split_by_study_temp exon_annotation_split_runs

Waiting at most 5 seconds for missing files.
MissingOutputException in line 318 of /recount-unify/Snakefile:
Missing files after 5 seconds:
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G026.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G029.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.R109.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.F006.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.ERCC.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.SIRV.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
ChristopherWilks commented 3 years ago

@RamseyKamar yes, the first way (removing the "_att0" in the sample ids in the manifest) is the right way.

ok, I think this issue is due to the pump having "LOCAL_STUDY" as the study name, while the sample_metadata.tsv passed to the Unifier has the actual study name. I added actual_study as the final (optional) parameter in the Unifier to fix this a while back but didn't make it into the documentation (until now):

https://github.com/langmead-lab/monorail-external/blob/194b898e21dfef11f382405983aa40ba61921a43/singularity/run_recount_pump.sh#L32

So I suggest, for @RamseyKamar and @daria-dc to re-run pump first with that extra param set to what's going to be the study name exactly as it appears in sample_metadata.tsv for the appropriate samples (hopefully you're both testing on a small set of samples).

daria-dc commented 3 years ago

Thanks for your help @ChristopherWilks,

for me your suggestion worked by just renaming the files!

ChristopherWilks commented 3 years ago

for the record, I agree with @daria-dc, file renaming is another approach to solving this which should work.

RamseyKamar commented 3 years ago

Thank you @ChristopherWilks and @daria-dc I will try this!

ChristopherWilks commented 3 years ago

@RamseyKamar and @daria-dc you should know that I just updated the recount-unify docker image today to 1.0.8.

It fixes a couple of fairly major issues with getting the outputs of the Unifier to work in recount3. Specifically it fixes the incorrect casing of .all. and .unique. in the junction file names and the proper handling of custom metadata if SHORT_PROJECT_NAME is set to something other than sra (though this might have been fixed before, I don't recall).

RamseyKamar commented 3 years ago

@ChristopherWilks Thanks for the heads up! I'll pull the newest image and use that.

RamseyKamar commented 3 years ago

Dear @ChristopherWilks, Thanks again for the update to the unifier image and the previous suggestions on study names. I tried running the Unifier again, this time renaming the input files from the pump with LOCAL_STUDY replaced with the actual study_id and it works.

However, I have a question about a scenario that could happen when trying to unify several studies together, despite having unique study names. I noticed that the final (and some intermediate) results are stored in a directory structure where the last two characters of the study name are used for the name of the directory, e.g. directories dy and se get created for study name TRT_study and DLBCL_relapse respectively. Although not true in my particular case because the final two characters of my study names are distinct from each other, let's say I have two studies named first_study and second_study. I haven't checked this myself, but will the fact that they have the same last two characters dy cause problems or is the Unifier robust to that as long as the overall study names are unique? I suppose this won't be a problem at all if you just run the pipeline for getting Snaptron-ready output with the SKIP_SUMS=1 flag.

Thanks, Ramsey

ChristopherWilks commented 3 years ago

Hi @RamseyKamar as you probably already determined this case should be fine---the unifier is built to handle multiple studies this way (i.e. using the last two characters of the study as a way to split the group of total studies into more manageable bins for really large groups of studies).

RamseyKamar commented 3 years ago

@ChristopherWilks Thanks for this! Closing this issue.