Closed: RamseyKamar closed this issue 3 years ago.
Just wanted to mention that I am getting the same error:
rule split_final_rejoined_exons:
input: all.exon_counts.rejoined.tsv.gz, all.exon_counts.rejoined.tsv.gz.accession_header
output: exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G026.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G029.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.R109.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.F006.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.ERCC.gz, exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.SIRV.gz
jobid: 19
threads: 40
/bin/bash /recount-unify/rejoin/split_out_exon_sums_by_study.sh bs G026,G029,R109,F006,ERCC,SIRV 1709834 /container-mounts/ref/exon_bitmasks.tsv /container-mounts/ref/exon_bitmask_coords.tsv all.exon_counts.rejoined.tsv.gz 40
rm -rf exons_split_by_study_temp exon_annotation_split_runs
Waiting at most 5 seconds for missing files.
MissingOutputException in line 318 of /recount-unify/Snakefile:
Missing files after 5 seconds:
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G026.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.G029.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.R109.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.F006.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.ERCC.gz
exon_sums_per_study/DY/LOCAL_STUDY/bs.exon_sums.LOCAL_STUDY.SIRV.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
@RamseyKamar yes, the first way (removing the "_att0" in the sample ids in the manifest) is the right way.
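For anyone building the manifest from the Pump output directory names, here is a minimal sketch of deriving sample ids by dropping the trailing "_att0". The helper name `sample_id_from_dir` is hypothetical and not part of Monorail, and the sketch only produces the id string; it makes no assumption about the other columns of sample_metadata.tsv.

```python
def sample_id_from_dir(dirname, suffix="_att0"):
    """Return the sample id for a Pump output directory name.

    Hypothetical helper: strips a trailing slash and a trailing
    "_att0" suffix; names without the suffix are returned unchanged.
    """
    name = dirname.rstrip("/")
    if name.endswith(suffix):
        return name[: -len(suffix)]
    return name


if __name__ == "__main__":
    # Directory name taken from the setup described in this issue.
    print(sample_id_from_dir(
        "DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG_att0/"
    ))
```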
ok, I think this issue is due to the pump having "LOCAL_STUDY" as the study name, while the sample_metadata.tsv passed to the Unifier has the actual study name. I added actual_study as the final (optional) parameter in the Unifier to fix this a while back, but it didn't make it into the documentation (until now):
So I suggest that @RamseyKamar and @daria-dc re-run the pump first with that extra param set to the study name exactly as it appears in sample_metadata.tsv for the appropriate samples (hopefully you're both testing on a small set of samples).
Thanks for your help @ChristopherWilks, for me your suggestion worked by just renaming the files!
for the record, I agree with @daria-dc, file renaming is another approach to solving this which should work.
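The renaming approach could be scripted roughly as follows. This is only a sketch under assumptions: it renames files whose names contain LOCAL_STUDY and does not touch directory names or file contents, which may or may not be sufficient for every Pump output. The function name `rename_local_study` and its `dry_run` parameter are hypothetical, not part of Monorail.

```python
from pathlib import Path

PLACEHOLDER = "LOCAL_STUDY"


def rename_local_study(root, actual_study, dry_run=True):
    """Rename files under `root` whose names contain LOCAL_STUDY.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing is changed on disk, so the plan can be inspected first.
    """
    planned = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and PLACEHOLDER in path.name:
            new_name = path.name.replace(PLACEHOLDER, actual_study)
            planned.append((path, path.with_name(new_name)))
    if not dry_run:
        for old, new in planned:
            old.rename(new)
    return planned
```

Running it once with `dry_run=True` and printing the returned pairs before renaming anything is the safer order, since the study name has to match sample_metadata.tsv exactly.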
Thank you @ChristopherWilks and @daria-dc I will try this!
@RamseyKamar and @daria-dc you should know that I just updated the recount-unify docker image today to 1.0.8. It fixes a couple of fairly major issues with getting the outputs of the Unifier to work in recount3. Specifically it fixes the incorrect casing of .all. and .unique. in the junction file names and the proper handling of custom metadata if SHORT_PROJECT_NAME is set to something other than sra (though this might have been fixed before, I don't recall).
@ChristopherWilks Thanks for the heads up! I'll pull the newest image and use that.
Dear @ChristopherWilks, Thanks again for the update to the unifier image and the previous suggestions on study names. I tried running the Unifier again, this time renaming the input files from the pump with LOCAL_STUDY replaced with the actual study_id, and it works.
However, I have a question about a scenario that could happen when trying to unify several studies together, despite having unique study names. I noticed that the final (and some intermediate) results are stored in a directory structure where the last two characters of the study name are used for the name of the directory, e.g. directories dy and se get created for study names TRT_study and DLBCL_relapse respectively. Although not true in my particular case because the final two characters of my study names are distinct from each other, let's say I have two studies named first_study and second_study. I haven't checked this myself, but will the fact that they have the same last two characters dy cause problems, or is the Unifier robust to that as long as the overall study names are unique? I suppose this won't be a problem at all if you just run the pipeline for getting Snaptron-ready output with the SKIP_SUMS=1 flag.
Thanks, Ramsey
Hi @RamseyKamar, as you probably already determined, this case should be fine: the unifier is built to handle multiple studies this way (i.e. using the last two characters of the study name to split the full set of studies into more manageable bins for really large groups of studies).
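To illustrate the binning described above, here is a small sketch. It assumes the layout seen in the log earlier in this thread (exon_sums_per_study/DY/LOCAL_STUDY/...), i.e. a bin directory named after the last two characters of the study name with a per-study subdirectory inside it. The function names are hypothetical and this is not the Unifier's actual code.

```python
from collections import defaultdict


def study_bin(study):
    """Bin name for a study: its last two characters, case preserved."""
    if len(study) < 2:
        raise ValueError("study name needs at least two characters")
    return study[-2:]


def output_dir(study, prefix="exon_sums_per_study"):
    """Assumed layout: <prefix>/<bin>/<study>."""
    return f"{prefix}/{study_bin(study)}/{study}"


def group_by_bin(studies):
    """Map each bin to the studies that fall into it."""
    bins = defaultdict(list)
    for study in studies:
        bins[study_bin(study)].append(study)
    return dict(bins)


if __name__ == "__main__":
    for name in ["first_study", "second_study", "DLBCL_relapse"]:
        print(output_dir(name))
    # exon_sums_per_study/dy/first_study
    # exon_sums_per_study/dy/second_study
    # exon_sums_per_study/se/DLBCL_relapse
```

Under that assumed layout, first_study and second_study share the dy bin but still get separate subdirectories, so unique study names are enough to keep their outputs apart.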
@ChristopherWilks Thanks for this! Closing this issue.
Dear @ChristopherWilks,
I am running the Unifier using the recount-unify_1.0.4.sif image and I'm getting one of two errors in the split_final_rejoined_exons rule, depending on how I set up sample_metadata.tsv. I think I'm setting up the outputs from the Pump correctly according to the instructions in the README, so I believe there is a bug in the snakemake pipeline (unless I'm missing something!).
I have created a test set of samples consisting of four samples from two separate studies which I analyzed with the Pump. I moved the Pump outputs to <Monorail repo root>/to_be_unified/debug_test_2/ as shown below.
My directory structure setup for running the Unifier is as follows:
|-- <Monorail repo root>/   <-- assigned to REF_DIR_HOST in run_recount_unify.sh
|-------hg38_unify/
|-------to_be_unified/
|-------------debug_test_2/   <-- assigned to INPUT_DIR_HOST
|-------------------AAC865-201001_A00723-GX_IL_LIB_20_I340_AAC865_VR_001_att0/
|-------------------AAC865-201001_A00723-GX_IL_LIB_20_I341_AAC865_VR_002_att0/
|-------------------DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC11_GGCTAC_att0/
|-------------------DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG_att0/
|-------manifests/
|-------------debug_test_2/
|-------------------sample_metadata.tsv   <-- assigned to SAMPLE_ID_MANIFEST_HOST
|-------unifier_output/
|-------------debug_test_2/   <-- assigned to WORKING_DIR_HOST
As an example, DA074-141103_SN7001396_0134_BC52GHACXX-s_1-BC1_ATCACG_att0/ contains the following:
This is the command line call using the above setup:
It was unclear whether I was supposed to include the "_att0" in the sample names within the manifest, so I tried both.
In the first case (removing the "_att0"), sample_metadata.tsv contains:
(obviously I'm naming the two studies "TRT_study" and "DLBCL_relapse")
and I get the following error:
... and this is what <Monorail repo root>/unifier_output/debug_test_2/ ends up with:
It's clear that the pipeline makes partial progress since, for example:
zcat all.exon_bw_count.pasted.gz | head -n 4
In the second case (keeping the "_att0"), sample_metadata.tsv contains:
This time I get this error:
and the contents of the working directory are:
I suspect that the first way (removing the "_att0" in the sample ids in the manifest) is the right way to do it since it seems to get further into the snakemake rule.
I started from scratch for both versions of the manifest and I can't see what I'm doing wrong in the setup. Would it be possible for you to reproduce the error with some test pump outputs on your end (like maybe just use some dummy SRA pump outputs as Unifier inputs)?
Thank you for your assistance!
Kind regards,
Ramsey