langmead-lab / monorail-external

examples to run monorail externally
MIT License

Unify script failing at annotate_all_sjs #14

Closed · mdrnao closed 2 years ago

mdrnao commented 2 years ago

Hello,

I'm trying a test set of two local samples and failing at the final steps of the Unify pipeline!

The image was pulled with singularity pull docker://quay.io/broadsword/recount-unify:latest, using Singularity 3.8.5.

Last few lines of running:


> [Fri Feb  4 16:23:44 2022]
> Finished job 3.
> 22 of 24 steps (92%) done
> + sample_id_file=/container-mounts/working/ids.tsv
> + sample_original_metadata_file=/container-mounts/working/sample_metadata.tsv
> + cat /container-mounts/working/sample_metadata.tsv
> + perl -ne 'BEGIN { open(HDR,">/container-mounts/working/ids.tsv.new_header"); open(IN,"</container-mounts/working/ids.tsv"); @ids=<IN>; close(IN); chomp(@ids); %idmap=(); map { ($study,$sample,$rid)=split(/\t/,$_); $idmap{$study."|".$sample}=$rid; } @ids; } chomp; $i++; $f=$_; @f=split(/\t/,$f); $study=shift(@f); $sample=shift(@f); $f=join("\t",@f); $f="\t$f" if(scalar @f > 0); $rail_id=$idmap{"$study|$sample"}; if(!$rail_id) { if($i==1) { print HDR "rail_id\t$sample\t$study"."$f\n"; next; } print STDERR "failed to map $study\t$sample\t$f to a rail_id, terminating\n"; close(HDR); exit(-1); } print "$rail_id\t$sample\t$study"."$f\n"; END { close(HDR); }'
> failed t to a rail_id, terminating
> [Fri Feb  4 16:23:44 2022]
> Error in rule annotate_all_sjs:
>     jobid: 2
>     output: junctions.bgz, junctions.bgz.tbi, junctions.sqlite, samples.tsv, lucene_indexed_numeric_types.tsv, lucene_full_standard, lucene_full_ws, samples.fields.tsv
> 
> RuleException:
> CalledProcessError in line 299 of /recount-unify/Snakefile.study_jxs:
> Command ' set -euo pipefail;
>       if [[ "True" == "True" ]]; then
>             rm -f jx_sqlite_import
>             mkfifo jx_sqlite_import
>             sqlite3 junctions.sqlite < /recount-unify/snaptron/deploy/snaptron_schema.sql
>             cat all.sjs.motifs.merged.tsv | pypy /recount-unify/annotate/annotate_sjs.py --compiled-annotations /container-mounts/ref/annotated_junctions.tsv.gz --motif-correct --compilation-id 101 | tee jx_sqlite_import | bgzip -@ 7 > junctions.bgz &
>             sqlite3 junctions.sqlite -cmd '.separator "   "' ".import ./jx_sqlite_import intron"
>             sqlite3 junctions.sqlite < /recount-unify/snaptron/deploy/snaptron_schema_index.sql
>         else
>             cat all.sjs.motifs.merged.tsv | pypy /recount-unify/annotate/annotate_sjs.py --compiled-annotations /container-mounts/ref/annotated_junctions.tsv.gz --motif-correct --compilation-id 101 | bgzip -@ 7 > junctions.bgz
>             touch junctions.sqlite
>         fi
>         tabix -s2 -b3 -e4 junctions.bgz
>         if [[ "True" == "True" ]]; then
>             /recount-unify/scripts/join_railID_to_sample_metadata.sh /container-mounts/working/ids.tsv /container-mounts/working/sample_metadata.tsv | sort -t'   ' -k1,1n > sorted_samples.tsv
>             cat /container-mounts/working/ids.tsv.new_header sorted_samples.tsv > samples.tsv
>             /recount-unify/snaptron/deploy/build_lucene_indexes.sh samples.tsv all
>         else
>             touch samples.tsv lucene_indexed_numeric_types.tsv samples.fields.tsv
>             mkdir -p lucene_full_standard lucene_full_ws
>         fi ' returned non-zero exit status 255.
>   File "/recount-unify/Snakefile.study_jxs", line 299, in __rule_annotate_all_sjs
>   File "/opt/conda/envs/recount-unify/lib/python3.9/concurrent/futures/thread.py", line 52, in run
> Removing output files of failed job annotate_all_sjs since they might be corrupted:
> junctions.bgz, junctions.bgz.tbi, junctions.sqlite
> Shutting down, this might take some time.
> Exiting because a job execution failed. Look above for error message

Sample metadata file as:

study_id    sample_id
mirytest    DP9
mirytest    DP8
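For reference, the perl one-liner in the trace above performs a study|sample → rail_id join between ids.tsv (columns: study, sample, rail_id) and this metadata file. A minimal awk sketch of the same join; the rail_id values here are made up, since the real ones aren't shown in the post:

```shell
# Simulated inputs (rail_ids 1000001/1000002 are hypothetical):
printf 'mirytest\tDP9\t1000001\nmirytest\tDP8\t1000002\n' > ids.tsv
printf 'study_id\tsample_id\nmirytest\tDP9\nmirytest\tDP8\n' > sample_metadata.tsv

# Build the study|sample -> rail_id map from ids.tsv, then join it to the metadata.
# (The real script writes the header row to ids.tsv.new_header instead of stdout.)
awk -F'\t' 'NR==FNR {id[$1"|"$2]=$3; next}
            FNR==1  {print "rail_id\t"$2"\t"$1; next}
            {print id[$1"|"$2]"\t"$2"\t"$1}' ids.tsv sample_metadata.tsv
```

If any study|sample pair in the metadata is missing from the map (for example because a field was split or padded differently), the real perl script prints "failed to map ... to a rail_id" and exits, which is exactly the error shown above.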

What I did notice is that the ids.tsv.new_header file looks like the below, with a line break between sample_id and study_id.

rail_id sample_id
    study_id
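If that two-line header is really one metadata line with an embedded carriage return or stray newline, it would explain the perl step failing to map anything. A hypothetical way to check for and strip hidden carriage returns (the CRLF content below is simulated, not taken from the actual file):

```shell
# Simulate a metadata file saved with Windows-style CRLF line endings:
printf 'study_id\tsample_id\r\nmirytest\tDP9\r\n' > sample_metadata.tsv

cat -A sample_metadata.tsv                    # CRLF endings show up as "^M$"

tr -d '\r' < sample_metadata.tsv > sample_metadata.clean.tsv
cat -A sample_metadata.clean.tsv              # clean lines end in a plain "$"
```

Editors and spreadsheet exports are a common source of such invisible characters in TSVs, so it is worth ruling out before digging deeper.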

Run folder looks like this at the end with the sums etc populated:

all.exon_bw_count.pasted.gz                       blank_exon_sums                   ids.input                          M023.gene_sums.snaptron_header.tsv  setup_links.run
all.exon_bw_count.pasted.gz.pre_existing          ERCC.gene_sums.tsv                ids.input.group_counters           M023.gene_sums.tsv                  SIRV.gene_sums.tsv
all.exon_bw_count.pasted.gz.samples_header        ERCC.gene_sums.tsv.gz             ids.tsv                            M023.gene_sums.tsv.gz               SIRV.gene_sums.tsv.gz
all.exon_counts.rejoined.tsv.gz                   exon_checks.tsv                   ids.tsv.new_header                 patroller_find_done.tsv             sorted_samples.tsv
all.exon_counts.rejoined.tsv.gz.accession_header  exon_sums.annotation_splits.jobs  ids.tsv.num_samples_per_study.tsv  qc_1.tsv                            staging
all.exon_counts.rejoined.tsv.gz.coords            exon_sums_per_study               ids.tsv.studies                    qc.err                              staging_jxs
all.gene_counts.rejoined.tsv.gz                   exon_sums.study_splits.jobs       input_from_pump                    recount-unify_latest.sif            unique.exon_bw_count.pasted.gz
all.intron_counts.rejoined.tsv.gz                 gene_checks.tsv                   intron_counts_summed.tsv           recount-unify.output.jxs.txt        unique.exon_bw_count.pasted.gz.pre_existing
all_logs_and_tsvs.pkl                             gene_sums_per_study               intron_counts_summed.tsv.err       recount-unify.output.sums.txt
all.logs.tar.gz                                   gene_sums.splits.ERCC.jobs        junction_counts_per_study          recount-unify.sums.stats.json
all.sjs.motifs.merged.tsv                         gene_sums.splits.M023.jobs        jx_sqlite_import                   run_unify.sh
assign_compilation_ids.py.errs                    gene_sums.splits.SIRV.jobs        links                              sample_metadata.tsv

Other than the above observation - a little bit lost as to how to fix this. Thank you!

mdrnao commented 2 years ago

Further tests today!

Trying with your test FASTQ files (and elongating my own file names), today I consistently get stuck at a different error message:

Running QC stats collection

+ python3 /recount-unify/log_qc/parse_logs_for_qc.py --incoming-dir links --sample-mapping /container-mounts/working/ids.tsv --intron-sums intron_counts_summed.tsv
++ cat qc_1.tsv
++ wc -l
+ num_samples_qc=1
+ num_samples_qc=0
+ [[ 0 -ne 1 ]]
+ echo 'FAILURE in pump output QC stats collection (post gene/exon unify), # QC rows (0) != # samples (1)'
FAILURE in pump output QC stats collection (post gene/exon unify), # QC rows (0) != # samples (1)
+ exit -1

ChristopherWilks commented 2 years ago

hi @mdrnao

In your first post, it'd be good to see the contents of the following files: /container-mounts/working/ids.tsv and /container-mounts/working/sample_metadata.tsv (you may have posted the latter already, but I need to make sure)

but yes, I suggest elongating your sample names to at least 6 characters and re-running pump with the updated names and then unifier.

In your last post, to clarify, you are using these FASTQs(?)

http://snaptron.cs.jhu.edu/data/temp/SRR390728_1.fastq.gz

http://snaptron.cs.jhu.edu/data/temp/SRR390728_2.fastq.gz

also, what is the exact command you are running in both posts?

thanks, Chris

mdrnao commented 2 years ago

Hi Chris,

Thanks so much for getting back to me, I have a feeling this might be user error!

For simplicity, I'll display everything using the fastq files you provided.

I call pump like so, which ends with success:

/bin/bash ~/<full path>/recount/monorail-external/singularity/run_recount_pump.sh \
    ~/<full path>/recount/recount-rs5_latest.sif \
    SRR390728 \
    local \
    grcm38 \
    20 \
    ~/<full path>/recount/references \
    ~/<full path>/recount/test_set/SRR390728_1.fastq.gz \
    ~/<full path>/recount/test_set/SRR390728_2.fastq.gz \
    SRR390728

I then call unify with:

/bin/bash ~/<full path>/recount/monorail-external/singularity/run_recount_unify.sh \
    recount-unify_1.0.9.sif \
    grcm38 \
    ~/<full path>/recount/references \
    ~/<full path>/recount/test_set/unify/ \
    ~/<full path>/recount/test_set/pump/pump_output \
    ~/<full path>/recount/test_set/unify/sample_metadata.tsv \
    20 \
    examplefiles:110

in which the end of the nohup.out file displays:

Running QC stats collection
+ python3 /recount-unify/log_qc/parse_logs_for_qc.py --incoming-dir links --sample-mapping /container-mounts/working/ids.tsv --intron-sums intron_counts_summed.tsv
++ cat qc_1.tsv
++ wc -l
+ num_samples_qc=1
+ num_samples_qc=0
+ [[ 0 -ne 1 ]]
+ echo 'FAILURE in pump output QC stats collection (post gene/exon unify), # QC rows (0) != # samples (1)'
FAILURE in pump output QC stats collection (post gene/exon unify), # QC rows (0) != # samples (1)
+ exit -1
INFO:    Cleaning up image...

my sample_metadata.tsv file looks like so:

study_id    sample_id
testset SRR390728 

and the ids.tsv file:

testset SRR390728 904784

I also have a patroller_find_done.tsv file:

testset links/et/testset/28/SRR390728/SRR390728_in2 0 links/et/testset/28/SRR390728/SRR390728_in2_att0 SRR390728!testset!grcm38!local.manifest SRR390728_in2_att0

and for completeness, qc_1.tsv looks like so:

study sample bc_auc.all_% bc_auc.unique_% exon_fc.all_% exon_fc.unique_% gene_fc.all_% gene_fc.unique_% intron_sum intron_sum_% star.all_mapped_reads star.all_mapped_reads2

I think this should satisfy the 6-character suggestion.
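For what it's worth, the failing check can be reconstructed from the xtrace above: the two consecutive num_samples_qc= lines are the raw wc -l count and then that count minus the header row. With a header-only qc_1.tsv the result is 0, which mismatches the 1 expected sample. A sketch under those assumptions (variable names from the log, file contents simulated):

```shell
# Simulate qc_1.tsv containing only its header row, as in the post:
printf 'study\tsample\tbc_auc.all_%%\n' > qc_1.tsv

num_samples=1                            # one sample: SRR390728
num_samples_qc=$(cat qc_1.tsv | wc -l)   # 1 line in total
num_samples_qc=$((num_samples_qc - 1))   # subtract the header row -> 0

if [ "$num_samples_qc" -ne "$num_samples" ]; then
    echo "QC rows ($num_samples_qc) != samples ($num_samples)"
fi
# prints: QC rows (0) != samples (1)
```

So the check itself is fine; the real question is why parse_logs_for_qc.py found no per-sample QC rows under the links directory for your run.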

Thank you for the help, Holly

ChristopherWilks commented 2 years ago

Hi @mdrnao,

Thanks for the additional information and your patience.

So in the pump command, the last argument SRR390728 should be testset to match what you're running in the unifier. That is, when running samples, Monorail needs to know at least 3 things: 1) sample ID/name 2) study ID/name and 3) project name:project_id (aka datasource). And these need to be consistent throughout the process.

While you could try to switch the unifier to use SRR390728 as the study to match what you already did in pump, I'm not sure what will happen with both the sample and the study having the same ID so it's probably best to re-run pump with the proper study name and not mix sample and study names.
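A sketch of the corrected pump call, based on the command posted earlier in the thread with only the final study argument changed from SRR390728 to testset (paths left as placeholders, as in the original):

```
/bin/bash ~/<full path>/recount/monorail-external/singularity/run_recount_pump.sh \
    ~/<full path>/recount/recount-rs5_latest.sif \
    SRR390728 \
    local \
    grcm38 \
    20 \
    ~/<full path>/recount/references \
    ~/<full path>/recount/test_set/SRR390728_1.fastq.gz \
    ~/<full path>/recount/test_set/SRR390728_2.fastq.gz \
    testset
```

This keeps the sample ID (SRR390728) and study ID (testset) distinct and consistent with the unifier invocation.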

The second thing is, when running/debugging the test dataset, it's probably best to stick with the suggested default project_name:project_id of sra:101 instead of examplefiles:110 in the unifier command.

mdrnao commented 2 years ago

Hi @ChristopherWilks ,

Thanks for getting back to me! I've re-run everything with the above changes, and I'm afraid we hit the same error.

+ echo 'Running QC stats collection'
Running QC stats collection
+ python3 /recount-unify/log_qc/parse_logs_for_qc.py --incoming-dir links --sample-mapping /container-mounts/working/ids.tsv --intron-sums intron_counts_summed.tsv
++ cat qc_1.tsv
++ wc -l
+ num_samples_qc=1
+ num_samples_qc=0
+ [[ 0 -ne 1 ]]
+ echo 'FAILURE in pump output QC stats collection (post gene/exon unify), # QC rows (0) != # samples (1)'
FAILURE in pump output QC stats collection (post gene/exon unify), # QC rows (0) != # samples (1)
+ exit -1
INFO:    Cleaning up image...
CallMeMisterOwl commented 2 years ago

> I'm trying a test set of two local samples and failing at the final steps of the Unify pipeline!
> [...]
> Other than the above observation - a little bit lost as to how to fix this. Thank you!

Hi there, quick question: what happened to this issue? Did you solve it, or did it just go stale? We are getting the same error and are kind of stuck, especially since I'm uncertain whether it's caused by a bug or by the students not executing the code properly.

Cheers

ChristopherWilks commented 2 years ago

Hi @CallMeMisterOwl and @mdrnao

I don't think this is a bug as others have been successful at running the test sample.

Another thing to change in your command line, which I missed earlier, is the reference: it should be hg38 rather than grcm38, since SRR390728 is a human sample. It's also possible Singularity 3.8.5 is an issue (most of our runs were done on Singularity 2.6), though that's a long shot. Beyond that, I'm not sure what the problem is.

As far as further troubleshooting your runs, I can't promise anything, as there's no ongoing, dedicated support for monorail-external: I'm no longer at Hopkins and there's no replacement person to maintain this currently, though that may change in the future.

mdrnao commented 2 years ago

Hi @ChristopherWilks

So, I deleted everything, re-downloaded everything fresh, and tried again, carefully correcting as per your suggestions, but I'm stuck on the same error. I appreciate the attempts to help up to this point, despite you being in a new post! My group have just filled a software engineer post, so I'll hand this off to them when they arrive as a trial by fire. I'll update you if we get to a solution. We're really keen to get this working!

@CallMeMisterOwl Considering I'm stuck, I wouldn't pay much notice as to how I got around the first error! If you're still stuck and want me to - I can go through my logs to check?

All the best, Holly

CallMeMisterOwl commented 2 years ago

@CallMeMisterOwl Considering I'm stuck, I wouldn't pay much notice as to how I got around the first error! If you're still stuck and want me to - I can go through my logs to check?

Hi @mdrnao I think that won't be necessary. We fixed it, seems like the sample_metadata.tsv was somehow corrupted. Thank you for the offer though and good luck with your issue, hope you can solve it.

mdrnao commented 2 years ago

Thanks for your help @ChristopherWilks, I've fixed the problem.

It looks like it was alias issues when collating the QC stats files (any alias that goes near a CIFS-mounted drive breaks, even with the symlinks flag).

All the best, Holly

ChristopherWilks commented 2 years ago

great, glad to hear it's working now, thanks for posting the follow-up @mdrnao