mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

Continue Jasmine from where it failed #21

Closed mariesaitou closed 3 years ago

mariesaitou commented 3 years ago

Dear Melanie,

Hi, following your advice, I have been running Jasmine on my eleven samples. I have two questions about the job.

(1) The job stopped before finishing, with no obvious error in the log. There are several completed (I guess) "XX(0-9)_minimap.SVs.phased_dupToIns_irisRefined.vcf" files and one unfinished "10_minimap.SVs.phased_dupToIns_irisRefined.vcf" file in the output directory.

Slurm Job_id=13504607 Name=jasmine629 Ended, Run time 14-14:16:44, FAILED, ExitCode 1

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
13504607     jasmine629    hugemem       nmbu         60     FAILED      1:0 
13504607.ba+      batch                  nmbu         60     FAILED      1:0 

Is it possible to continue the job from where it stopped (at individual 10), rather than starting everything over? It has already taken a long time.

(2) It has taken more than 14 days to process 10 samples and produce the "minimap.SVs.phased_dupToIns_irisRefined.vcf" files so far. Is that expected? Is there a way to accelerate the process? My script is below.

#!/bin/bash
#SBATCH --ntasks=60               # 60 tasks (CPUs)
#SBATCH --nodes=1                # Use 1 node
#SBATCH --job-name=jasmine629
#SBATCH --mem=30G                 # Default memory per CPU is 3GB
#SBATCH --partition=hugemem,orion
#SBATCH --mail-user=marie.saitou@nmbu.no # Email me when job is done.
#SBATCH --mail-type=END
#SBATCH --output=%x_%J_%a.out

module load Anaconda3
conda activate jasmine
module load SAMtools
module load HTSlib
module load VCFtools

vcf_list='./vcf.list'
bam_list='./bam.list'
genome_file='/CHR_selected.fa'

jasmine file_list=vcf.list out_file=./jasmine11.629.vcf \
    genome_file=/CHR_selected.fa bam_list=bam.list threads=60 \
    --output_genotypes --normalize_type --dup_to_ins \
    --run_iris iris_args=--keep_long_variants \
    --default_zero_genotype --ignore_strand

I thought threads=60 would apply throughout the whole job, but perhaps I have misused this option.

Thank you very much, again.

mkirsche commented 3 years ago

Hi,

As for your first question, there is not currently a way to resume a failed Jasmine run. My recommendation would be to run Iris separately on each remaining sample: either invoke iris directly with the appropriate arguments from the same conda environment that has Jasmine, or run Jasmine with --run_iris and --preprocess_only. Then run Jasmine on the list of Iris-refined VCFs without the --run_iris argument. For each sample it would look something like this:

jasmine --dup_to_ins --preprocess_only vcf_filelist=sample_minimap.SVs.vcf --comma_filelist
iris threads=20 vcf_in=output/sample_minimap.SVs.phased_dupToIns.vcf genome_file=genome.fa reads_in=sample.bam vcf_out=output/sample_minimap.SVs.phased_dupToIns_irisRefined.vcf

As for speeding up Iris, the number of threads is specified separately for Iris, and you can modify it as part of iris_args (e.g., iris_args=--keep_long_variants,threads=20). This should significantly speed up that part of your pipeline.
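Applied to the command from earlier in the thread, that would look something like the sketch below (file names and paths are the placeholders already used above, not real locations):

```shell
# Sketch: the earlier jasmine invocation, with Iris given its own
# thread count via iris_args. Paths are placeholders from the thread.
jasmine file_list=vcf.list out_file=./jasmine11.629.vcf \
    genome_file=/CHR_selected.fa bam_list=bam.list threads=60 \
    --output_genotypes --normalize_type --dup_to_ins \
    --run_iris iris_args=--keep_long_variants,threads=20 \
    --default_zero_genotype --ignore_strand
```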

I hope that helps! Melanie

mariesaitou commented 3 years ago

Thank you very much. I have now tried the first, preprocessing part.

jasmine --dup_to_ins --preprocess_only file_list=${SLURM_ARRAY_TASK_ID}.vcf.list  --comma_filelist
iris threads=20 vcf_in=output/${SLURM_ARRAY_TASK_ID}_minimap.SVs.phased_dupToIns.vcf genome_file=/net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa reads_in=${SLURM_ARRAY_TASK_ID}.bam.list vcf_out=output/${SLURM_ARRAY_TASK_ID}_minimap.SVs.phased_dupToIns_irisRefined.vcf

Then I got: Error: Cannot convert duplications to insetions without a genome file specified. (I am not sure whether "--comma_filelist" is a real option, or you were referring to something else.)

Then I tried:

jasmine --dup_to_ins --preprocess_only file_list=${SLURM_ARRAY_TASK_ID}.vcf.list bam_list=${SLURM_ARRAY_TASK_ID}.bam.list out_file=output genome_file=/net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa --run_iris iris_args=--keep_long_variants

Then I got an error which I had not seen in any previous runs.

Exception in thread "main" java.lang.Exception: samtools faidx did not produce an output: samtools faidx /net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa ssa01:132539-133059
    at GenomeQuery.genomeSubstring(GenomeQuery.java:60)
    at DuplicationsToInsertions.convertFile(DuplicationsToInsertions.java:76)
    at PipelineManager.convertDuplicationsToInsertions(PipelineManager.java:41)
    at Main.preprocess(Main.java:35)
    at Main.main(Main.java:17)

The fasta file is the same one I used previously. samtools faidx works and produces the index file in the original directory when I run it manually.

Do you have any idea to move it forward?

Thank you very much!!

mkirsche commented 3 years ago

Hi,

As for the first issue, that was my mistake in the last reply: you also need to pass the genome_file parameter in that first command.

For the second issue, it looks like you somehow have a variant call that is outside the range of the genome. What happens when you run just samtools faidx /net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa ssa01:132539-133059 (the command that failed) on its own? My guess is that the ssa01 sequence is shorter than 133 kbp, and that is causing Jasmine to fail when it tries to extract the duplicated sequence for one of the duplication SV calls.
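One way to test this hypothesis is to compare the variant's end coordinate against the contig length recorded in the .fai index that samtools faidx produces. A minimal sketch using a toy index line (the real index would be Simon_Final2021_CHR_selected.fa.fai; the 200000 bp length here is invented for illustration):

```shell
# Toy .fai line: contig name, length, byte offset, bases per line, bytes per line.
# The 200000 bp contig length is made up for this example.
printf 'ssa01\t200000\t7\t60\t61\n' > toy.fa.fai

# Report whether the requested end coordinate fits inside the contig.
awk -v chr="ssa01" -v end=133059 \
    '$1 == chr { print (($2 >= end) ? "within contig" : "past contig end") }' toy.fa.fai
```

Running the same check against the real .fai, with the chromosome and end position of each duplication call, would reveal any call past the contig end.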

Melanie

mariesaitou commented 3 years ago

Well, it gives me the sequence... and I am not sure what is going on, because previously (two weeks ago) Jasmine worked with the same VCF and FASTA set for some samples, until it stopped...

(base) [mariesai@login jasmine629]$ samtools faidx /net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa ssa01:132539-133059
>ssa01:132539-133059
GGATTGATGCATCAGTACGTGAATCAAAGTAAACGAATGTAAAAATACCCATTCAAATCG
ATCAGTTTTCAACTAGAGAGATTGAGATGTGTCGTATCTGTCTCAAGCCTCCGTAACTGT
TTATGTTACACCTCCACATCTGCAATGAAAGGTGGTGGAGCTAGAGCGGTGTTAGTTAAA
CCATGAGACATACCTATAATCGGTCTTCTCACAAAAACATCTGTAGCTTCTGAACCGTTT
GACCTACAAAATCGTATGACCACTCCCTGGAAAGGGGAGACCCTCATGAACACAGTGGTG
TTCTCAATTTGTCTCTACGTCCCCCACAAGTGTCAGCGGACTCGTCTGAAGTCGGAACCC
TTTATGTGCCAACTTCTGTTTGTAGAATCTAAAATCATTGGGCTACAAACTAATGTGACC
CAACTTGTCACAATGTTCTCAGTTTTGCCCTACGTTCCCCACAAGTGTCACGGGACTAAT
CTGAAGGTAACAGGTACTGATTAAAATGATAGATTTTTTAA

And this is a subset of the previous output file.

(base) [mariesai@login jasmine629]$ grep "ssa01:132539" jasmine629_13504607_4294967294.out 
(base) [mariesai@login jasmine629]$ grep "ssa01:133059" jasmine629_13504607_4294967294.out 
Starting to process ssa01:133059:INS:1 (time = 00:17:56:52.824)
Found 2 relevant reads for ssa01:133059:INS:1 (time = 00:17:56:53.080)
Found 1 consensus sequences for ssa01:133059:INS:1 (time = 00:17:56:53.185)
Found 1 alignment records for ssa01:133059:INS:1 (time = 00:17:56:53.283)
Found refined SV of new length 545 and new pos 133054 for ssa01:133059:INS:1 (time = 00:17:56:53.284)
Done processing ssa01:133059:INS:1 (total processed = 84128) (time = 00:17:56:53.284)
Outputting refined insertion for ssa01:133059:INS:1 (time = 01:01:13:06.852)
Starting to process ssa01:133059:INS:1 (time = 00:04:50:49.741)
Found 2 relevant reads for ssa01:133059:INS:1 (time = 00:04:50:49.939)
Found 1 consensus sequences for ssa01:133059:INS:1 (time = 00:04:50:50.021)
Found 1 alignment records for ssa01:133059:INS:1 (time = 00:04:50:50.075)
Found refined SV of new length 551 and new pos 133016 for ssa01:133059:INS:1 (time = 00:04:50:50.075)
Done processing ssa01:133059:INS:1 (total processed = 37672) (time = 00:04:50:50.075)
Outputting refined insertion for ssa01:133059:INS:1 (time = 00:11:43:24.057)
mkirsche commented 3 years ago

That is very strange. The only thing I can think of is that the job might somehow be pointing to a different version of samtools than you were using before. Could you please try the jasmine command again with samtools_path=/path/to/samtools/executable added (after verifying that the specific samtools you point to can run that samtools faidx command and return a sequence)?
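To see which executable the job environment actually resolves, a check like this can go in the batch script before the jasmine call (a generic sketch; it only uses standard shell built-ins):

```shell
# Print the exact executable that "samtools" resolves to in this shell,
# or a notice if it is not on PATH at all.
st="$(command -v samtools || true)"
if [ -n "$st" ]; then
    echo "samtools resolves to: $st"
else
    echo "samtools not found on PATH"
fi
```

Comparing that path between the login shell and the job's output would show whether the two environments pick up different samtools builds.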

Thanks, Melanie

mariesaitou commented 3 years ago

I see... that is odd, since I used the exact same command (and the same samtools, version 1.11).

Base environment command (worked):

(jasmine) [mariesai@login jasmine715]$ samtools 

Program: samtools (Tools for alignments in the SAM format)
Version: 1.11 (using htslib 1.11)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates
     ampliconclip   clip oligos from the end of reads

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     coverage       alignment depth and percent coverage
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)
     ampliconstats  generate amplicon specific stats

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

(jasmine) [mariesai@login jasmine715]$ samtools faidx /net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa ssa01:132539-133059
>ssa01:132539-133059
GGATTGATGCATCAGTACGTGAATCAAAGTAAACGAATGTAAAAATACCCATTCAAATCG
ATCAGTTTTCAACTAGAGAGATTGAGATGTGTCGTATCTGTCTCAAGCCTCCGTAACTGT
TTATGTTACACCTCCACATCTGCAATGAAAGGTGGTGGAGCTAGAGCGGTGTTAGTTAAA
CCATGAGACATACCTATAATCGGTCTTCTCACAAAAACATCTGTAGCTTCTGAACCGTTT
GACCTACAAAATCGTATGACCACTCCCTGGAAAGGGGAGACCCTCATGAACACAGTGGTG
TTCTCAATTTGTCTCTACGTCCCCCACAAGTGTCAGCGGACTCGTCTGAAGTCGGAACCC
TTTATGTGCCAACTTCTGTTTGTAGAATCTAAAATCATTGGGCTACAAACTAATGTGACC
CAACTTGTCACAATGTTCTCAGTTTTGCCCTACGTTCCCCACAAGTGTCACGGGACTAAT
CTGAAGGTAACAGGTACTGATTAAAATGATAGATTTTTTAA

And this is the submitted job on our cluster.

#!/bin/bash
#SBATCH --nodes=1                # Use 1 node
#SBATCH --job-name=jasmine715
#SBATCH --mem=30G                 # Default memory per CPU is 3GB
#SBATCH --partition=hugemem,orion
#SBATCH --mail-user=marie.saitou@nmbu.no # Email me when job is done.
#SBATCH --mail-type=END
#SBATCH --output=%x_%J_%a.out

module load Anaconda3
source activate /mnt/users/mariesai/.conda/envs/jasmine
samtools 
samtools faidx /net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa ssa01:132539-133059

Output file (with error):

(jasmine) [mariesai@login jasmine715]$ sbatch jasmine715.slurm
Submitted batch job 13524596
(jasmine) [mariesai@login jasmine715]$ more jasmine715_13524596_4294967294.out

Program: samtools (Tools for alignments in the SAM format)
Version: 1.11 (using htslib 1.11)

Usage:   samtools <command> [options]

Commands:
  [full samtools command listing, identical to the output shown above]

/var/tmp/slurmd/job13524596/slurm_script: line 35: 62680 Illegal instruction     samtools faidx /net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa ssa01:132539-133059
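An "Illegal instruction" crash on a compute node, when the same binary runs fine on the login node, often means the executable was built for CPU features the node lacks. A generic check with standard Linux tools (not specific to this cluster) that can be run on both machines for comparison:

```shell
# Show the CPU model and whether the AVX2 SIMD flag is present on this
# machine; differing results between the login node and the failing
# compute node would point to an instruction-set mismatch.
grep -m1 'model name' /proc/cpuinfo || true
grep -o -w -m1 'avx2' /proc/cpuinfo || echo "avx2 not reported"
```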
mariesaitou commented 3 years ago

Well, this seems to have started working. I just added --preprocess_only to my previous command and ran each sample as a separate job, rather than providing all the input files together. I am still not sure what caused the samtools issue, but I'll keep watching the running jobs for a while.

jasmine --preprocess_only  file_list=${SLURM_ARRAY_TASK_ID}.vcf.list out_file=./jasmine11.715.vcf \
 genome_file=/net/fs-1/Transpose/Software/fasta/Simon_Final2021_CHR_selected.fa \
bam_list=${SLURM_ARRAY_TASK_ID}.bam.list threads=60 \
 --output_genotypes --normalize_type --dup_to_ins --run_iris iris_args=--keep_long_variants \
 --default_zero_genotype --ignore_strand
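For one sample per job as described, a Slurm array header along these lines would drive the command above (the array range and resource numbers are illustrative, not taken from the thread):

```shell
#!/bin/bash
#SBATCH --array=0-10             # one array task per sample (illustrative range)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20       # match the Iris thread count
#SBATCH --mem=30G
#SBATCH --output=%x_%A_%a.out    # %a keeps array tasks in separate logs

module load Anaconda3
conda activate jasmine
# ${SLURM_ARRAY_TASK_ID} then selects <id>.vcf.list and <id>.bam.list
```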

Thank you very much!

mkirsche commented 3 years ago

I'm glad it seems to be working when you separate it out - hopefully it stays that way!

As for the other run, it looks like some kind of cluster issue if it won't even let you run that samtools faidx command in isolation. It could just be that something has changed with the cluster configuration since you were running it before, but here are a few other things that might be worth trying if you start running into issues again:

mariesaitou commented 3 years ago

Great, thank you very much!

mariesaitou commented 3 years ago

Sorry, one more question: when I run the Iris preprocessing for multiple samples in parallel, should I separate them into different directories? Iris seems to use the same names for its temp files (chr20:21495479:INS:384589.sam etc.), so could it mix them up across samples?

mkirsche commented 3 years ago

That's a good point! It normally wouldn't be necessary when running the Iris refinement within a single Jasmine command, since the samples are refined one at a time, but with separate Slurm jobs for all samples there is some potential for overlap. It would be worth adding out_dir=path/to/[sampleid]/outdir as part of iris_args just to be extra safe. That said, overlaps are extremely unlikely in practice, since each thread processes a single variant at a time and deletes the intermediate files before moving on to the next variant.
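A minimal sketch of that suggestion, assuming the sample id comes from the Slurm array variable (the directory layout here is made up for illustration):

```shell
# Build a per-sample Iris working directory so parallel array jobs
# cannot collide on intermediate files like chr20:21495479:INS:384589.sam.
sample="${SLURM_ARRAY_TASK_ID:-0}"      # falls back to 0 outside Slurm
out_dir="iris_out/${sample}"
mkdir -p "${out_dir}"
iris_args="--keep_long_variants,threads=20,out_dir=${out_dir}"
echo "iris_args=${iris_args}"
```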

mariesaitou commented 3 years ago

Great, thanks a lot for your prompt responses!