vgteam / giraffe-sv-paper


QUESTIONS in code of SV genotyping #9

Open NMUzhoujun opened 2 years ago

NMUzhoujun commented 2 years ago

Hi,

While converting CRAM to FASTQ, I found this command in your WDL workflow:

seq 0 ~{in_nb_chunks} | head -n ~{in_max_chunks} | parallel -j ~{in_cram_convert_cores} "samtools collate -k {} -K ~{in_nb_chunks} --reference ~{in_ref_file} -Ouf ~{in_cram_file} {} | samtools fastq -1 reads.{}.R1.fastq.gz -2 reads.{}.R2.fastq.gz -0 reads.{}.o.fq.gz -s reads.{}.s.fq.gz -c 1 -N -"

However, samtools collate does not seem to have a "-k" or "-K" option. Could you please explain this and confirm which options this step actually uses?

Thanks!

glennhickey commented 2 years ago

This line seems to come from vg_mapgaffe_call_sv_cram.wdl:

 seq 0 ~{in_nb_chunks} | head -n ~{in_max_chunks} | parallel -j ~{in_cram_convert_cores} "samtools collate -k {} -K ~{in_nb_chunks} --reference ~{in_ref_file} -Ouf ~{in_cram_file} {} | samtools fastq -1 reads.{}.R1.fastq.gz -2 reads.{}.R2.fastq.gz -0 reads.{}.o.fq.gz -s reads.{}.s.fq.gz -c 1 -N -"
    >>>
    output {
        Array[File] output_read_chunks_1 = glob("reads.*.R1.fastq.gz")
        Array[File] output_read_chunks_2 = glob("reads.*.R2.fastq.gz")
    }
    runtime {
        cpu: in_cram_convert_cores
        memory: "50 GB"
        disks: "local-disk " + in_cram_convert_disk + " SSD"
        docker: "jmonlong/samtools-jm:release-1.19jm0.2.2"
        preemptible: in_preemptible
    }

The runtime block specifies this image: docker: "jmonlong/samtools-jm:release-1.19jm0.2.2". The collate in that image does have -k and -K:

docker run jmonlong/samtools-jm:release-1.19jm0.2.2 samtools collate
Usage: samtools collate [-Ou] [-o <name>] [-n nFiles] [-l cLevel] <in.bam> [<prefix>]

Options:
      -O       output to stdout
      -o       output file name (use prefix if not set)
      -u       uncompressed BAM output
      -f       fast (only primary alignments)
      -r       working reads stored (with -f) [10000]
      -l INT   compression level [1]
      -n INT   number of temporary files [64]
      -k INT   the read chunk to output during CRAM conversion. In [0,N-1]. Used if N>0.
      -K INT   the number of read chunks to consider during CRAM conversion. 0 (default) means no chunking.
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
  <prefix> is required unless the -o or -O options are used.
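
To make the chunking concrete, here is a minimal dry-run sketch of how the `seq ... | head ...` part of the workflow command picks the values passed to `-k`. The variables `nb_chunks` and `max_chunks` and the helper `chunk_indices` are hypothetical stand-ins for the WDL inputs `in_nb_chunks` and `in_max_chunks`. It only echoes the commands, it does not run samtools:

```shell
#!/bin/sh
# Hypothetical values standing in for the WDL inputs in_nb_chunks / in_max_chunks.
nb_chunks=4
max_chunks=2

# Print the chunk indices that would be handed to `samtools collate -k`:
# count from 0 up to nb_chunks, then keep only the first max_chunks values.
chunk_indices() {
    seq 0 "$1" | head -n "$2"
}

# Dry run: show one collate invocation per selected chunk (remaining arguments elided).
for k in $(chunk_indices "$nb_chunks" "$max_chunks"); do
    echo "samtools collate -k $k -K $nb_chunks [remaining arguments as in the workflow]"
done
```

Note that `seq 0 N` emits N+1 values (0 through N), while the help text above gives the range of `-k` as [0,N-1], so whether the last value is ever used depends on the `head -n` limit. The thread does not say how the workflow sets it.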

This is a customized version of samtools: https://github.com/jmonlong/samtools-jm
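
If you want to check whether a given samtools has these options before running the workflow outside the image, something like the sketch below could work. `has_chunk_flags` is a hypothetical helper, not part of samtools. It just looks for `-k INT` and `-K INT` lines in usage text read from stdin, assuming the usage is formatted like the output above:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the usage text on stdin lists
# both the -k INT and -K INT options (grep is case-sensitive, so they differ).
has_chunk_flags() {
    usage=$(cat)
    printf '%s\n' "$usage" | grep -Eq '^[[:space:]]*-k[[:space:]]+INT' &&
        printf '%s\n' "$usage" | grep -Eq '^[[:space:]]*-K[[:space:]]+INT'
}

# Demo on two lines copied from the usage output quoted in this thread.
sample='      -k INT   the read chunk to output during CRAM conversion. In [0,N-1]. Used if N>0.
      -K INT   the number of read chunks to consider during CRAM conversion. 0 (default) means no chunking.'

if printf '%s\n' "$sample" | has_chunk_flags; then
    echo "chunking flags found"
else
    echo "chunking flags missing"
fi
```

Against a real binary, you would pipe the usage text in instead, e.g. `samtools collate 2>&1 | has_chunk_flags` (the `2>&1` is an assumption, in case the usage is printed to stderr).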

NMUzhoujun commented 2 years ago

Thank you. It has been solved.