twolinin / longphase

GNU General Public License v3.0

The parameter setting in your pipeline of variant calling #91

Open Jerry-bioinformatics opened 4 days ago

Jerry-bioinformatics commented 4 days ago

Hello, in the LongPhase paper, you used minimap2 to align reads to the reference and then called variants with the PEPPER-WhatsHap-DeepVariant pipeline. However, to my understanding, minimap2 outputs a .sam file, while the variant calling pipeline takes a .bam file as input, so samtools is needed to convert .sam to .bam. When using samtools, a FLAG parameter (-F or -f) can be used to filter out unwanted alignments. I am not sure how to set the FLAG for long reads, so could you share the parameter settings you used for LongPhase? Thanks very much.

twolinin commented 11 hours ago

Hi @Jerry-bioinformatics,

In our input preparation, we use samtools sort to convert the .sam file to a .bam file.

# sort alignment file
samtools sort -@ 10 alignment.sam -o alignment.bam

We did not use the FLAG field to filter any alignments during the data preparation stage. Instead, our program uses the FLAG information to select specific alignments (unmapped, primary, secondary, supplementary, etc.). I suggest keeping the complete SAM/BAM alignment file so that downstream analysis software can filter the alignments it needs.
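As a rough illustration of how a program can sort alignments into these classes from the FLAG field alone, here is a minimal sketch. The bit values (4 = unmapped, 256 = secondary, 2048 = supplementary) come from the SAM specification; the function name classify_flag is hypothetical and is not part of LongPhase or samtools.

```shell
# Classify a SAM FLAG value by alignment type.
# Bit values are taken from the SAM specification:
#   4 = unmapped, 256 = secondary, 2048 = supplementary.
# classify_flag is a hypothetical helper, not a LongPhase or samtools command.
classify_flag() {
  if [ $(( $1 & 4 )) -ne 0 ]; then
    echo "unmapped"
  elif [ $(( $1 & 256 )) -ne 0 ]; then
    echo "secondary"
  elif [ $(( $1 & 2048 )) -ne 0 ]; then
    echo "supplementary"
  else
    echo "primary"
  fi
}

classify_flag 16     # reverse-strand alignment with no other bits set
classify_flag 2064   # 2048 + 16: reverse-strand supplementary alignment
```

Because every class can be recovered from the FLAG this way, an unfiltered BAM loses nothing: each downstream tool can still make its own selection.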

Thanks

Jerry-bioinformatics commented 11 hours ago

@twolinin Hello, I tried the small variant caller Clair3 with different FLAG settings in samtools [4079 (primary only), 3823 (plus secondary), 2031 (plus supplementary), 1775 (plus secondary and supplementary)], and the output VCF stayed consistent. However, when calling SVs with cuteSV, for example, different FLAG settings in samtools can lead to different output VCFs. I suspect that SV callers like cuteSV do not filter out every specific alignment type themselves, including unmapped, secondary and supplementary.
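To make the four values easier to compare, the sketch below decodes each one, assuming they are exclusion masks of the kind passed to samtools view -F (the thread does not state which option was used). The helper name describe_exclude_mask is hypothetical.

```shell
# Report which alignment classes survive an exclusion mask, i.e. a value
# assumed to be passed as `samtools view -F MASK` (drop reads with any masked bit set).
# describe_exclude_mask is a hypothetical helper name.
describe_exclude_mask() {
  kept="primary"
  # A class survives only if its FLAG bit is absent from the mask.
  [ $(( $1 & 256 )) -eq 0 ] && kept="$kept secondary"
  [ $(( $1 & 2048 )) -eq 0 ] && kept="$kept supplementary"
  [ $(( $1 & 4 )) -eq 0 ] && kept="$kept unmapped"
  echo "$kept"
}

for mask in 4079 3823 2031 1775; do
  echo "$mask: $(describe_exclude_mask "$mask")"
done
```

Under that assumption the four masks differ only in bits 256 and 2048, which matches the labels given in the list above.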

twolinin commented 11 hours ago

@Jerry-bioinformatics

Sniffles and cuteSV use different alignment types to detect various structural variants (SVs). For example, a single read with both a primary alignment and a supplementary alignment might indicate a large deletion. Similarly, the presence of a secondary alignment in the same reference region could suggest a duplication. As a result, filtering the alignments differently will produce correspondingly different VCF files.
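To see which reads carry this kind of split-alignment evidence in a given file, one option is to list the read names that have both a primary and a supplementary record. This is only a sketch for inspecting the input, not how Sniffles or cuteSV work internally. The function name split_reads is hypothetical, and it expects SAM text on standard input.

```shell
# List read names that have both a primary and a supplementary alignment.
# Reads SAM text on stdin; header lines (starting with @) are skipped.
# split_reads is a hypothetical helper, not part of any SV caller.
split_reads() {
  awk -F '\t' '
    /^@/ { next }                                   # skip header lines
    {
      flag = $2
      if (int(flag / 4) % 2) next                   # bit 4: unmapped
      if (int(flag / 2048) % 2) supp[$1] = 1        # bit 2048: supplementary
      else if (!(int(flag / 256) % 2)) prim[$1] = 1 # neither 256 nor 2048: primary
    }
    END { for (r in supp) if (r in prim) print r }
  ' | sort
}
```

A typical invocation would be samtools view -h alignment.bam | split_reads. If supplementary records were removed before sorting, this list comes back empty, which is one way to check whether a filtered BAM still contains the evidence an SV caller relies on.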