tjiangHIT / cuteSV

Long read based human genomic structural variation detection with cuteSV
MIT License

Start position #76

Open Akazhiel opened 2 years ago

Akazhiel commented 2 years ago

Hello!

I've seen in the changelog that for DUPs the start position was changed from 0-based to 1-based. What's the case for the rest of the variants, are they 0-based or 1-based?

Best regards,

Jonatan

Meltpinkg commented 2 years ago

Hello, @Akazhiel

Thanks for using cuteSV. According to the VCF format, the POS of a variant should be 1-based, so in cuteSV all variant types are reported with 1-based positions.

Best, Shuqi

Akazhiel commented 2 years ago

Hello @tjiangHIT

Yes, I am aware that the VCF POS is 1-based, but I still had some doubts: cuteSV generates its signatures in BED format, which is 0-based for the start position, so maybe a 1 should be added? My concern comes from the fact that I'm working with corrected reads, so I have exact breakpoints (you can see them in IGV), and CIPOS and CILEN are 0, which leads me to believe the reported positions should be exact. However, I find that the start position of a variant, let's say a deletion, is 1 bp upstream of where it should actually start.

Best regards,

Jonatan

Meltpinkg commented 2 years ago

Hello, @Akazhiel

The signature files that cuteSV generates are only temporary files used for outputting the detected SVs, and they are not strictly in BED format.

For your second doubt, I'd like to first clarify that, according to the VCF format, the POS of an SV is the position of the last nucleotide on the reference genome before the inserted or deleted sequence. In the newest version of cuteSV (v1.0.13), we have changed the SV position to follow the VCF format. For example, suppose that the 1005th to 1054th nucleotides (1-based, both ends inclusive) are deleted, giving a deletion of length 50 bp. In this case the POS of the deletion should be 1004. It may appear to be 1 bp upstream of the deleted sequence, but it is in fact the position of the last reference nucleotide before the deletion.
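As an illustration of the convention described above, here is a minimal Python sketch. It is not cuteSV code, and the function name `deletion_to_vcf` is made up for this example. It converts a 1-based, inclusive deleted interval into POS, END and SVLEN, and it also shows why POS can look like a BED-style start: the 0-based start of the deleted bases is numerically equal to the 1-based position of the base just before them.

```python
def deletion_to_vcf(first_deleted: int, last_deleted: int) -> dict:
    """Map a 1-based, inclusive deleted interval to VCF-style fields.

    Hypothetical helper for illustration only; not part of cuteSV.
    """
    if first_deleted < 1 or last_deleted < first_deleted:
        raise ValueError("expected 1 <= first_deleted <= last_deleted")
    length = last_deleted - first_deleted + 1
    return {
        "POS": first_deleted - 1,  # last reference base before the deletion
        "END": last_deleted,       # last deleted base (1-based, inclusive)
        "SVLEN": -length,          # deletions carry a negative length
    }


record = deletion_to_vcf(1005, 1054)

# The same deleted bases as a 0-based, half-open (BED-style) interval.
bed_interval = (1005 - 1, 1054)

print(record)        # {'POS': 1004, 'END': 1054, 'SVLEN': -50}
print(bed_interval)  # (1004, 1054)
```

Note that END equals POS plus the absolute value of SVLEN, which matches the records quoted later in this thread (for example 104945514 + 495 = 104946009).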

Hope it will help!

Best, Shuqi

Akazhiel commented 2 years ago

Hello @Meltpinkg

I think there may be some inconsistencies in how the start and end positions are computed. (Attached image: igv_snapshot)

In the image I attached you can see that the start and end positions of the deletion are really well defined for both the Control and the Tumor samples. But for some reason this deletion is not captured when running cuteSV on the Control sample, and on the Tumor sample the start position is 2-3 bases upstream of the actual position. This is possibly due to one or two reads that are in the wrong position and thus shift the start position. But if it's only 1-2 reads, they shouldn't shift the position when all the other reads, which are plentiful, support the actual position.

Do you have any idea of what could be happening?

Best regards, Jonatan

Meltpinkg commented 2 years ago

Hello, Jonatan

Regarding the Control sample, the deletion should not be missing from the report. I was also confused, and I thought about several possibilities. By default, cuteSV filters out reads whose number of split segments exceeds 7 or whose mapping quality is below 20, which may cause read signatures to be ignored. Could you share the command you used to run cuteSV? Also, if possible, could I get the part of your alignment data around the position in the image above? It would help me check why the deletion is missing. For example, you can extract the reads around the deletion with a command like:

samtools view {bam} -h -b -o check.bam chr14:104947000-104950000
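The default read filter mentioned above can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `Read` class, the `passes_filter` function and its parameter names are hypothetical, and only the two thresholds (7 split segments, mapping quality 20) come from this reply.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical minimal view of an aligned read."""
    name: str
    mapq: int        # mapping quality of the primary alignment
    n_segments: int  # number of split (supplementary) segments


def passes_filter(read: Read, max_segments: int = 7, min_mapq: int = 20) -> bool:
    """Keep a read unless it has too many segments or too low a MAPQ."""
    return read.n_segments <= max_segments and read.mapq >= min_mapq


reads = [
    Read("ok", mapq=60, n_segments=2),
    Read("low_mapq", mapq=10, n_segments=1),
    Read("too_split", mapq=60, n_segments=8),
]
kept = [r.name for r in reads if passes_filter(r)]
print(kept)  # ['ok']
```

A read dropped at this stage contributes no signature, so a deletion supported mostly by such reads could go unreported.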

Regarding the Tumor sample, cuteSV determines the breakpoint of the deletion by calculating the mean of the start positions across the reads and applying floor to convert the floating-point mean to an integer. So it is possible that the breakpoint deviates by a few base pairs from the actual position. To draw a more accurate conclusion, I would need part of the alignment data to debug. Hope this helps!
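To see how a floored mean can move a breakpoint, here is a minimal Python sketch. It is not the cuteSV implementation; the function name and the coordinates are invented for illustration, and only the "mean, then floor" rule comes from the reply above.

```python
def floor_mean_breakpoint(starts: list[int]) -> int:
    """Floor of the mean start position (integer floor division)."""
    if not starts:
        raise ValueError("need at least one start position")
    return sum(starts) // len(starts)


# Hypothetical cluster: 18 reads agree on 1004, two reads are shifted.
clean = [1004] * 20
noisy = [1004] * 18 + [1020, 1030]

print(floor_mean_breakpoint(clean))  # 1004
print(floor_mean_breakpoint(noisy))  # 1006
```

Even though 90% of the reads agree exactly, the two shifted reads pull the mean from 1004 to 1006.1, which floors to 1006.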

Best, Shuqi

Akazhiel commented 2 years ago

Hello,

Yes, the command I use for both samples is the same:

cuteSV -t 50 -S CUTESV_Normal -s 2 -L -1 -md 5 --genotype --max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0. {}.bam {} CUTESV_Normal.vcf cutesv_normal/

I'm also attaching the reads from the region for Tumor and Control: check_normal.bam.gz, check_tumor.bam.gz

Best regards,

Jonatan

Akazhiel commented 1 year ago

Hello @Meltpinkg,

Is there any update on this? Did you have time to check it?

Best regards,

Jonatan

Meltpinkg commented 1 year ago

Hello, @Akazhiel

Sorry for the delay. I have run cuteSV on the two alignment files you provided, using your command and the latest code from cuteSV's GitHub repository.

In the normal file, cuteSV outputs the deletion:

chr14 104945514 cuteSV.DEL.2 ... PRECISE;SVTYPE=DEL;SVLEN=-495;END=104946009

In the tumor file, cuteSV outputs the deletion:

chr14 104945518 cuteSV.DEL.2 ... PRECISE;SVTYPE=DEL;SVLEN=-495;END=104946013

And it can be seen in IGV that the actual position of this deletion is 104945515. (image)

The picture below shows the problem. The deviation is caused by the reads that I colored. Although the other reads have a neat signal position, these two reads are slightly shifted. Because sequencing is error-prone, cuteSV does not require an exact match and allows some deviation during detection. These reads also have high mapping quality, so they are not filtered out. cuteSV takes all the reads into account when reporting the final SV position, so the position of this deletion is influenced by these two reads. This strategy performs well in most experiments, and its effect is usually small: in this example the position deviates by 1 bp and 3 bp. (image)

If you have any further doubt, please feel free to reply. Hope it will help.

Best regards, Shuqi

Akazhiel commented 1 year ago

Hello @Meltpinkg

Thanks for the reply. I see you fixed the problem with the detection of this deletion in the control sample. It is true that 1 bp or 3 bp shouldn't matter much in normal experiments. But for our use case we need the exact breakpoints of the deletion because we want to produce neoantigens from them, meaning that being even 1 bp off drastically changes the result we would get. We are filtering variants by the CIPOS field to make sure the ones we select are reported with exact breakpoints, but as you've seen in this example, that doesn't seem to be the case.

Is there any way by which we can overcome this challenge of the position being shifted by really few reads?

Best regards, Jonatan

Meltpinkg commented 1 year ago

Hello @Akazhiel

I understand that exact SV breakpoints are important in this case, and we had not considered it carefully in our tool. We plan to add a new parameter to address this, and we will release the new version within a few days, as soon as possible. Thanks for your kind advice and the dataset you provided.

Best regards, Shuqi

Meltpinkg commented 1 year ago

Hello @Akazhiel

We have added a new parameter, --remain_reads_ratio, which makes cuteSV filter out the reads in a cluster that are relatively far from the variant position. When the alignment data is of very high quality, this filters out noise and performs well. Otherwise, it may filter out some useful signatures. The default value is 1.0, which means all reads are considered. We recommend setting it lower when you need highly accurate positions from high-quality data. As shown in the picture, your dataset is very neat and of high quality. Therefore, in this case, you can set the parameter to 0.9, which means we use the closest 90% of reads to generate the SV position. With this setting, both normal and tumor get the right position. We have pushed the new version to GitHub. Please git clone the latest code, reinstall it and try again.
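The idea behind keeping only the closest fraction of reads can be sketched in Python as follows. This is an assumption-laden illustration, not the cuteSV implementation: the function name, the choice of the floored mean as the reference point for "closest", and the coordinates are all hypothetical. Only the meaning of the ratio (1.0 keeps every read, 0.9 keeps the closest 90%) comes from the reply above.

```python
def trimmed_breakpoint(starts: list[int], remain_reads_ratio: float = 1.0) -> int:
    """Floor-mean breakpoint computed from the closest fraction of reads.

    Assumption: closeness is measured against the floor-mean of all reads.
    """
    if not starts:
        raise ValueError("need at least one start position")
    if not 0.0 < remain_reads_ratio <= 1.0:
        raise ValueError("remain_reads_ratio must be in (0, 1]")
    center = sum(starts) // len(starts)
    # Rank reads by distance to the first estimate and keep the closest ones.
    ranked = sorted(starts, key=lambda s: abs(s - center))
    n_keep = max(1, round(len(ranked) * remain_reads_ratio))
    kept = ranked[:n_keep]
    return sum(kept) // len(kept)


# Hypothetical cluster: 18 reads agree on 1004, two reads are shifted.
starts = [1004] * 18 + [1020, 1030]

print(trimmed_breakpoint(starts, 1.0))  # 1006
print(trimmed_breakpoint(starts, 0.9))  # 1004
```

With noisier data the same trimming can discard genuine signatures, which is why a lower ratio is suggested only for high-quality alignments.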

Best regards, Shuqi

Akazhiel commented 1 year ago

Hello @Meltpinkg,

I just re-checked, and the problematic deletion I described in this issue seems to have been fixed. But in the BAM files I sent you for the normal and the tumor, there are two deletions of 495 bp. One of them, the one with the position problem, is detected in both the Normal and the Tumor. The other one, at chr14:104,948,835-104,949,329, is not detected in the Normal, and this was already the case before the new commit.

Would you be so kind as to check why this could be happening? Because on the IGV capture the deletion is really there on both but I'm unsure of the reason why it's not being detected in the Normal.

Best regards, Jonatan

Meltpinkg commented 1 year ago

Hello, @Akazhiel

The deletion is not detected in the normal sample because there are several deletion signatures between this 495 bp deletion and the next 1980 bp deletion, so they are all clustered into one group. Based on previous experiments, cuteSV is inclined to report the SV with the highest reliability, so if one read has two signatures in the same cluster, cuteSV keeps only the longest one. As a result, the 1980 bp deletion was kept and the 495 bp deletion was filtered out. To solve this, you can decrease the parameter max_cluster_bias_DEL, which is the maximum distance between signatures that are clustered into one group. Because your alignment data is very neat, you can decrease this parameter to about 500 or even lower to avoid the wrong clustering. I tried running the normal sample with --max_cluster_bias_DEL 300, and the result seems to be concordant with IGV.
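The interaction described above, distance-based clustering followed by keeping the longest signature per read, can be reproduced with a small Python sketch. It is a simplified model under stated assumptions and not the cuteSV implementation: the single-linkage clustering on start positions, the `Signature` type, the function names and all coordinates are hypothetical.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    read: str
    start: int
    length: int


def cluster_by_start(sigs, max_cluster_bias):
    """Chain signatures whose consecutive starts are within max_cluster_bias."""
    clusters = []
    for sig in sorted(sigs, key=lambda s: s.start):
        if clusters and sig.start - clusters[-1][-1].start <= max_cluster_bias:
            clusters[-1].append(sig)
        else:
            clusters.append([sig])
    return clusters


def longest_per_read(cluster):
    """Within one cluster, keep only the longest signature of each read."""
    best = {}
    for sig in cluster:
        if sig.read not in best or sig.length > best[sig.read].length:
            best[sig.read] = sig
    return list(best.values())


def surviving_lengths(sigs, max_cluster_bias):
    """Distinct deletion lengths left in each cluster after filtering."""
    return [
        sorted({s.length for s in longest_per_read(c)})
        for c in cluster_by_start(sigs, max_cluster_bias)
    ]


# Hypothetical reads: each carries a 495 bp and a 1980 bp deletion,
# and two reads carry small deletion signatures in between.
sigs = []
for read in ("r1", "r2", "r3"):
    sigs.append(Signature(read, 1000, 495))
    sigs.append(Signature(read, 2200, 1980))
sigs.append(Signature("r1", 1450, 40))
sigs.append(Signature("r2", 1850, 35))

print(surviving_lengths(sigs, 1000))  # [[1980]]
print(surviving_lengths(sigs, 300))   # [[495], [40], [35], [1980]]
```

With the larger bias the intermediate signatures bridge the two deletions into one cluster and only the 1980 bp signature survives per read; with the smaller bias the 495 bp deletion forms its own cluster and is kept.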

Best regards, Shuqi