tjiangHIT / cuteSV

Long read based human genomic structural variation detection with cuteSV
MIT License
253 stars 36 forks source link

Wrong position #43

Open Akazhiel opened 3 years ago

Akazhiel commented 3 years ago

Hello,

I've noticed that some variants are reported as PRECISE and with a PASS filter but when taking a look at the alignment in the IGV the position reported is wrong in comparison to the actual precision of the structural variant. image

As it can be observed the VCF file reports an insertion at position 143202282 in chromosome 1, while the actual position that can be observed is at 143202274 at chromosome 1. There's also the fact that the VCF states the reference is a T while in the position reported is an A, nonetheless in the actual position of the insertion the reference is indeed a T. Why could this be happening?

EDIT: I'd also like to mention there are variants reported that don't actually appear in the BAM files.

Best regards,

Jonatan

tjiangHIT commented 3 years ago

Hello @Akazhiel,

Thank you for your interest in cuteSV and use it.

This is not a wrong INS call. Before explaining this issue, let me introduce the production of the insertion breakpoint briefly. cuteSV firstly extracts the INS signatures from the BAM file and generates the SIGS files. After that, cuteSV sorts the INS signatures by position and performs clustering. Finally, cuteSV determines the final INS from each cluster and calculates the genotypes.

As you can see, the determination of the cluster is very important for producing the final INS call. cuteSV applies a parameter named _max_cluster_biasINS to decide which signatures will be taken into account. That is, the two INS signatures will be used for the same INS if their maximum distance less than the value. Usually, the default value of _max_cluster_biasINS is set as 100 for CLR/ONT and 1000 for CCS datasets respectively.

From your example, there are lots of INS signatures in the above chr1:143202274. For the limitation of the range of the window of the IGV is only 41 bp, I believe there must be other INS signatures in the positions larger than chr1:143202274 but less than the _max_cluster_biasINS. As a result, cuteSV reputes these signatures as sources for the generation of the final INS chr1:143202282. Due to there are enough reads to support this INS which also has a great variant quality, cuteSV gives PRECISE tag and PASS filter finally. Also, the base T on the chr1:143202282 is treated as the reference allele.

To solve this, you can decrease the _max_cluster_biasINS value to exclude few outliers. But this will sacrifice the recall especially for those calls whose signatures are extremely messy. Besides, there are indeed several SVs with messy signatures that have unintuitive breakpoints, which I guess are the calls you mentioned in EDIT at last.

Hope this answer can help you solve your confusion!

Best, Tao

Akazhiel commented 3 years ago

Hello @tjiangHIT,

I understand this max_cluster_bias_INS distance is measured in bp? If that is the case there are no other insertions in a range of 100bp from the one shown. That is why it seemed odd for cuteSV to place the INS in the wrong position when checking the alingments in the IGV. image

I think in this image it can be better appreciated how there are no other INS in the specified range. There's also the fact that the reference allele for the position reported by cuteSV is not a T but an A, or at least that's what the reference sequence says, it could be it says T since it's the complementary, I'm unsure as to how it handles all that. But, if you were to check the position at which the actual insertion happens which is at 143202274 you can observe that the reference allele in the sequence is actually a T.

As for the EDIT if you check the picture in this post, on the rightmost position you can see a DEL labeled as PRECISE but nonetheless it doesn't span the whole supposed deletion and this supposed deletion can be observed in both BAM files yet it's only reported in one of them. Could this be due to these BAMs being subsets since I don't have enough computation power to run a whole sample and thus not having enough supporting reads to call the mutation accurately?

Best regards, Jonatan

tjiangHIT commented 3 years ago

Hello @Akazhiel,

cuteSV extracts INS signature not only from the intra-alignment (there is an over 50 bp I in cigar field), but also from the inter-alignment (with 2 or more split alignments). And those INS signatures from split-alignments might be hard to see from IGV. So could you supply the detailed INS signatures at the range of 1000 bp around chr1:143202274 from the file named INS.sigs?

For the next point, I am sorry for my carelessness. Actually, cuteSV computes the final breakpoint of the INS from all signatures, and then it extracts the base from the reference genome on the location of breakpoint-1 as the final REF allele.

As for the deletion, there are respectively 13 and 8 reads at least support this deletion from the second shot-screen of IGV. Please check the value of the parameter _--minsupport used in your practice. The default value of _--minsupport is 10 and this will ignore the deletion call with about 8 supporting reads.

Best, Tao

Akazhiel commented 3 years ago

Hello @tjiangHIT,

Here it's attached the subset you asked for. INS_subset.txt

Regarding the reference allele, I'm still not sure I understand it since if I'm not wrong the breakpoint-1 is the first position that is reported for that variant in the VCF file, which in this case is 143202282, but still the REF allele from the reference genome in that position is still not a T but an A, so it is wrongly reported.

And for the --min-support I understand it now, but I still don't understand why the deletion variant didn't span the whole area that can be seen in the image but just a small part, I guess it has to be related to how the breakpoint is computed with the reads that support that variant.

Best regards, Jonatan

kpalanikannan commented 3 years ago

I agree with @Akazhiel . Whatever limitation it can be, even it can be false positive SV call. But, the position and ref allele should not be different. In my case, I have noticed it in all variants irrespective of signatures, deletion or insertion.

tjiangHIT commented 3 years ago

Hello @Akazhiel and @kpalanikannan,

Thank you for sharing the INS signatures. As you can see the Table below, there are 12 INS signatures with bold POS used for generating the final INS. The average POS is 143202281. Considering the breakpoint in VCF is 1-based, cuteSV output 143202282 as the final breakpoint for this INS. Due to the insertion appearing at 143202282, cuteSV treats the base on 143202281 from reference genome as the reference allele.

Type CHR POS LEN ID SEQ
INS chr1 143201977 49 DRR258589.4155776 AATTCCATCACCATCGAATGCAATCGAATGGAATCATCGAATCGACTCA
INS chr1 143202136 465 DRR258589.87960 TGGAAACCGAATGGAACTCACAATGGAAATGAAAATATCATCTAATGGAATCGCAATGGAATCATCAAATGGAATCGAATAATTAATCATCATCATCGAATTCGATGTGGAATCATTGAACAGAATTGAATGGAATCGTCATCGAATGAATTGAATGCAAATCATCGAATGGTTCTCGAATGGGAATCGTCTTCGAATGGAAAAGGAAATGGAATCATCGCATAGAATCGAAATAATAATAATGAAATGCAACTGAATGGTATCAACACCAAACGAGAAAACGGAATATCGAATGGAACGAAGAGAATCTTCGGAACGGACTCCGAAATGGAATCATCTAATGGAATGGAAATGGAATAATCCACATCATAAGGCAACGAATACTTGAATGGAATCGAATGGAATCATCGAATGACTCTTCGAATGAACAATCATTGAACGGAATCATGAATGGAATCATCATCG
INS chr1 143202233 157 DRR258589.4083242 ATTGAACAGAATTGAATGGAAATCGTCATCGAATAATCAGAATGCAATCATCGAATGGTCCGAATGGAAATCATCAACACAAATGGAAAAATGGAATTATCGAATGGAATCAAAGAGAATCATCGAACGGACCTGAATGGAATCATATCGAATGGTT
INS chr1 143202267 51 DRR258589.1418403 TTCTGGAATCGTCATCAATGAATTGAATGCAAAAATCATCCAATGGACTCA
INS chr1 143202271 186 DRR258589.2196802 CAATGAATCAGTCGAATGAATTGAATGCAATCATCGAATGTCTCGAATGGAATCATCTTCTAATGGAAAGGAATGGAATCATCACAGAATCGAATGGAAATAATATGGAATGGACTGAATGGAATCAACATCAAACGGAATCAAGTGGAATTATTGAATGGAATCGAAGAGAATCATCGAATGGAC
INS chr1 143202275 183 DRR258589.2277296 CATCGAATGAATCGAATGCAATCATCGAATGGTCTCGAATGGAATCATCAATGGAAAGGAATGGAATCATCGCATAGAATCGAATGGAATTATCATCGAATGGAATCGAATGGTAATCAACACCAAACGGAAAAACGGAATTGTCGAATGGAATCGAAGAGAATTCGAACGGACCCGAATGGA
INS chr1 143202275 189 DRR258589.5162266 CGTCATCAAATGAATTGAATGCAATCATCGAATGGTCTCGAATGGAATCATCTTCTAATGGAAAGGAATGGAATCATCGCATAGAATCGAATGGAATTATCATCGAATGGAATCGAATGGAATCAAACAAACGGAAAAACGGAATTATCAAATGGAATCGAAGAGAATCTTCGAACGGAACGGAATGGA
INS chr1 143202275 191 DRR258589.3365780 CGTCATCGAATGAATTGGAATGCAATCATCGAATGGTCTCGAATGGAATCATCTTCGAATGGAAAGGAATGGAATCATCACAGAATTGAATGGAATTAACATTGAATGGACTCGAATGGAATCAACATCAAACGGAATCAAAGTGTAATTATCGAATGGAATCGAATATAATCATCATGGACTCGAATGGA
INS chr1 143202278 183 DRR258589.4083242 AATCATCGAATGAATTGAATGGAATCATCGAATGGTCTCGAATGGAATCATTTCTAATGGAAAGAATATGGAATCATCGCATAGAATCGAATGGAATTATCATCGAATGGAATCGAATGGTATCAACACCAAACGGAAAAACAGAATTTTCGAATGAATCAAAGAGAATCTTCGAATGGACCC
INS chr1 143202278 184 DRR258589.1242385 GATCGAATGAATTGAATGCAATCATCAATATTCGAATGGAATCATCTTCAAATGGAAATGGAATGGAATCATCGTACAGAATCGAAAGAATTATCATCGAATGGAATCGAATGGTATCAACACCAAACGGAAAACGGAATCATCGAATGGAATCGAAGAGAATAATCGAATGGAATCGAATGGA
INS chr1 143202278 193 DRR258589.2700415 CGTCATCGAATGAATTGAATGCAATCATCGAATGGTCTCGTCGGAATCATCATCTAATAATGGAAAGGAATGGAATCATCACATAGAATCGAATGGAATTAACATTGAATGGAATTGAATGGAATCAACATCAAAGGAATCAAGCGGAATTATCGAATGGAATCAAGAGAATCATCGAATGGACTCGAATGGA
INS chr1 143202279 182 DRR258589.4400106 CATCATCGAATGAATTGAATGCAATCATCGAATGGTCTCGATCGGAATCATACTAATGGAAAGAATGGAATCATAGAATCGAATCGGAATTATCATCGAATGGAATCAGATAGTATCAACACCAAACGGAAAAAACAGGTATCGAATGAATCGAAGAAGAATCTTCGAACGGACCGAATGGA
INS chr1 143202279 191 DRR258589.3301887 CGTCATCGAATGAATTGAATGCAATCATCGAATGGTCTCAATGGAATCATCTTCAAATGGAAAGGAATGGAATCATCGAATAGAATCGAATGGAATTATCATCGATTGAAACGAATGGTATCAACATCAAACGGGAAAAACGGAATTGATGGAATGGAATCGAAGAGAATCTTCGAATGGACTCGAATGGA
INS chr1 143202279 193 DRR258589.2831092 CGTCATCGAATGAATTGGAATGCAATCATCGAATGGTCTCGAATGGAATCATCTATCAATGGAAAGGAATGGAATCATGCATAGAATCGAATGGAAATCAACATTGAATGGACTCGAATGGAATCAACATCAAACGGAATCAAATGGAATATCGAATGGAATCGAAGAGAATCATCGAATGGACTCGAATGGA
INS chr1 143202284 185 DRR258589.87960 GAATGAATCGAATGCAATCATCGAATGGTCTCGAATGGAATCATCTTCAAATGGAATGGAATGGAATATCGCATGAATCGAATGGACAATTCAATGAAATTTAAATCGAATGGAATCAAACATCAAACGGAAAACGGAATTAATGAATGGAATGAAGAGAATCATCGAATGGACTCGAATGGAAT
INS chr1 143202326 164 DRR258589.159181 TCGGAATAGAAATATAAACAGAATCATCAGAATCATCTAATTATACCGACTGAACTTAGCCTTCCAGTCAACAAATGGAAAACGGAATATTGATAATCAATATAACTCGAATAATCTGAATGGAACTATCCAATGGAATGAAAGCAATAACTATGACTAATAAA
INS chr1 143202859 419 DRR258589.4083242 GAATCGAATGGAACTTCGTTCACTCCATTTGGACAAAGGAGTCATCCAATGGAATTTGCAATAATCATCATAAATGGAATCGAATGGAATCAACATCAAATGGAATCAAATGGAATCATTGAACGGAATCGAATGGAATAATGAACAATGAATAAAATCATCGAATGGTCTCAATGGAATCATATTCATGAAATGAATGTGGAATGAATCATCAAGAATCATTAATTATCAATGAATGGAATCGAATGGAATCAACATCAAACGGAAAAACGGAATTATCGAATGGAATCATTATAATCATGAATGGACCTGAACAAAATCATTCAATGGAGTCAGAATGATCAATCACACTCGAATGCAATCATCGAATGGAATCAATGGAATCATCGAATGGACTCGAATGGAATAATCATTGAATC

Do you think the breakpoint and the REF allele of this INS should be 143202281 or 143202282? I have to say that this is a problem that has plagued me for a long time.

Best, Tao

Akazhiel commented 3 years ago

Hello @tjiangHIT,

Thanks for the quick replies and always taking the time to explain everything in a clear way. Your message actually helped to understand why the reference allele in the VCF is a T and not an A since as you can see in the IGV the reference for position 143202282 is an A.

So if I understood correctly, the base that is used as a REF in the VCF is actually the POS - 1 but still the REF position reported in the VCF is POS, in this case 143202282 and the end of the breakpoint is this same position. Maybe the REF position in the VCF should be the average POS that is returned by the pipeline and the end of the breakpoint be this same position since it's an insertion and it's also how it's reported in databases such as gnomAD.

Nonetheless, I still have doubts regarding the INS signatures since in this table the positions are different, but if you check the IGV screenshots I attached previously, you can see how most insertions are located in the same place. What could be the reason of this difference? Maybe it's just me misunderstanding insertions since I'm new to all the structural variation field.

tjiangHIT commented 3 years ago

Hello @Akazhiel,

I have reviewed your first IGV shot screen, there are over 20 reads containing the INS signatures. But cuteSV only applied about 12 signatures, which are scattered than those in IGV, to finish the INS calling. I think the key point might be that several reads containing signatures are with low mapping quality or many numbers of split-alignments, and cuteSV will discard those reads that have less than 20 mapping quality or over 7 split parts. Please check this from your IGV. Hope this can help you!

Best, Tao

Akazhiel commented 3 years ago

Hello @tjiangHIT,

Reading your answers I reach the conclusion that although you can see in the screenshots multiple insertions that are in the same position, this is not the position reported in the VCF file and that could be due to the reasons you have stated which is the possible low mapping quality and the split parts. How can I check the split parts in my alignment? Double-checking my alignment file and the insertion called, I can see there are way more than the minimum read support of 10 reads supporting the insertion at the same position yet the insertion is reported in a different position. Is it possible that IGV is actually displaying the insertions in a wrong way making it seem as if they are all in the same position?

Best regards, Jonatan

Meltpinkg commented 3 years ago

Hello @Akazhiel ,

First, you can have a look at the alignment file and in the 18th column, there will be a string started with 'SA' which represents the supplementary alignments. In this field, you can check the split parts of this read. Then, to check whether IGV displays the insertions in a wrong way, I recommand that you can calculate the true position of the insertion signatures manually from the alignment file. In the bam file, the 4th column represents the start position of the alignment and the 6th column(known as CIGAR) represents the alignment results. You can count the length of sequence in reference before the insertion signature(I) from the number of soft-clips and matches and deletions. After getting the true position of the insertion from bam file, you can check whether the position corresponds to the IGV screen.

Best, Shuqi