niu-lab / msisensor2

Microsatellite instability (MSI) detection for tumor only data.
GNU General Public License v3.0

Questions on count distribution from *_dis files #41

Open AlkaidWang opened 11 months ago

AlkaidWang commented 11 months ago

Dear Sir/Madam,

I checked the count distributions in the _dis files and compared them with the read counts from the BAM files. For a given MSI locus, I expected the sum of the count distribution to equal the total number of reads in the BAM file. However, when I checked the read counts in IGV, the sum from the _dis file was always lower than the total read count shown in IGV. Is there any filtering step applied when the reads are counted?

For example:

the MONO27 MSI site chr2 39573062 GTCTC 27[A] GAGTG T: 0 0 0 0 0 0 0 0 0 0 0 16 34 104 263 434 639 674 507 293 104 18 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The sum of the counts is 3091; however, IGV shows 3896 total reads at the start of the MONO27 site and 4341 at the end of the site.
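The comparison can be reproduced with a short Python sketch. It assumes only the numbers quoted in this post: the nonzero bins are copied from the T: line, the zero bins are left out because they do not change the sum, and the two depths are the values read manually in IGV. The variable names are illustrative only.

```python
# Nonzero bins of the T: distribution for MONO27, copied from the _dis line above.
# Zero-valued bins are omitted because they do not contribute to the sum.
nonzero_counts = [16, 34, 104, 263, 434, 639, 674, 507, 293, 104, 18, 4, 1]

dis_total = sum(nonzero_counts)

# Depths read manually in IGV at the start and end of the repeat.
igv_depth = {"start": 3896, "end": 4341}

for position, depth in igv_depth.items():
    missing = depth - dis_total
    print(f"{position}: IGV={depth} dis={dis_total} "
          f"missing={missing} ({missing / depth:.1%})")
```

By this count, the _dis file is missing 805 reads relative to the start of the site and 1250 relative to the end.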

I look forward to your reply. Thanks.

ZhaoDanOnGitHub commented 11 months ago

Hello, your email has been received!

observer2735 commented 11 months ago

Dear @AlkaidWang,

Thank you for your support and trust in our work. What you mention is a very noteworthy issue. When MSIsensor was being developed, the TCGA groups could only obtain reads about 100 bp long, so the longest span in the code is 110 bp. For example, for the locus you mentioned, MONO27, the window looks like this:

-110bp GTCTC 27[A] GAGTG +73bp

The 110 bp on the left comes from the MSIsensor setting of 110 bp (you can find it by searching for the keyword "MAX READ LENGTH" at Ding lab). The 73 bp on the right is 110 bp - 5 bp - 5 bp - 27 bp = 73 bp, counted from the start of this locus. The code is set up in this complicated way because, at the beginning of MSIsensor development, there was no easy way to obtain reads from the target region, and we could only use the "bam_fetch" function in the samtools library.
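The 73 bp figure above can be checked with a minimal Python sketch. The constant names are hypothetical and are not the identifiers used in the MSIsensor source; only the numbers come from this comment.

```python
# All names below are illustrative; only the values are taken from the thread.
MAX_SPAN = 110        # span limit described above, in bp
LEFT_FLANK = 5        # length of the left flank "GTCTC"
RIGHT_FLANK = 5       # length of the right flank "GAGTG"
REPEAT_LENGTH = 27    # 27[A] in the reference

# What remains of the span after both flanks and the repeat itself.
right_extension = MAX_SPAN - LEFT_FLANK - RIGHT_FLANK - REPEAT_LENGTH
print(right_extension)  # 73
```

Under this reading, reads longer than the 110 bp setting may fall outside the window that is examined, which is one possible reason for counts lower than those shown in IGV.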

Your maximum read length may be longer than 110 bp, which would explain why the output looks confusing. When MSIsensor was first released this was not a problem, because every user had to compile it from source code and could easily change the parameter. MSIsensor2, however, is distributed without source code, so it may not cover all the reads that you see in IGV.

I apologize for the inconvenience. As far as I know, the discrepancy between the read counts reported by MSIsensor2 and those displayed in IGV does not significantly affect the detection results of MSIsensor2. After you input the data, MSIsensor2 runs a step called "Normalization", which converts the input read counts into read frequencies.
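As a rough illustration of why the absolute read count matters less after such a step, the Python sketch below divides each bin by the total. This is only an assumption about what "Normalization" means here; the actual MSIsensor2 procedure is not described in this thread, and the function name `to_frequencies` is hypothetical.

```python
def to_frequencies(counts):
    """Convert per-length read counts into fractions of the total (assumed scheme)."""
    total = sum(counts)
    if total == 0:
        # No reads at this locus: return all zeros rather than dividing by zero.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Nonzero bins of the MONO27 distribution quoted earlier in this issue.
counts = [16, 34, 104, 263, 434, 639, 674, 507, 293, 104, 18, 4, 1]
freqs = to_frequencies(counts)

# If every bin had twice as many reads, the frequencies would be unchanged.
doubled = [2 * c for c in counts]
same_shape = all(abs(a - b) < 1e-12
                 for a, b in zip(freqs, to_frequencies(doubled)))
print(same_shape)
```

The frequencies stay the same only when reads are lost evenly across repeat lengths; if the uncounted reads were biased toward particular lengths, the shape of the distribution would change.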

We will fix this bug in the next release, and I hope it does not affect your experience in the meantime.

Thank you for your understanding. Wishing you a pleasant life and smooth work.

Yours, Ji

AlkaidWang commented 11 months ago

Got you. Thank you so much! What a good team and a useful tool!