gymreklab / GangSTR

A tool for profiling long STRs from short reads
GNU General Public License v2.0
80 stars 16 forks source link

ERROR: Not enough reads for [sample ID]. Please set insert size distribution manually. #86

Open shiyirong opened 4 years ago

shiyirong commented 4 years ago

Hello! There is an error when I apply GangSTR on 3079 samples, I am informed that there are no enough reads for some samples and I don’t know how to set insert size distribution manually. I attempt to remove this kind of samples from my bamlist, how can I select them without undergoing the whole GangSTR process.

[GangSTR-2.4.2] ProgressMeter: Loading read group id R18072932LD01-2967813 for sample R18072932LD01-2967813 [GangSTR-2.4.2] ERROR: Not enough reads for 086600D 92. Please set insert size distribution manually.

Thank you very much!

nmmsv commented 4 years ago

Hello, This error occurs when the region around your locus doesn't contain too many reads. If you're creating bamlets from WGS data, I suggest increasing the size of the region that you extract around the repeating region to avoid this error. If that's not an option, you can use these options to set insert size distribution manually:

You can calculate the insert size distribution manually per sample or for a group of samples before running GangSTR. Hope this helps, please let me know if you had any other questions.

Nima

shiyirong commented 4 years ago

Thank you for your reply!

Sorry I did not make myself clear. I wanted to remove these samples which would trigger this error and run GangSTR on the remaining samples, but I do not know how to find out these samples, as I think my samples are large enough and removing several samples would be fine.

For your second suggestion, should I calculate the insert size distribution for each sample and use the mean of their insertmean and insertsdev? I intend to apply GangSTR on over 3,000 samples in a single run. Would it be inaccurate if specifying a single insertmean and insertsdev?

Bests,

nmmsv commented 4 years ago

Oh I understand now. Unfortunately, I can't think of an easy way for flagging these samples. Maybe a wrapper script that calls GangSTR for each sample, and handles error cases can help in this case.

Ideally for each sample separately, and then enter a list of all insert size distributions as a comma separated parameter. --insertmean mean1,mean2,mean3 --insertsdev sdev1,sdev2,sdev3

You can also use a single parameter for all if you don't see a huge difference in insert size distributions among different samples.

nmmsv commented 4 years ago

Please let me know if you have any other questions, I will close this issue for now but you can email me (mousavi@ucsd.edu) or open another issue for other problems! Thanks! :)

claudiamoreau commented 3 years ago

Hi, I get a similar error along with BGZF header error when I use GangSTR on simulated reads (using wgsim and bwa to align as it was described in your article https://doi.org/10.1093/nar/gkz501). I simulated 5000bp before and after the STR. [E::bgzf_read_block] Invalid BGZF header at offset 1453 [E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes [GangSTR-2.5.0] ProgressMeter: Not enough reads extracted. Skipping locus.. I tried to set --insertmean and --insertsdev to different values, but it did not work. GangSTR runs perfectly on my real wgs bams. Thank you! Claudia

nmmsv commented 3 years ago

Hi Claudia, It seems like htslib is not able to read the bam file (I think that's what's happening given the bgzf error). Can you please try opening the bam files and viewing the content using samtools? samtools view file.bam or samtools view file.bam chrNN:POS-END to look at reads that overlap a specific position (i.e. the repeat region). Best, Nima

claudiamoreau commented 3 years ago

Hi Nima, Thanks for answering so fast. It looks like I have the chromosome and position in my simulations: TNRC6A_21_3525_3929_1:0:0_0:0:0_0 99 16 24623286 60 150M = 24623541 405 ACATCTGTAATCCCAGCTACTCGAGAGGCTGAGGCACGAGAATCGCTTGAGCCCAGGAGGCAGAAGTTGCAGTGAGCCGAGATCATGCCACTTAACTCCAGCCTGGACAACAGATCAAGACTCCATCACAATAAAATAAAATGAAATAAA 888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888 NM:i:1 MD:Z:127T22 MC:Z:150M AS:i:145 XS:i:62 TNRC6A_21_3525_3929_1:0:0_0:0:0_0 147 16 24623541 60 150M = 24623286 -405 ACCAACGTCTGTTCGGCAAGAAACTTTTAACAACGACAGAGCAGAATCCGCCGAGTGACTCGATGACAGGGTGCTATTTCTATCTCCATTTCCAAGATGGAGAAACTGAGGCAAGGAGCAGCTCATTAGGTCACTTGCTTAAGGGCACTC 888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888 NM:i:0 MD:Z:150 MC:Z:150M AS:i:150 XS:i:27 Claudia

nmmsv commented 3 years ago

Hi Claudia, It seems like the file format is correct. I'm not sure what may be causing the error. Is it possible for you to share the bam file and the region file and the reference genome that was used in your work? I can replicate the error on my end and see if I can come up with a fix. Best, Nima

claudiamoreau commented 3 years ago

Sure thanks. Sorry can't attach the files here I'll send you an email. The reference genome is the Homo_sapiens.GRCh37 build. Just to mention, when I run GangSTR on my real data, it works. My real data looks quite the same than the simulations: HWI-D00280:59:C589DANXX:7:2306:7371:82134 99 1 10000 15 109M1I15M = 10002 126 ATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTTAACCCTAACACTAACACTAACCCTAACCCTAACCATAACCCTAACCCTAACACTAACCCTAACCCAAACCCTAAC 68?3>>@3?<?=??:?==??=?????=?:???;:0=?=?<???=?4<0>=00==?=2/4@<2/>0>>###################################################### MC:Z:107M1I17M BD:Z:IIHJNOKMLKKLIKKJJLIJJIIKHJJIIKHKKJJLHKKJJLIKKJJLKJJJJLIKKJJHHLKJJHHLKJJIIIIIIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJKKKKKLLLMMNIIII MD:Z:48C10C5C18C16C12T9 PG:Z:MarkDuplicates RG:Z:MPS12302935-E05_1895_7 BI:Z:LLJLOOLPNLMMJNMKLMJNLKKLIMMJKLINLJKLJNMKLMJNLJLMNLMKLMJNMKLJINMKLJINMKLKKKKLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMNNNNNNNOOOPPPQQRLLLL NM:i:7 MQ:i:15 AS:i:87 XS:i:94 HWI-D00279:105:C477TANXX:3:2201:9063:23562 163 1 10000 27 111M14S = 10004 128 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCAACCCCTAACCATAACCCTAACCCTAACCCAAACAAAAACCCTACC ?@=??-><.,<-@??,?-2<???2??>??@??>.2<-;..2<>-.;2.?-<=2@>;:<A@=############################################################ MC:Z:105M1I19M BD:Z:JJKJLOJMKJJLHKJIILHKJIILHKJIILHKJIILHKJJLHHKJIILHKJIILHKJIILHKJIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJJJJJJJJJJJJKKKKKKKLLLMMNJJJJ MD:Z:0A38A27T11T1A8C20 PG:Z:MarkDuplicates RG:Z:MPS12302935-E05_1928_3 BI:Z:LLNMOOLPNMMMJNMLLMJNMLLMJNMLLLINLKKLJNMLMJJNMLLMJNMLLMJNMLLMJNMLLKKKKKKLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMMMNNNNNNNNOOOOPPPQQRLLLL NM:i:6 MQ:i:27 AS:i:85 XS:i:84 Claudia

Lionward commented 1 year ago

Sure thanks. Sorry can't attach the files here I'll send you an email. The reference genome is the Homo_sapiens.GRCh37 build. Just to mention, when I run GangSTR on my real data, it works. My real data looks quite the same than the simulations: HWI-D00280:59:C589DANXX:7:2306:7371:82134 99 1 10000 15 109M1I15M = 10002 126 ATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTTAACCCTAACACTAACACTAACCCTAACCCTAACCATAACCCTAACCCTAACACTAACCCTAACCCAAACCCTAAC 68>@3???:?==??=?????=?:???;:0==00==?=2/4@<>2/>0>>###################################################### MC:Z:107M1I17M BD:Z:IIHJNOKMLKKLIKKJJLIJJIIKHJJIIKHKKJJLHKKJJLIKKJJLKJJJJLIKKJJHHLKJJHHLKJJIIIIIIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJKKKKKLLLMMNIIII MD:Z:48C10C5C18C16C12T9 PG:Z:MarkDuplicates RG:Z:MPS12302935-E05_1895_7 BI:Z:LLJLOOLPNLMMJNMKLMJNLKKLIMMJKLINLJKLJNMKLMJNLJLMNLMKLMJNMKLJINMKLJINMKLKKKKLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMNNNNNNNOOOPPPQQRLLLL NM:i:7 MQ:i:15 AS:i:87 XS:i:94 HWI-D00279:105:C477TANXX:3:2201:9063:23562 163 1 10000 27 111M14S = 10004 128 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCAACCCCTAACCATAACCCTAACCCTAACCCAAACAAAAACCCTACC ?@=??-><.,<-@??,?-2??2????@??>.2<-;..2<>-.;2.?-<=2@>;:<A@=############################################################ MC:Z:105M1I19M BD:Z:JJKJLOJMKJJLHKJIILHKJIILHKJIILHKJIILHKJJLHHKJIILHKJIILHKJIILHKJIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJJJJJJJJJJJJKKKKKKKLLLMMNJJJJ MD:Z:0A38A27T11T1A8C20 PG:Z:MarkDuplicates RG:Z:MPS12302935-E05_1928_3 BI:Z:LLNMOOLPNMMMJNMLLMJNMLLMJNMLLLINLKKLJNMLMJJNMLLMJNMLLMJNMLLMJNMLLKKKKKKLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMMMNNNNNNNNOOOOPPPQQRLLLL NM:i:6 MQ:i:27 AS:i:85 XS:i:84 Claudia

I have the same problem with simulation data, did you manage to solve this problem?