gymreklab / GangSTR

A tool for profiling long STRs from short reads
GNU General Public License v2.0

Expansions detection #95

Closed by leorebensabath 3 years ago

leorebensabath commented 3 years ago

Hi GangSTR team,

I want to use GangSTR to detect repeat expansions and I have a few questions.

chr4 3074877 3074933 100

But when I run GangSTR, I get this warning: WARNING: Unknown STR info column detected... 100

In the output VCF file, all 3 values of the QEXP field are -1, even at the locus listed in the str-info file. Could you tell me what is wrong with the formatting of my str-info file?

Thanks!

nmmsv commented 3 years ago

Hello, I'll try to answer your questions one by one:

Please let me know if you have additional questions! Best, Nima

leorebensabath commented 3 years ago

Thanks a lot, this is very helpful! Just one more question, about running time: I have a whole-genome BAM file, and GangSTR took about 1h30 to 2h to process chromosome 1. Is that a typical running time? Are there any good practices to reduce it?

nmmsv commented 3 years ago

I don't remember the running time on chr1 off the top of my head, but that sounds reasonable to me. For a full whole-genome BAM we get about 25 hours of running time on average on a single core.

Unfortunately, we haven't implemented any parallelization in the method yet, but you can "manually" parallelize the run if you want. For whole-genome runs, you can use the --chrom flag to run each chromosome separately (one GangSTR call per chromosome) and then merge the results using mergeSTR. To speed up a single chromosome, you can split the input BED file into smaller chunks, run each chunk separately, and merge the outputs. I know, it's not the most elegant solution :D

Please let me know if you have any other questions!
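The manual parallelization described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not part of GangSTR: the helper names (`split_bed_lines`, `per_chrom_commands`, `run_all`) are hypothetical, and apart from --chrom, which is mentioned above, the flags used here (--bam, --ref, --regions, --out) are assumed from typical GangSTR usage and should be checked against `GangSTR --help` for your version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_bed_lines(lines, n_chunks):
    """Split BED records into at most n_chunks contiguous, near-equal chunks.

    Blank lines and lines starting with '#' are skipped.
    """
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    if not records:
        return []
    n_chunks = max(1, min(n_chunks, len(records)))
    base, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional record.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def per_chrom_commands(bam, ref, regions, out_prefix, chroms):
    """Build one GangSTR command per chromosome.

    Assumption: flag names other than --chrom are not confirmed by this
    thread; verify them against your installed GangSTR.
    """
    return [
        ["GangSTR", "--bam", bam, "--ref", ref, "--regions", regions,
         "--chrom", chrom, "--out", f"{out_prefix}.{chrom}"]
        for chrom in chroms
    ]


def run_all(commands, workers=4):
    """Run the commands concurrently; return exit codes in input order."""
    def run(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

For example, `run_all(per_chrom_commands("sample.bam", "ref.fa", "regions.bed", "out", ["chr1", "chr2"]), workers=2)` would start two GangSTR processes side by side. Threads are enough here because each worker only waits on an external process. The per-chromosome (or per-chunk) VCFs still need to be merged afterwards with mergeSTR, whose invocation is not shown.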