gymreklab / GangSTR

A tool for profiling long STRs from short reads
GNU General Public License v2.0
80 stars 16 forks source link

GangSTR fails on small input files #103

Open bkohrn opened 3 years ago

bkohrn commented 3 years ago

I am part of a team doing targeted sequencing of PolyG loci. We have previously been using LobSTR for genotyping polyG sequences from reads prior to a consensus making step (see https://github.com/risqueslab/PolyG-DS). I was investigating GangSTR on the recommendation of one of our collaborators to see if we might want to switch to using it. However, I keep running into the following message (for most of our loci):

[GangSTR-2.5.0] ProgressMeter: Processing chr1:46990539
[GangSTR-2.5.0] ProgressMeter:  Setting flanking regions
[GangSTR-2.5.0] ProgressMeter:  Loading read data
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..

and the .readinfo.tab file is empty. I know that these input files generated good polyG data using lobSTR, so I was wondering if you had any idea what might be happening and how we might avoid it happening in the future. An example line from our regions file (hg19-based) is: chr1 46990539 46990839 1 C

nmmsv commented 3 years ago

Hello, It seems like GangSTR is rejecting all of the reads that are picked up from that region. Can you try changing the parameters for local realignment to see if they make a difference on the output? For example, --minmatch is the minimum number of bases surrounding the repeat region that must match the flanking sequence around the repeat. By default it's set to 5, but sometimes around runs of homopolymers the sequencing error rate is very high and that may lead to rejecting the reads. Please let me know if that didn't help. Best, Nima

bkohrn commented 3 years ago

Reducing --minmatch doesn't seem to have helped (in this case, at least); I'm still getting an empty file for the .readinfo.tab file. My command looks like this right now:

 ~/bioinformatics/programs/GangSTR-2.5.0-Source/build/GangSTR \
--bam ${inBaseName}.sort.bam \
--targeted \
--ref ${inRef} \
--regions ${inReg} \
--out ${inBaseName}.gangSTR8 \
--output-readinfo \
--verbose \
--insertmean 1 \
--insertsdev 1 \
--minmatch 1 \
--min-sample-reads 1 \
--max-proc-read 1000000 \
--output-readinfo \
--output-bootstraps \
--read-prob-mode \
--verbose --very 

I also tried --minmatch 3, but it didn't output anything either. Another (possibly related) issue is that the VCF file has "." for every row in the SAMPLE column, and the only things it seems to be picking up on are whole reads where the whole read is the same base (and which reads would normally be filtered out as being obviously bad).

This data originates from pools of CRISPR-based cut sites, if that helps any in figuring out what happened.

nmmsv commented 3 years ago

Humm I'm not sure what the issue may be. GangSTR has been mostly used/tested on paired-end PCR free WGS datasets. I haven't worked with data generated from CRISPR-based cut site, but I assume several of the assumptions made by GangSTR may not hold in them. Having . in the sample column is expected since GangSTR isn't able to make any calls for any samples. Another potential problem is that GangSTR focuses on genotyping long TRs using the context derived from paired-end reads. So for genotyping variations that fit within a single read (and mutations around it), HipSTR is probably a more suitable method. Hope this helps. Best, Nima

On Wed, Mar 3, 2021 at 11:01 AM bkohrn notifications@github.com wrote:

Reducing --minmatch doesn't seem to have helped (in this case, at least); I'm still getting an empty file for the .readinfo.tab file. My command looks like this right now:

~/bioinformatics/programs/GangSTR-2.5.0-Source/build/GangSTR \ --bam ${inBaseName}.sort.bam \ --targeted \ --ref ${inRef} \ --regions ${inReg} \ --out ${inBaseName}.gangSTR8 \ --output-readinfo \ --verbose \ --insertmean 1 \ --insertsdev 1 \ --minmatch 1 \ --min-sample-reads 1 \ --max-proc-read 1000000 \ --output-readinfo \ --output-bootstraps \ --read-prob-mode \ --verbose --very

I also tried --minmatch 3, but it didn't output anything either. Another (possibly related) issue is that the VCF file has "." for every row in the SAMPLE column, and the only things it seems to be picking up on are whole reads where the whole read is the same base (and which reads would normally be filtered out as being obviously bad).

This data originates from pools of CRISPR-based cut sites, if that helps any in figuring out what happened.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/gymreklab/GangSTR/issues/103*issuecomment-789978999__;Iw!!Mih3wA!TvuoGaXe75rEIGF8ArKOt9ZQs8-I-Li6qrfWevyiWUWYnFuQuM6KkR0IRVN66mo$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ADSOQCDP7QU362DNAKWRUHLTB2BPTANCNFSM4YPT3BUQ__;!!Mih3wA!TvuoGaXe75rEIGF8ArKOt9ZQs8-I-Li6qrfWevyiWUWYnFuQuM6KkR0ItDZn534$ .