Issue with low proportion_repeat

mhguo1 commented 2 years ago

Hello, I'm getting odd errors when using lower --proportion-repeat in strling extract. For example, when running this command:

strling extract -f GRCh38_full_analysis_set_plus_decoy_hla.fa -p 0.4 NA06989.final.cram test.bin

I get this error: fatal.nim(49) sysFatal Error: unhandled exception: genome_strs.nim(39, 12) w.start < w.stop repeat ACCTTT not found in expected region for (chrom: "chr2", start: 71388200, stop: 71388500, repeat: "ACCTTT"), TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCGACCTTTGATCATCCGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCTGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA, (chrom: "chr2", start: 71388500, stop: 71388500, repeat: "ACCTTT") [AssertionDefect]

strling version: 0.5.1 fatal.nim(49) sysFatal Error: unhandled exception: unpack.nim(59, 12) fs != nil [strling] got nil fileStream in unpack_file. check given file-path [AssertionDefect]

I am using a 1000 Genomes sample downloaded from ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239459/NA06989.final.cram

Thanks!

brentp commented 2 years ago

Thanks for reporting. Using the lower --proportion-repeat does indeed uncover some problems. I can recreate and pushed a fix. We'll need to evaluate the change in performance. I also have another idea to try to fix which I'll post shortly. Note that this is an interesting read with this repeat structure (split on ACCTTT):

TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCG
ACCTTT
GATCATCCG
ACCTTT
GATCATCCG
ACCTTT
GATCATCTG
ACCTTT
GATCATCTG
ACCTTT
GATCATCCG
ACCTTT
GATCATCTG
ACCTTT
GATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA

brentp commented 2 years ago

BTW, I'm interested to hear what @hdashnow thinks, but my experience is that lowering proportion_repeat is usually not fruitful.

hdashnow commented 2 years ago

Thanks @brentp

@mhguo1 I'm curious why you'd like to use proportion repeat = 0.4? This is extremely low, which is why we never tested this. If you're having trouble recovering a specific locus maybe we can address that in a different way?

mhguo1 commented 2 years ago

Hi all, thank you for looking into this! Perhaps I'm not fully understanding this parameter well. I'm interested in looking at milder expansions and felt that the default value of -p 0.8 was too high so was playing around with it. If I understand this correctly, with a 150 bp PE read, a value of 0.8 for a trinucleotide repeat expansion would only consider expansions past 40 repeats (150*0.8/3).

hdashnow commented 2 years ago

@mhguo1 You are correct, STRling is tuned to discover larger expansions. Are you looking to discover new loci, or call known ones? If you are looking at a specific locus you can force STRling to genotype it even if it is small.

mhguo1 commented 2 years ago

I'm mostly hoping to call known ones, so I guess I will just add a custom bed file with known sites to -l to the call command. Thanks!

hdashnow commented 2 years ago

Great! Let us know if you have additional questions, or see any other errors. It was useful to get this one fixed.

quinlan-lab / STRling

Issue with low proportion_repeat #102