Open mhguo1 opened 2 years ago
Thanks for reporting. Using the lower --proportion-repeat does indeed uncover some problems. I can recreate and pushed a fix. We'll need to evaluate the change in performance. I also have another idea to try to fix which I'll post shortly. Note that this is an interesting read with this repeat structure (split on ACCTTT):
TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCG
ACCTTT
GATCATCCG
ACCTTT
GATCATCCG
ACCTTT
GATCATCTG
ACCTTT
GATCATCTG
ACCTTT
GATCATCCG
ACCTTT
GATCATCTG
ACCTTT
GATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA
BTW, I'm interested to hear what @hdashnow thinks, but my experience is that lowering proportion_repeat is usually not fruitful.
Thanks @brentp
@mhguo1 I'm curious why you'd like to use proportion repeat = 0.4? This is extremely low, which is why we never tested this. If you're having trouble recovering a specific locus maybe we can address that in a different way?
Hi all, thank you for looking into this! Perhaps I'm not fully understanding this parameter well. I'm interested in looking at milder expansions and felt that the default value of -p 0.8 was too high so was playing around with it. If I understand this correctly, with a 150 bp PE read, a value of 0.8 for a trinucleotide repeat expansion would only consider expansions past 40 repeats (150*0.8/3).
@mhguo1 You are correct, STRling is tuned to discover larger expansions. Are you looking to discover new loci, or call known ones? If you are looking at a specific locus you can force STRling to genotype it even if it is small.
I'm mostly hoping to call known ones, so I guess I will just add a custom bed file with known sites to -l to the call command. Thanks!
Great! Let us know if you have additional questions, or see any other errors. It was useful to get this one fixed.
Hello, I'm getting odd errors when using lower --proportion-repeat in strling extract. For example, when running this command:
strling extract -f GRCh38_full_analysis_set_plus_decoy_hla.fa -p 0.4 NA06989.final.cram test.bin
I get this error: fatal.nim(49) sysFatal Error: unhandled exception: genome_strs.nim(39, 12)
w.start < w.stop
repeat ACCTTT not found in expected region for (chrom: "chr2", start: 71388200, stop: 71388500, repeat: "ACCTTT"), TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCGACCTTTGATCATCCGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCTGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA, (chrom: "chr2", start: 71388500, stop: 71388500, repeat: "ACCTTT") [AssertionDefect]strling version: 0.5.1 fatal.nim(49) sysFatal Error: unhandled exception: unpack.nim(59, 12)
fs != nil
[strling] got nil fileStream in unpack_file. check given file-path [AssertionDefect]I am using a 1000 Genomes sample downloaded from ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239459/NA06989.final.cram
Thanks!