Closed HLHsieh closed 2 weeks ago
Hi Hsin,
Glad it worked out!
1: You can definitely use LongTR for longer repeats, I tested LongTR for repeats up to 10k bp. 2: Whenever there is a nearby SNP (default window size 5bp), LongTR would increase the boundaries of the repeat to include the SNP. You can access the original START and END information in the INFO fields. You can change this behaviour by specifying --indel-flank-len 0. LongTR will not directly output the copy number, but you can always compute the approximate copy number (it wouldn't be exact when repeats are imperfect) by dividing allele length by motif length. LongTR also reports observed read lengths in ALLREADS format field. In your example, ALLREADS is 0|1;77|1;78|1, indicating that one of the reads had reference allele length, one had a 77bp insertion and the last one had a 78bp insertion. Note that this is the lengths observed in raw reads, before any analysis.
Let me know if you have further questions.
Best, Helia
Hi Helia,
Thank you again. I have to say the experience of using LongTR was amazing! I have some follow-up questions:
That was perfect for me. I tested a disease sample with a 3.3kb TR. I understood that I needed to set --max-tr-len
(maximum length of TR that will be genotyped by LongTR). I would like to clarify: does the maximum length of TR refer to the length of a single TR or the whole insertion or deletion? I suppose it should be the length of a single TR, and I should specify --max-tr-len
with a value greater than 3.3kb.
If I disable the inclusion of SNPs by specifying --indel-flank-len 0
, will the performance of genotyping be affected?
It was perfect for me to get an approximate copy number. Here is an example as above:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT C9ORF72_1
chr9 27573524 C9orf72 GCCCCGGCCCCGGCCCCGGCCCCTAG GGCCCCGGCCCCGGCCCCGGCCCCAG,GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCTATA . . START=27573529;END=27573546;PERIOD=6;NSKIP=0;NFILT=0;INEXACT_ALLELE=0,0;BPDIFFS=0,77;DP=3;DSNP=0;DFLANKINDEL=0;AN=2;REFAC=0;AC=1,1 GT:GB:Q:PQ:DP:DSNP:DFLANKINDEL:PDP:PSNP:GLDIFF:ALLREADS:MALLREADS 1|2:0|77:1.00:0.50:3:0:0:0|0:0|0:9.47:0|1;77|1;78|1:0|1;77|2
(26+77)/26
equaling ~ 3.96 copies, or 77/26 + 3
(provided in the BED file) equaling ~ 5.96 copies? Any suggestions on this calculation?I have BAM files from Minimap2, but I noticed you suggested generating BAM files using BWA-MEM. Have you tested any files from Minimap2?
Additionally, I found some typos, but they did not significantly affect usage of LongTR:
--max-str-len
option. I believe it should be --max-tr-len
. Here is the related output:
Detected 1 BAM/CRAM files
User-specified read groups for 1 unique sample(s)
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/reference/myDefinedRepeat_LongTR_chr4.bed
Region file contains 1 regions
Processing region chr4 190066140 190092504 Skipping region as the reference allele length exceeds the threshold (26364 vs 1000) You can increase this threshold using the --max-str-len option
------HipSTR Execution Summary------ Skipped 1 loci whose lengths were above the maximum threshold. If this is a sizeable portion of your loci, see the --max-str-len command line option Genotyping succeeded for 0/0 loci
Approximate timing breakdown BAM seek time = 0 seconds Read filtering = 0 seconds SNP info extraction = 0 seconds Genotyping = 0 seconds Trimming alignment = 0 seconds Haplotype generation = 0 seconds Haplotype alignment = 0 seconds LongTR execution finished: Total runtime = 0.044359 sec
Thank you for your support and for creating such an excellent tool!
Best regards,
Hsin
Hi Hsin,
I apologize for my delayed response :).
1: --max-tr-len is applied to the length of the repeat in your reference set. So for example, if you set --max-tr-len 3000, and you have a repeat with length 2500bp in your reference set in which your sample holds a 1500bp insertion (in total more than 3000bp), LongTR will genotype it.
2: no
3: 0 is exactly the allele appearing in the reference allele field (26bp). I think for the copy number you should calculate (26 + 77) / 6, as your motif is 6bp.
4: Minimap2 should work too.
5: Thank you for pointing out the typos! I really appreciate it! I will fix them!
Please let me know if you have further questions/comments.
Best, Helia
P.S. if you pull the latest version, please remove --skip-assembly from your set of parameters as it is now the default behavior.
Hi Helia,
As you pointed out, you have tested LongTR for repeats up to 10k bp. I would like to know how you set the parameters for successful detection. I have a GGCCC repeat up to 20kb at a sequence depth of 50x and tried a similar command for the shorter repeat as follows:
LongTR --bams C9ORF72.sorted.bam --fasta genome.fa --regions test.bed --tr-vcf test.vcf.gz --bam-samps C9ORF72 --bam-libs C9ORF72 --min-mean-qual -1 --max-tr-len 500000 --skip-assembly
Its log shows :
Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/reference/test.bed
Region file contains 1 regions
Processing region chr9 27573528 27573546
202 reads overlapped region, of which
0 were hard clipped
0 had an 'N' base call
0 had low MAPQ
0 had low base quality scores
3 did not span the STR
0 did not have a unique mapping
167 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 167 reads and 0 out of 1 samples
Trimming reads
Failed to trim align 167 out of 167 reads
Generating candidate haplotypes
ACCACGCCCC GGCCCCGGCCCCGGCCCC TAGCGCGCGA
Added 0 inexact haplotypes generated by POA
Aligning reads to each candidate haplotype
Allele counts
GGCCCCGGCCCCGGCCCC 0
Filtered 1 sample genotypes for the following reasons: 1=NO_READS
Locus timing:
BAM seek time = 0.009893 seconds
Read filtering = 0.108375 seconds
SNP info extraction = 0.003844 seconds
Stutter estimation = 3e-06 seconds
Genotyping = 0.070579 seconds
Trim alignment = 0.070193 seconds
Haplotype generation = 0.000128 seconds
Haplotype alignment = 3.1e-05 seconds
------HipSTR Execution Summary------
Genotyping succeeded for 1/1 loci
Approximate timing breakdown
BAM seek time = 0.009893 seconds
Read filtering = 0.108375 seconds
SNP info extraction = 0.003844 seconds
Genotyping = 0.070579 seconds
Trimming alignment = 0.070193 seconds
Haplotype generation = 0.000128 seconds
Haplotype alignment = 3.1e-05 seconds
LongTR execution finished: Total runtime = 1.0596 sec
-----------------
And it VCF shows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT C9ORF72_5_10R_NanoSim_50x
chr9 27573529 C9ORF72 GGCCCCGGCCCCGGCCCC . . . START=27573529;END=27573546;PERIOD=6;NSKIP=0;NFILT=0;INEXACT_ALLELE=.;DP=0;DSNP=0;DFLANKINDEL=0;AN=0;REFAC=0 GT:GB:Q:PQ:DP:DSNP:DFLANKINDEL:PDP:PSNP:GLDIFF:ALLREADS:MALLREADS
I am wondering why genotyping succeeded for 1/1 loci, but the VCF did not show the corresponding results, and how to make the trimming alignment succeed since I found all the reads failing at the trim align stage. Do you have any other suggestions on this kind of analysis?
Many thanks, Hsin
Hi Helia,
Thank you for your suggestions. All issues have been resolved. I have a few general questions as follows:
As I understand, LongTR is designed to genotype both STRs and VNTRs. I am wondering if I can use LongTR to detect tandem repeats with each repeat length of more than 100bp from ONT data.
I am trying to detect my data with this BED file:
The output is
I appreciate your help and look forward to your insights on these questions.
Best regards, Hsin