gymrek-lab / LongTR

Tandem repeat genotyping with long reads
GNU General Public License v2.0
19 stars 0 forks source link

Question about output and usage #13

Closed HLHsieh closed 2 weeks ago

HLHsieh commented 3 weeks ago

Hi Helia,

Thank you for your suggestions. All issues have been resolved. I have a few general questions as follows:

  1. As I understand, LongTR is designed to genotype both STRs and VNTRs. I am wondering if I can use LongTR to detect tandem repeats with each repeat length of more than 100bp from ONT data.

  2. I am trying to detect my data with this BED file:

    chr9    27573529        27573546        6       3       C9orf72

    The output is

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  C9ORF72_1
    chr9    27573524    C9orf72 GCCCCGGCCCCGGCCCCGGCCCCTAG  GGCCCCGGCCCCGGCCCCGGCCCCAG,GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCTATA  .   .   START=27573529;END=27573546;PERIOD=6;NSKIP=0;NFILT=0;INEXACT_ALLELE=0,0;BPDIFFS=0,77;DP=3;DSNP=0;DFLANKINDEL=0;AN=2;REFAC=0;AC=1,1  GT:GB:Q:PQ:DP:DSNP:DFLANKINDEL:PDP:PSNP:GLDIFF:ALLREADS:MALLREADS   1|2:0|77:1.00:0.50:3:0:0:0|0:0|0:9.47:0|1;77|1;78|1:0|1;77|2
    • I am curious as to why the position started from 27573524 instead of 27573529, which I defined in the BED file.
    • Besides the genotyping results, I am also interested in knowing whether it is possible to provide information on the number of copies per allele/read. Specifically, in my case, I would like to know the number of GGCCCC repeats in each allele, and if possible, in each read.

I appreciate your help and look forward to your insights on these questions.

Best regards, Hsin

heliziii commented 3 weeks ago

Hi Hsin,

Glad it worked out!

1: You can definitely use LongTR for longer repeats, I tested LongTR for repeats up to 10k bp. 2: Whenever there is a nearby SNP (default window size 5bp), LongTR would increase the boundaries of the repeat to include the SNP. You can access the original START and END information in the INFO fields. You can change this behaviour by specifying --indel-flank-len 0. LongTR will not directly output the copy number, but you can always compute the approximate copy number (it wouldn't be exact when repeats are imperfect) by dividing allele length by motif length. LongTR also reports observed read lengths in ALLREADS format field. In your example, ALLREADS is 0|1;77|1;78|1, indicating that one of the reads had reference allele length, one had a 77bp insertion and the last one had a 78bp insertion. Note that this is the lengths observed in raw reads, before any analysis.

Let me know if you have further questions.

Best, Helia

HLHsieh commented 3 weeks ago

Hi Helia,

Thank you again. I have to say the experience of using LongTR was amazing! I have some follow-up questions:

  1. That was perfect for me. I tested a disease sample with a 3.3kb TR. I understood that I needed to set --max-tr-len (maximum length of TR that will be genotyped by LongTR). I would like to clarify: does the maximum length of TR refer to the length of a single TR or the whole insertion or deletion? I suppose it should be the length of a single TR, and I should specify --max-tr-len with a value greater than 3.3kb.

  2. If I disable the inclusion of SNPs by specifying --indel-flank-len 0, will the performance of genotyping be affected?

  3. It was perfect for me to get an approximate copy number. Here is an example as above:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  C9ORF72_1
    chr9    27573524    C9orf72 GCCCCGGCCCCGGCCCCGGCCCCTAG  GGCCCCGGCCCCGGCCCCGGCCCCAG,GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCTATA  .   .   START=27573529;END=27573546;PERIOD=6;NSKIP=0;NFILT=0;INEXACT_ALLELE=0,0;BPDIFFS=0,77;DP=3;DSNP=0;DFLANKINDEL=0;AN=2;REFAC=0;AC=1,1  GT:GB:Q:PQ:DP:DSNP:DFLANKINDEL:PDP:PSNP:GLDIFF:ALLREADS:MALLREADS   1|2:0|77:1.00:0.50:3:0:0:0|0:0|0:9.47:0|1;77|1;78|1:0|1;77|2
    • One of the reads 0|1 had a reference allele length. I would like to confirm whether this reference allele length refers to the GCCCCGGCCCCGGCCCCGGCCCCTAG (26bp), which was detected by LongTR or the sequence GGCCCCGGCCCCGGCCCC (18bp), which I defined in the BED file.
    • Other reads had 77bp and 78bp insertions, respectively. I suppose these are relative to the reference allele. To compute the copy number, should I calculate (26+77)/26 equaling ~ 3.96 copies, or 77/26 + 3 (provided in the BED file) equaling ~ 5.96 copies? Any suggestions on this calculation?
  4. I have BAM files from Minimap2, but I noticed you suggested generating BAM files using BWA-MEM. Have you tested any files from Minimap2?

Additionally, I found some typos, but they did not significantly affect usage of LongTR:

  1. In the FORMAT fields under Readme, there are two DSNP fields, but only one appears in the actual output.
  2. I encountered an issue suggesting I set the --max-str-len option. I believe it should be --max-tr-len. Here is the related output:
    
    Detected 1 BAM/CRAM files
    User-specified read groups for 1 unique sample(s)
    Reading region file /scratch/kinfai_root/kinfai0/hsinlun/reference/myDefinedRepeat_LongTR_chr4.bed
    Region file contains 1 regions

Processing region chr4 190066140 190092504 Skipping region as the reference allele length exceeds the threshold (26364 vs 1000) You can increase this threshold using the --max-str-len option

------HipSTR Execution Summary------ Skipped 1 loci whose lengths were above the maximum threshold. If this is a sizeable portion of your loci, see the --max-str-len command line option Genotyping succeeded for 0/0 loci

Approximate timing breakdown BAM seek time = 0 seconds Read filtering = 0 seconds SNP info extraction = 0 seconds Genotyping = 0 seconds Trimming alignment = 0 seconds Haplotype generation = 0 seconds Haplotype alignment = 0 seconds LongTR execution finished: Total runtime = 0.044359 sec



Thank you for your support and for creating such an excellent tool!

Best regards,
Hsin
heliziii commented 2 weeks ago

Hi Hsin,

I apologize for my delayed response :).

1: --max-tr-len is applied to the length of the repeat in your reference set. So for example, if you set --max-tr-len 3000, and you have a repeat with length 2500bp in your reference set in which your sample holds a 1500bp insertion (in total more than 3000bp), LongTR will genotype it.

2: no

3: 0 is exactly the allele appearing in the reference allele field (26bp). I think for the copy number you should calculate (26 + 77) / 6, as your motif is 6bp.

4: Minimap2 should work too.

5: Thank you for pointing out the typos! I really appreciate it! I will fix them!

Please let me know if you have further questions/comments.

Best, Helia

P.S. if you pull the latest version, please remove --skip-assembly from your set of parameters as it is now the default behavior.

HLHsieh commented 1 week ago

Hi Helia,

As you pointed out, you have tested LongTR for repeats up to 10k bp. I would like to know how you set the parameters for successful detection. I have a GGCCC repeat up to 20kb at a sequence depth of 50x and tried a similar command for the shorter repeat as follows:

LongTR --bams C9ORF72.sorted.bam --fasta genome.fa --regions test.bed --tr-vcf test.vcf.gz  --bam-samps C9ORF72 --bam-libs C9ORF72 --min-mean-qual -1 --max-tr-len 500000 --skip-assembly

Its log shows :

Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/reference/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
202 reads overlapped region, of which
        0 were hard clipped
        0 had an 'N' base call
        0 had low MAPQ
        0 had low base quality scores
        3 did not span the STR
        0 did not have a unique mapping
        167 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 167 reads and 0 out of 1 samples
Trimming reads
Failed to trim align 167 out of 167 reads
Generating candidate haplotypes
        ACCACGCCCC GGCCCCGGCCCCGGCCCC TAGCGCGCGA
Added 0 inexact haplotypes generated by POA
Aligning reads to each candidate haplotype
Allele counts
        GGCCCCGGCCCCGGCCCC 0
Filtered 1 sample genotypes for the following reasons:  1=NO_READS
Locus timing:
 BAM seek time       = 0.009893 seconds
 Read filtering      = 0.108375 seconds
 SNP info extraction = 0.003844 seconds
 Stutter estimation  = 3e-06 seconds
 Genotyping          = 0.070579 seconds
        Trim alignment        = 0.070193 seconds
         Haplotype generation  = 0.000128 seconds
         Haplotype alignment   = 3.1e-05 seconds

------HipSTR Execution Summary------
Genotyping succeeded for 1/1 loci

Approximate timing breakdown
 BAM seek time       = 0.009893 seconds
 Read filtering      = 0.108375 seconds
 SNP info extraction = 0.003844 seconds
 Genotyping          = 0.070579 seconds
         Trimming alignment        = 0.070193 seconds
         Haplotype generation  = 0.000128 seconds
         Haplotype alignment   = 3.1e-05 seconds
LongTR execution finished: Total runtime = 1.0596 sec
-----------------

And it VCF shows:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  C9ORF72_5_10R_NanoSim_50x
chr9    27573529    C9ORF72 GGCCCCGGCCCCGGCCCC  .   .   .   START=27573529;END=27573546;PERIOD=6;NSKIP=0;NFILT=0;INEXACT_ALLELE=.;DP=0;DSNP=0;DFLANKINDEL=0;AN=0;REFAC=0    GT:GB:Q:PQ:DP:DSNP:DFLANKINDEL:PDP:PSNP:GLDIFF:ALLREADS:MALLREADS

I am wondering why genotyping succeeded for 1/1 loci, but the VCF did not show the corresponding results, and how to make the trimming alignment succeed since I found all the reads failing at the trim align stage. Do you have any other suggestions on this kind of analysis?

Many thanks, Hsin