gymrek-lab / LongTR

Tandem repeat genotyping with long reads
GNU General Public License v2.0
22 stars 0 forks source link

No repeat were detected #12

Closed HLHsieh closed 3 months ago

HLHsieh commented 3 months ago

Hi there,

After your suggestions, I can run LongTR on my data as follows:

LongTR --bams C9ORF72_1_10R.sorted.bam --fasta $dir/genome.fa --regions $dir/test.bed --tr-vcf $dir/test.vcf.gz --phased-bam --bam-samps C9ORF72_1 --bam-libs C9ORF72_1

Message showed:

Using phased BAM tags to genotype and phase TRs (WARNING: Any arguments provided to --snp-vcf will be ignored)
Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
111 reads overlapped region, of which
    0 were hard clipped
    0 had an 'N' base call
    0 had low MAPQ
    111 had low base quality scores
    0 did not span the STR
    0 did not have a unique mapping
    0 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 0 reads
Skipping locus with too few reads: TOTAL=0, MIN=10

------HipSTR Execution Summary------
Skipped 1 loci with too few reads for stutter model model training or genotyping.
     If this is a sizeable portion of your loci, see the --min-reads command line option
Genotyping succeeded for 0/0 loci

Approximate timing breakdown
 BAM seek time       = 0.007054 seconds
 Read filtering      = 0.065989 seconds
 SNP info extraction = 2.6e-05 seconds
 Genotyping          = 0 seconds
     Trimming alignment        = 0 seconds
     Haplotype generation  = 0 seconds
     Haplotype alignment   = 0 seconds
LongTR execution finished: Total runtime = 3.21404 sec
-----------------

Therefore, no repeats were detected. This data should contain some repeat regions. I am wondering if there is something wrong with the command or if you have any suggestions on how I can adjust it accordingly.

Many thanks, Hsin

heliziii commented 3 months ago

Hi Hsin,

It seems that your locus is skipped because all overlapping reads had low quality, you can decrease the threshold by specifying a value for --min-mean-qual, the default value is 30.

Best, Helia

HLHsieh commented 3 months ago

Hi Helia,

Thank you for your quick response. I was trying the following command, but the same error message returned.

LongTR --bams C9ORF72_1_10R.sorted.bam --fasta $dir/genome.fa --regions $dir/test.bed --tr-vcf $dir/test.vcf.gz --phased-bam --bam-samps C9ORF72_1 --bam-libs C9ORF72_1 --min-mean-qual 0 --min-reads 1
Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
111 reads overlapped region, of which
    0 were hard clipped
    0 had an 'N' base call
    0 had low MAPQ
    111 had low base quality scores
    0 did not span the STR
    0 did not have a unique mapping
    0 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 0 reads and 0 out of 0 samples
Skipping locus with too few reads: TOTAL=0, MIN=1

------HipSTR Execution Summary------
Skipped 1 loci with too few reads for stutter model model training or genotyping.
     If this is a sizeable portion of your loci, see the --min-reads command line option
Genotyping succeeded for 0/0 loci

Approximate timing breakdown
 BAM seek time       = 0.006804 seconds
 Read filtering      = 0.065036 seconds
 SNP info extraction = 4.3e-05 seconds
 Genotyping          = 0 seconds
     Trimming alignment        = 0 seconds
     Haplotype generation  = 0 seconds
     Haplotype alignment   = 0 seconds
LongTR execution finished: Total runtime = 3.1865 sec
-----------------

Any suggestions would be appreciated.

Many thanks, Hsin

wdecoster commented 3 months ago

I have had to specify "-1" for that parameter to avoid removing all (nanopore) reads.

HLHsieh commented 3 months ago

Hi,

Thanks. Specified "-1" for that parameter can include all reads, but no repeats were detected in the output file.

Log:

Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
111 reads overlapped region, of which
        0 were hard clipped
        0 had an 'N' base call
        0 had low MAPQ
        0 had low base quality scores
        0 did not span the STR
        0 did not have a unique mapping
        111 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 111 reads and 0 out of 1 samples
Trimming reads
Failed to trim align 6 out of 111 reads
Generating candidate haplotypes
        GCCCCGGGCCCGCCCCCGGGCCCGCCCCGACCACG CCCCGGCCCCGGCCCCGGCCCC                                                                               TAGCGCGCGACTCCTGAGTTCCAGAGCTTGCTACA
                                            GCCCCGGCCCCGGCCCCGGCCCC
                                            CCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCC

                                            GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGC...GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCC

Added 2 inexact haplotypes generated by POA
Aborting genotyping of the locus as the sequence upstream of the repeat is too repetitive for accurate genotyping
        Flanking sequence = GCCCCGGGCCCGCCCCCGGGCCCGCCCCGACCACG
Locus timing:
 BAM seek time       = 0.0069 seconds
 Read filtering      = 0.085952 seconds
 SNP info extraction = 0.005669 seconds
 Stutter estimation  = 4e-06 seconds
 Genotyping          = 0.067338 seconds
        Trim alignment        = 0.057342 seconds
         Haplotype generation  = 0.009429 seconds
         Haplotype alignment   = 0 seconds

------HipSTR Execution Summary------
Genotyping succeeded for 0/1 loci

Approximate timing breakdown
 BAM seek time       = 0.0069 seconds
 Read filtering      = 0.085952 seconds
 SNP info extraction = 0.005669 seconds
 Genotyping          = 0.067338 seconds
         Trimming alignment        = 0.057342 seconds
         Haplotype generation  = 0.009429 seconds
         Haplotype alignment   = 0 seconds
LongTR execution finished: Total runtime = 3.27426 sec
-----------------

VCF file

##fileformat=VCFv4.1
##command=LongTR-638942f-dirty --bams C9ORF72_1_10R_NanoSim_50x.sorted.bam --fasta /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/genome.fa --regions /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed --tr-vcf /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.vcf.gz --bam-samps C9ORF72_1 --bam-libs C9ORF72_1 --min-mean-qual -1 --min-reads 1
##reference=/scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/genome.fa
##contig=<ID=chrM,length=16569>
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##INFO=<ID=INEXACT_ALLELE,Number=A,Type=Integer,Description="Boolean showing if each alternate allele is exact or approximated by POA, 0 for exact 1 for approximated.">
##INFO=<ID=BPDIFFS,Number=A,Type=Integer,Description="Base pair difference of each alternate allele from the reference allele">
##INFO=<ID=START,Number=1,Type=Integer,Description="Inclusive start coodinate for the repetitive portion of the reference allele">
##INFO=<ID=END,Number=1,Type=Integer,Description="Inclusive end coordinate for the repetitive portion of the reference allele">
##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Length of STR motif">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=REFAC,Number=1,Type=Integer,Description="Reference allele count">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate allele counts">
##INFO=<ID=NSKIP,Number=1,Type=Integer,Description="Number of samples not genotyped due to various issues">
##INFO=<ID=NFILT,Number=1,Type=Integer,Description="Number of samples whose genotypes were filtered due to various issues">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total number of valid reads used to genotype all samples">
##INFO=<ID=DSNP,Number=1,Type=Integer,Description="Total number of reads with SNP phasing information">
##INFO=<ID=DFLANKINDEL,Number=1,Type=Integer,Description="Total number of reads with an indel in the regions flanking the STR">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GB,Number=1,Type=String,Description="Base pair differences of genotype from reference">
##FORMAT=<ID=Q,Number=1,Type=Float,Description="Posterior probability of unphased genotype">
##FORMAT=<ID=PQ,Number=1,Type=Float,Description="Posterior probability of phased genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of valid reads used for sample's genotype">
##FORMAT=<ID=DSNP,Number=1,Type=Integer,Description="Number of reads with SNP phasing information">
##FORMAT=<ID=PSNP,Number=1,Type=String,Description="Number of reads with SNPs supporting each haploid genotype">
##FORMAT=<ID=PDP,Number=1,Type=String,Description="Fractional reads supporting each haploid genotype">
##FORMAT=<ID=GLDIFF,Number=1,Type=Float,Description="Difference in likelihood between the reported and next best genotypes">
##FORMAT=<ID=DFLANKINDEL,Number=1,Type=Integer,Description="Number of reads with an indel in the regions flanking the STR">
##FORMAT=<ID=AB,Number=1,Type=Float,Description="log10 of the allele bias pvalue, where 0 is no bias and more negative values are increasingly biased. 0 for all homozygous genotypes">
##FORMAT=<ID=FS,Number=1,Type=Float,Description="log10 of the strand bias pvalue from Fisher's exact test, where 0 is no bias and more negative values are increasingly biased. 0 for all homozygous genotypes">
##FORMAT=<ID=DAB,Number=1,Type=Integer,Description="Number of reads used in the AB and FS calculations">
##FORMAT=<ID=ALLREADS,Number=1,Type=String,Description="Base pair difference observed in each read's Needleman-Wunsch alignment">
##FORMAT=<ID=MALLREADS,Number=1,Type=String,Description="Maximum likelihood bp diff in each read based on haplotype alignments for reads that span the repeat region by at least 5 base pairs">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  C9ORF72_1

My test bed file:

chr9    27573529        27573546        6       3       C9orf72

Any suggestions would be appreciated.

Many thanks, Hsin

heliziii commented 3 months ago

Hi Hsin,

You are getting this error because LongTR is trying to assemble the flanking sequence, which is unnecessary with long reads, we recommend adding --skip-assembly to the command when using long reads to avoid this step.

Best, Helia