mlbendall / telescope

Quantification of transposable element expression using RNA-seq
MIT License
64 stars 17 forks source link

AssertionError: Error #28

Open AssumeAssume opened 3 years ago

AssumeAssume commented 3 years ago

Hello, Recently I met a problem when annotating the TE at family level, here was the error report:

2021-11-02 17:57:16 INFO
Version:                          1.0.3.1
Input Options
    samfile:                      KO_rep1.sorted.bam
    gtffile:                       GRCh38_GENCODE_rmsk_TE.gtf
    attribute:                    gene_id
    no_feature_key:               __no_feature
    ncpu:                         1
    tempdir:                      None
Reporting Options
    quiet:                        False
    debug:                        False
    logfile:                      <stderr>
    outdir:                       RepeatMasker/KMT2D_KO_rep1.sorted.bam
    exp_tag:                      telescope
    updated_sam:                  False
Run Modes
    reassign_mode:                choose
    conf_prob:                    0.9
    overlap_mode:                 threshold
    overlap_threshold:            0.7
    annotation_class:             intervaltree
Model Parameters
    pi_prior:                     0
    theta_prior:                  200000
    em_epsilon:                   1e-07
    max_iter:                     100
    use_likelihood:               False
    skip_em:                      False
 (from run in telescope_assign.py:197)
2021-11-02 17:57:16 INFO     Loading annotation...                                        (from run in telescope_assign.py:205)
Traceback (most recent call last):
  File "/LiuLab/software/miniconda3/envs/telescope_env/bin/telescope", line 8, in <module>
    sys.exit(main())
  File "/LiuLab/software/miniconda3/envs/telescope_env/lib/python3.6/site-packages/telescope/__main__.py", line 95, in main
    args.func(args)
  File "/LiuLab/software/miniconda3/envs/telescope_env/lib/python3.6/site-packages/telescope/telescope_assign.py", line 207, in run
    annot = Annotation(opts.gtffile, opts.attribute)
  File "/LiuLab/software/miniconda3/envs/telescope_env/lib/python3.6/site-packages/telescope/utils/_annotation_intervaltree.py", line 58, in __init__
    assert len(mergeable) == 1, "Error"
AssertionError: Error

Here was a preview of the GTF file

chr1    hg38_rmsk   exon    100000001   100000637   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup229"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    10000002    10000239    1760    +   .   gene_id "AluSx3"; transcript_id "AluSx3_dup157"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100000744   100002612   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup230"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    10000251    10000566    2225    +   .   gene_id "AluSx"; transcript_id "AluSx_dup700"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100002613   100002913   1799    -   .   gene_id "AluJr"; transcript_id "AluJr_dup3513"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100002914   100003133   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup231"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    100003134   100003444   2452    +   .   gene_id "AluSq2"; transcript_id "AluSq2_dup2641"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100003445   100004305   6882    -   .   gene_id "L1M2"; transcript_id "L1M2_dup232"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    100004306   100004578   1643    -   .   gene_id "AluJb"; transcript_id "AluJb_dup6336"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100004579   100004715   6794    -   .   gene_id "L1M2"; transcript_id "L1M2_dup233"; family_id "L1"; class_id "LINE";

I've found the origin of the problem, was a function called merge_intervals at _annotation_intervaltree.py, it could only merge two overlapped region, when the number of regions raised to three or more, the script would raise these error, like this:

KI270805.1  hg38_rmsk   exon    100001  100045  271 -   .   gene_id "LTR1B"; transcript_id "LTR1B_dup330"; family_id "ERV1"; class_id "LTR";
KI270805.1  hg38_rmsk   exon    100046  100092  282 -   .   gene_id "LTR1B"; transcript_id "LTR1B_dup331"; family_id "ERV1"; class_id "LTR";
KI270805.1  hg38_rmsk   exon    100091  100137  282 -   .   gene_id "LTR1B"; transcript_id "LTR1B_dup332"; family_id "ERV1"; class_id "LTR";
KI270805.1  hg38_rmsk   exon    100136  100182  254 -   .   gene_id "LTR1B"; transcript_id "LTR1B_dup333"; family_id "ERV1"; class_id "LTR";
KI270805.1  hg38_rmsk   exon    100181  100227  261 -   .   gene_id "LTR1B"; transcript_id "LTR1B_dup334"; family_id "ERV1"; class_id "LTR";

The simplest solution for me was to remove those contigs in the GTF file, cat GRCh38_GENCODE_rmsk_TE.gtf|grep -i chr > new_TE_remove_contig.gtf But I think it would be very nice of you to define a new function which could merge several overlapped regions rather than only two.

Thanks, Xiufeng