ksahlin / ultra

Long-read splice alignment with high accuracy

Failed in building index #24

Closed by FadelBerakdar 3 months ago

FadelBerakdar commented 4 months ago

Hello,

I am facing this issue when building the index using GENCODE GRCh38. I also tested with T2T + RefSeq annotation, with and without --disable_infer, and get the same error.

command: uLTRA index --disable_infer GRCh38.primary_assembly.genome.fa.gz gencode.v45.primary_assembly.annotation.gtf.gz output_dir

reference source from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/

log:

/mnt/storage2/users/ahberam1/LRgaspRNAseqBenchmark/workflow/.snakemake/conda/1007022e9f0043b1868092737a6163ba_/lib/python3.8/site-packages/gffutils/create.py:763: UserWarning: It appears you have a transcript feature in your GTF file. You may want to use the `disable_infer_transcripts=True` option to speed up database creation
  warnings.warn(
/mnt/storage2/users/ahberam1/LRgaspRNAseqBenchmark/workflow/.snakemake/conda/1007022e9f0043b1868092737a6163ba_/lib/python3.8/site-packages/gffutils/create.py:770: UserWarning: It appears you have a gene feature in your GTF file. You may want to use the `disable_infer_genes=True` option to speed up database creation
  warnings.warn(
creating /mnt/storage2/users/ahberam1/LRgaspRNAseqBenchmark/resources/refs/GRCh38_T2T_v2/indexes/uLTRA__default/index
Traceback (most recent call last):
  File "/mnt/storage2/users/ahberam1/LRgaspRNAseqBenchmark/workflow/.snakemake/conda/1007022e9f0043b1868092737a6163ba_/bin/uLTRA", line 555, in <module>
    prep_splicing(args, refs_lengths)
  File "/mnt/storage2/users/ahberam1/LRgaspRNAseqBenchmark/workflow/.snakemake/conda/1007022e9f0043b1868092737a6163ba_/bin/uLTRA", line 81, in prep_splicing
    max_intron_chr, exon_choordinates_to_id, chr_to_id, id_to_chr = augmented_gene.create_graph_from_exon_parts(db, args.flank_size, args.small_exon_threshold, args.min_segm, refs_lengths)
  File "/mnt/storage2/users/ahberam1/LRgaspRNAseqBenchmark/workflow/.snakemake/conda/1007022e9f0043b1868092737a6163ba_/lib/python3.8/site-packages/modules/create_augmented_gene.py", line 440, in create_graph_from_exon_parts
    assert active_start <= exon.start - 1

Problematic gene:


GL000194.1      ENSEMBL gene    53590   115018  .       -       .       gene_id "ENSG00000277400.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; level 3;
GL000194.1      ENSEMBL transcript      53590   115018  .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL exon    114986  115018  .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 1; exon_id "ENSE00002299440.2"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL exon    112792  112850  .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 2; exon_id "ENSE00003739295.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL exon    53590   55676   .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 3; exon_id "ENSE00003723764.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL CDS     53650   54021   .       -       0       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 3; exon_id "ENSE00003723764.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL start_codon     54019   54021   .       -       0       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 3; exon_id "ENSE00003723764.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL stop_codon      53647   53649   .       -       0       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 3; exon_id "ENSE00003723764.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL UTR     114986  115018  .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 1; exon_id "ENSE00002299440.2"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL UTR     112792  112850  .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 2; exon_id "ENSE00003739295.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL UTR     53590   53649   .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 3; exon_id "ENSE00003723764.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";
GL000194.1      ENSEMBL UTR     54022   55676   .       -       .       gene_id "ENSG00000277400.1"; transcript_id "ENST00000613230.1"; gene_type "protein_coding"; gene_name "ENSG00000277400"; transcript_type "protein_coding"; transcript_name "ENST00000613230"; exon_number 3; exon_id "ENSE00003723764.1"; level 3; protein_id "ENSP00000483280.1"; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical";

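In the records above, the exons of this minus-strand transcript are listed with descending start coordinates (114986, 112792, 53590), while the failing assertion compares `active_start` against `exon.start - 1`. A minimal diagnostic sketch, assuming the assertion is triggered by exons that are not in ascending start order, could list the affected transcripts. The function name `transcripts_with_unsorted_exons` is hypothetical and not part of uLTRA or gffutils.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in a GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_with_unsorted_exons(lines):
    """Return transcript IDs whose exon lines are not in ascending start order.

    `lines` is any iterable of tab-separated GTF lines (hypothetical helper,
    not part of uLTRA).
    """
    exon_starts = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are checked
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # Record exon starts in the order they appear in the file.
            exon_starts[match.group(1)].append(int(fields[3]))
    return [tid for tid, starts in exon_starts.items() if starts != sorted(starts)]
```

Passing the open GTF file handle (for example from `gzip.open(path, "rt")`) as `lines` would report every transcript whose exons appear out of ascending order.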
ksahlin commented 4 months ago

Hi @FadelBerakdar,

When developing uLTRA, my experience was that exons were ordered by start coordinate in the GTF file. So if you have a way to sort the exons by start coordinate, uLTRA will run without error.
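One way to do such a sort is sketched below. It is only an illustration: it orders all records by sequence name and start coordinate, with wider features first on ties so that gene and transcript lines stay ahead of their exons. The helper names `sort_gtf_lines` and `sort_gtf_file` are hypothetical, and the sketch has not been tested against uLTRA itself.

```python
import gzip


def sort_gtf_lines(lines):
    """Return header lines followed by records sorted by (seqname, start, -end)."""
    header, records = [], []
    for line in lines:
        if not line.strip():
            continue  # drop empty lines
        if line.startswith("#"):
            header.append(line)  # keep comment/header lines at the top
        else:
            records.append(line)

    def key(line):
        fields = line.split("\t")
        # GTF columns: 0 = seqname, 3 = start, 4 = end (1-based, inclusive).
        return (fields[0], int(fields[3]), -int(fields[4]))

    # sorted() is stable, so records with identical keys keep their file order.
    return header + sorted(records, key=key)


def sort_gtf_file(in_path, out_path):
    """Read a plain or gzipped GTF and write an uncompressed sorted copy."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin:
        sorted_lines = sort_gtf_lines(fin)
    with open(out_path, "w") as fout:
        fout.writelines(sorted_lines)
```

For example, `sort_gtf_file("annotation.gtf.gz", "annotation.sorted.gtf")` (placeholder file names) would write the sorted annotation, which could then be passed to `uLTRA index` in place of the original file.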

FadelBerakdar commented 3 months ago

Hi @ksahlin

Sorting the GTF file using gff3sort.pl solved the problem.

ksahlin commented 3 months ago

Great, thanks for reporting.