mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
217 stars 29 forks source link

Could not retrieve index file for '.1606144755.39.bam' #82

Closed kennethchenlab closed 3 years ago

kennethchenlab commented 3 years ago

I'm getting an error message regarding a file that doesn't exist:

INFO  @ Mon, 23 Nov 2020 09:05:26: Processing GTF files ... 

INFO  @ Mon, 23 Nov 2020 09:05:26: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
INFO  @ Mon, 23 Nov 2020 09:12:24: Done building gene index ...... 

INFO  @ Mon, 23 Nov 2020 09:12:46: Building TE index ....... 

INFO  @ Mon, 23 Nov 2020 09:19:15: Done building TE index ...... 

INFO  @ Mon, 23 Nov 2020 09:19:15: 
Reading sample file ... 

[E::idx_find_and_load] Could not retrieve index file for '.1606144755.39.bam'
uniq te counts = 0 

I'm inputting this:

~/bin/TEcount --sortByPos -b ~/pathTo/WT0117_S112.bam --GTF ~/pathTo/GRCh38_gencodev26.gtf --TE ~/pathTo/GRCh38_rmsk_TE.gtf --project TEcount117

The input bam is sorted and indexed, and there is no file anywhere named '.1606144755.39.bam'. Why is it looking for this filename?

olivertam commented 3 years ago

Hi,

As part of the --sortByPos parameter, TEcount is resorting the file by name (using samtools). It then tries to load the file (which is a hidden file in the same folder) for processing. I'm assuming that you have sufficient space on your system, but that's worth checking. One option is to resort the BAM file by read name (samtools sort -n ...), and then run TEcount without the --sortByPos parameter.

Also, please make sure that the chromosome names between your gene and TE GTF files are the same. I noticed that you're using GENCODE gene GTF, and our older GRCh38 RepeatMasker file. We have now created GRCh38 that are compatible with either Ensembl (which is likely the one you're using), and GENCODE (which might be the one you need).

Thanks.

olivertam commented 3 years ago

It might also be a strange quirk in the newer version of pysam/htslib, where it throws this message out, but doesn't appear to cause any issues. We have also seen this in our TEcount runs, but we still managed to get the expected output. We will look into it further if the error is still occurring for you.

Thanks.

kennethchenlab commented 3 years ago

Looks like it's working for me now! I downloaded the new gencode rmsk gtf, and I still get the "could not retrieve index file ..." error, but like you said, it doesn't cause any issues.

kennethchenlab commented 3 years ago

It looks like my TEcount output files only counts alignments to TE's, but not genes. When I count reads with stringtie using the exact same genic GTF, it counts lots of alignments to genes. Should I merge the two?

olivertam commented 3 years ago

May I ask for the following?

Thanks.

kennethchenlab commented 3 years ago

Thanks!

samtools view -H WT0117_S112.forTE.bam header.txt

Top of gene GTF:

##description: evidence-based annotation of the human genome (GRCh38), version 26 (Ensembl 88)
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2017-03-14
chr1    ENSEMBL gene    17369   17436   .   -   .   gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_name "MIR6859-1"; level 3;
chr1    ENSEMBL transcript  17369   17436   .   -   .   gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_name "MIR6859-1-201"; level 3; transcript_support_level "NA"; tag "basic";
chr1    ENSEMBL exon    17369   17436   .   -   .   gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; transcript_support_level "NA"; tag "basic";
chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

Top of TE GTF:

chr1    hg38_rmsk   exon    100000001   100000637   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup229"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    10000002    10000239    1760    +   .   gene_id "AluSx3"; transcript_id "AluSx3_dup157"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100000744   100002612   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup230"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    10000251    10000566    2225    +   .   gene_id "AluSx"; transcript_id "AluSx_dup700"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100002613   100002913   1799    -   .   gene_id "AluJr"; transcript_id "AluJr_dup3513"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100002914   100003133   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup231"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    100003134   100003444   2452    +   .   gene_id "AluSq2"; transcript_id "AluSq2_dup2641"; family_id "Alu"; class_id "SINE";

Top of TEcount output:

gene/TE /pathTo/WT0117_S112.forTE.bam
"ENSG00000000003.14"    0
"ENSG00000000005.5" 0
"ENSG00000000419.12"    0
"ENSG00000000457.13"    0
"ENSG00000000460.16"    0
"ENSG00000000938.12"    0
"ENSG00000000971.15"    0
"ENSG00000001036.13"    0
"ENSG00000001084.10"    0

TE command:

~/bin/TEcount --sortByPos -b /pathTo/WT0117_S112.forTE.bam \
     --GTF /pathTo/GRCh38_gencode_v26.pseudogene_removed.gtf \
     --TE /pathTo/GRCh38_GENCODE_rmsk_TE.gtf \
     --project pathTo/TEcount/WT0117_S112

TEcount output: WT0117_S112.txt

olivertam commented 3 years ago

Hi,

I just took a quick look at your BAM headers, and it looks like the chromosome names are different from GENCODE (e.g. no chr prefix in the sequence name).

@SQ SN:1    LN:248956422
@SQ SN:10   LN:133797422
@SQ SN:11   LN:135086622
@SQ SN:12   LN:133275309
@SQ SN:13   LN:114364328
@SQ SN:14   LN:107043718
@SQ SN:15   LN:101991189
@SQ SN:16   LN:90338345
@SQ SN:17   LN:83257441
@SQ SN:18   LN:80373285
@SQ SN:19   LN:58617616
@SQ SN:2    LN:242193529

I'm assuming that you are aligning to the Ensembl GRCh38 FASTA, and not the GENCODE GRCh38 FASTA. Unfortunately, they named their canonical chromosomes differently (with Ensembl lacking the chr prefix), and thus nearly all the genes (which are "located" on chromosomes with the chr prefix) will not be matched to the current alignments.

There are two solutions:

  1. You can convert the Ensembl FASTA file into a GENCODE compatible one with the following code (Source: 10x Genomics), and then remap.
    # Modify sequence headers in the Ensembl FASTA to match the file
    # "GRCh38.primary_assembly.genome.fa" from GENCODE. Unplaced and unlocalized
    # sequences such as "KI270728.1" have the same names in both versions.
    #
    # Input FASTA:
    #   >1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
    #
    # Output FASTA:
    #   >chr1 1
    # sed commands:
    # 1. Replace metadata after space with original contig name, as in GENCODE
    # 2. Add "chr" to names of autosomes and sex chromosomes
    # 3. Handle the mitochrondrial chromosome
    cat [Ensembl FASTA] \
    | sed -E 's/^>(\S+).*/>\1 \1/' \
    | sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
    | sed -E 's/^>MT />chrM /' \
    > [Modified FASTA]
  2. You can convert the GENCODE GTF file to be Ensembl compatible, and then use that and the GRCh38_Ensembl_rmsk_TE.gtf file for TEcount. To convert GENCODE GTF:
    # 1. Fix mitrochondrial chromosome names
    # 2. Remove chr prefix
    cat [GENCODE GTF] | sed 's/^chrM/MT/;s/^chr//' > [Ensembl compatible GTF]

    Let me know if these fixes did not help, and I can troubleshoot further.

Thanks.

kennethchenlab commented 3 years ago

🤦 That was it! Thanks!

emattei commented 3 years ago

Hi, I am also encountering the same warning but I still get outputs in the end.

olivertam commented 3 years ago

Hi,

Thanks for letting us know. This is definitely a quirk in the newer samtools/pysam version (see this issue), but it does not seem to affect the output from the software. It is possible to suppress the warning, but we worry about suppressing actual errors, and thus decided not to patch this yet.

Thanks.