Closed kennethchenlab closed 3 years ago
Hi,
As part of the --sortByPos
parameter, TEcount is resorting the file by name (using samtools). It then tries to load the file (which is a hidden file in the same folder) for processing. I'm assuming that you have sufficient space on your system, but that's worth checking. One option is to resort the BAM file by read name (samtools sort -n ...
), and then run TEcount without the --sortByPos
parameter.
Also, please make sure that the chromosome names between your gene and TE GTF files are the same. I noticed that you're using GENCODE gene GTF, and our older GRCh38 RepeatMasker file. We have now created GRCh38 that are compatible with either Ensembl (which is likely the one you're using), and GENCODE (which might be the one you need).
Thanks.
It might also be a strange quirk in the newer version of pysam/htslib, where it throws this message out, but doesn't appear to cause any issues. We have also seen this in our TEcount runs, but we still managed to get the expected output. We will look into it further if the error is still occurring for you.
Thanks.
Looks like it's working for me now! I downloaded the new gencode rmsk gtf, and I still get the "could not retrieve index file ..." error, but like you said, it doesn't cause any issues.
It looks like my TEcount output files only counts alignments to TE's, but not genes. When I count reads with stringtie using the exact same genic GTF, it counts lots of alignments to genes. Should I merge the two?
May I ask for the following?
samtools view -H [your BAM]
Thanks.
Thanks!
samtools view -H WT0117_S112.forTE.bam header.txt
Top of gene GTF:
##description: evidence-based annotation of the human genome (GRCh38), version 26 (Ensembl 88)
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2017-03-14
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_name "MIR6859-1"; level 3;
chr1 ENSEMBL transcript 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_name "MIR6859-1-201"; level 3; transcript_support_level "NA"; tag "basic";
chr1 ENSEMBL exon 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; transcript_support_level "NA"; tag "basic";
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
Top of TE GTF:
chr1 hg38_rmsk exon 100000001 100000637 11325 - . gene_id "L1M2"; transcript_id "L1M2_dup229"; family_id "L1"; class_id "LINE";
chr1 hg38_rmsk exon 10000002 10000239 1760 + . gene_id "AluSx3"; transcript_id "AluSx3_dup157"; family_id "Alu"; class_id "SINE";
chr1 hg38_rmsk exon 100000744 100002612 11325 - . gene_id "L1M2"; transcript_id "L1M2_dup230"; family_id "L1"; class_id "LINE";
chr1 hg38_rmsk exon 10000251 10000566 2225 + . gene_id "AluSx"; transcript_id "AluSx_dup700"; family_id "Alu"; class_id "SINE";
chr1 hg38_rmsk exon 100002613 100002913 1799 - . gene_id "AluJr"; transcript_id "AluJr_dup3513"; family_id "Alu"; class_id "SINE";
chr1 hg38_rmsk exon 100002914 100003133 11325 - . gene_id "L1M2"; transcript_id "L1M2_dup231"; family_id "L1"; class_id "LINE";
chr1 hg38_rmsk exon 100003134 100003444 2452 + . gene_id "AluSq2"; transcript_id "AluSq2_dup2641"; family_id "Alu"; class_id "SINE";
Top of TEcount output:
gene/TE /pathTo/WT0117_S112.forTE.bam
"ENSG00000000003.14" 0
"ENSG00000000005.5" 0
"ENSG00000000419.12" 0
"ENSG00000000457.13" 0
"ENSG00000000460.16" 0
"ENSG00000000938.12" 0
"ENSG00000000971.15" 0
"ENSG00000001036.13" 0
"ENSG00000001084.10" 0
TE command:
~/bin/TEcount --sortByPos -b /pathTo/WT0117_S112.forTE.bam \
--GTF /pathTo/GRCh38_gencode_v26.pseudogene_removed.gtf \
--TE /pathTo/GRCh38_GENCODE_rmsk_TE.gtf \
--project pathTo/TEcount/WT0117_S112
TEcount output: WT0117_S112.txt
Hi,
I just took a quick look at your BAM headers, and it looks like the chromosome names are different from GENCODE (e.g. no chr
prefix in the sequence name).
@SQ SN:1 LN:248956422
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:2 LN:242193529
I'm assuming that you are aligning to the Ensembl GRCh38 FASTA, and not the GENCODE GRCh38 FASTA. Unfortunately, they named their canonical chromosomes differently (with Ensembl lacking the chr
prefix), and thus nearly all the genes (which are "located" on chromosomes with the chr
prefix) will not be matched to the current alignments.
There are two solutions:
# Modify sequence headers in the Ensembl FASTA to match the file
# "GRCh38.primary_assembly.genome.fa" from GENCODE. Unplaced and unlocalized
# sequences such as "KI270728.1" have the same names in both versions.
#
# Input FASTA:
# >1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
#
# Output FASTA:
# >chr1 1
# sed commands:
# 1. Replace metadata after space with original contig name, as in GENCODE
# 2. Add "chr" to names of autosomes and sex chromosomes
# 3. Handle the mitochrondrial chromosome
cat [Ensembl FASTA] \
| sed -E 's/^>(\S+).*/>\1 \1/' \
| sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
| sed -E 's/^>MT />chrM /' \
> [Modified FASTA]
# 1. Fix mitrochondrial chromosome names
# 2. Remove chr prefix
cat [GENCODE GTF] | sed 's/^chrM/MT/;s/^chr//' > [Ensembl compatible GTF]
Let me know if these fixes did not help, and I can troubleshoot further.
Thanks.
🤦 That was it! Thanks!
Hi, I am also encountering the same warning but I still get outputs in the end.
Hi,
Thanks for letting us know. This is definitely a quirk in the newer samtools/pysam version (see this issue), but it does not seem to affect the output from the software. It is possible to suppress the warning, but we worry about suppressing actual errors, and thus decided not to patch this yet.
Thanks.
I'm getting an error message regarding a file that doesn't exist:
I'm inputting this:
~/bin/TEcount --sortByPos -b ~/pathTo/WT0117_S112.bam --GTF ~/pathTo/GRCh38_gencodev26.gtf --TE ~/pathTo/GRCh38_rmsk_TE.gtf --project TEcount117
The input bam is sorted and indexed, and there is no file anywhere named '.1606144755.39.bam'. Why is it looking for this filename?