mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.
GNU General Public License v3.0
21 stars 8 forks source link

Could not retrieve index file #14

Closed SJRussell closed 2 years ago

SJRussell commented 2 years ago

Hello and thank you for your amazing work. I have a small issue that I'm hoping is not affecting my results. I've run TElocal on some paired end RNA seq data, aligned with STAR, using the following command: ls | time parallel -j+0 --eta 'echo {} && TElocal --sortByPos -b {} --GTF /data/hg38_map_files/hg38.geneset.chr.ERCC.gtf --TE ~/GRCh38_GENCODE_rmsk_TE.gtf.locInd --project ../counts/{}'

It seems to have executed correctly, given that I'm getting gene and TE counts (example output below). However, right after 'Reading sample file ...' I get the error: [E::idx_find_and_load] Could not retrieve index file for '.1627488686.429493bam' Which I actually get this same error for every sample, with different numbers. Here are a couple others: '.1627488622.8951373bam' , '.1627488685.1249485bam'

I'm not even sure what this index file refers to. Any suggestions would be greatly appreciated, including comments on how "normal" my results may be.

Thank you!

Example output:

INFO  @ Wed, 28 Jul 2021 11:46:30:
# ARGUMENTS LIST:
# name = ../counts/C5Aligned.sortedByCoord.out.bam
# BAM file = C5Aligned.sortedByCoord.out.bam
# GTF file = /data/hg38_map_files/hg38.geneset.chr.ERCC.gtf
# TE file = /home/stewart/GRCh38_GENCODE_rmsk_TE.gtf.locInd
# multi-mapper mode = multi
# stranded = no
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Wed, 28 Jul 2021 11:46:30: Processing annotation files ...

INFO  @ Wed, 28 Jul 2021 11:46:30: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
INFO  @ Wed, 28 Jul 2021 12:08:00: Done building gene index ......

INFO  @ Wed, 28 Jul 2021 12:08:00: Attempting to load TE index ......

INFO  @ Wed, 28 Jul 2021 12:09:36: TE index loaded ......

INFO  @ Wed, 28 Jul 2021 12:09:36:
Reading sample file ...

[E::idx_find_and_load] Could not retrieve index file for '.1627488576.5132291bam'
uniq te counts = 8473240.999999447
.......start iterative optimization ..........
multi-reads = 806835 SQUAREM iteraton [1]
num of multi reads = 806835
total multi counts = 796715.0000000044
num of multi reads = 806835
total multi counts = 796715.0000000135
alpha = 1.0, SQUAREM iteraton [2]
num of multi reads = 806835
total multi counts = 796715.0000000114
num of multi reads = 806835
total multi counts = 796714.9999999789
num of multi reads = 806835                                                                                                                                                                                                         [60/1818]
total multi counts = 796714.9999999789
alpha = 1.627057572632444.
 Performing a stabilization step.
num of multi reads = 806835
total multi counts = 796714.9999999985
alpha = 1.627057572632444, SQUAREM iteraton [3]
num of multi reads = 806835
total multi counts = 796714.9999999212
num of multi reads = 806835
total multi counts = 796714.9999998825
alpha = 2.115330360700014.
 Performing a stabilization step.
num of multi reads = 806835
total multi counts = 796714.9999999366
alpha = 2.115330360700014, SQUAREM iteraton [4]
num of multi reads = 806835
total multi counts = 796714.9999998373
num of multi reads = 806835
total multi counts = 796714.9999997034
alpha = 2.908663026518061.
 Performing a stabilization step.
num of multi reads = 806835
total multi counts = 796714.9999999185
alpha = 2.908663026518061, SQUAREM iteraton [5]
num of multi reads = 806835
total multi counts = 796714.9999997807
num of multi reads = 806835
total multi counts = 796714.999999644
rNome < OPT_TOL
converge at iteration 5
num of multi reads = 806835
TE counts total 9269955.999995826
Gene counts total 2104478.6206290685

In library C5Aligned.sortedByCoord.out.bam:
Total annotated reads = 11374434.620624894
Total non-uniquely mapped reads = 1866817
Total unannotated reads = 4776661
olivertam commented 2 years ago

Hi,

Thank you for your interest in the software. This has been also reported in TEtranscripts (mhammell-laboratory/TEtranscripts#82), and it is related to a quirk in newer versions of pysam (pysam-developers/pysam#939). In our experience, it doesn't affect the output of the script, and thus your results should be fine. One thing I would confirm is whether your RNA-seq library is stranded, as the default parameter expects unstranded RNA-seq library. This can significantly change your quantification.

Thanks.

SJRussell commented 2 years ago

Thanks for the prompt reply! Glad to hear the error won't affect the results. I am using unstranded RNA lib prep so it should be correct. You can close the issue, but one last question: is it typical in your experience to see 5x more TE transcripts annotated compared with genes? I'm working with some extracellular samples of secreted RNA so this is very interesting. And my positive cellular controls show an inverse relationship, with 10x more genes than TEs annotated. Thanks again,

Stewart

olivertam commented 2 years ago

Hi,

We haven't worked with secreted RNA samples, so we definitely haven't noticed that. Sounds very interesting! All the best.

Thanks.