mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated. E::bgzf_read] bgzf_read_block error -1 after 0 of 4 bytes-ERROR when reading sample BAM files #168

Closed yahanlian closed 5 months ago

yahanlian commented 6 months ago

Hi, thank you so much for your watching!

I encountered an error when using TEtranscripts to analyze bam data. TEtranscripts --format BAM --mode multi -t /home/ylian/testTEdata/SRR8703717.bam -c /home/ylian/testTEdata/SRR8703700.bam --GTF /home/ylian/generef/T2Tref/genegtf/ncbi_dataset/data/GCF_009914755.1/genomic.gtf --TE /home/ylian/generef/T2T_TEref/T2T_CHM13_v2_rmsk_TE.gtf --project sample_nosort_test

the error is

INFO  @ Sat, 16 Dec 2023 06:12:48: Processing GTF files ...

INFO  @ Sat, 16 Dec 2023 06:12:48: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
1800000 GTF lines processed.
1900000 GTF lines processed.
2000000 GTF lines processed.
2100000 GTF lines processed.
INFO  @ Sat, 16 Dec 2023 06:24:09: Done building gene index ......

INFO  @ Sat, 16 Dec 2023 06:26:31: Building TE index .......

INFO  @ Sat, 16 Dec 2023 06:32:39: Done building TE index ......

INFO  @ Sat, 16 Dec 2023 06:32:39:
Reading sample files ...

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
[E::bgzf_read] bgzf_read_block error -1 after 0 of 4 bytes
2000000 alignments processed.
8000000 alignments processed.
9000000 alignments processed.
3000000 alignments processed.
Error: truncated file
[Exception type: OSError, raised in calignmentfile.pyx:1642]

My samtools version is 1.3.1, however, I created a new environment for TEtranscripts, and I just installed the dependencies (DESeq2, pysam), I have checked my BAM file(not sort) using the following code:

samtools view SRR8703717.bam|head
samtools view SRR8703717.bam|tail

It did not report the same error, so I would like to ask how could I to solve this problem? Could it is because I use the different version of samtools? I use hisat2 to align the sequences to acquire my SAM file, and then use a new version of samtools to create my BAM file. Besides, I have tested my data using TEcounts and acquired the final file without reporting any error.

Thanks for your time!!!!

olivertam commented 6 months ago

Hi,

Thank you for your interest in the software. Could you run the following for me?

$ samtools quickcheck [BAM file that threw the error]

If it comes back fine, then we might need to dig into that particular BAM further.

Thanks.

olivertam commented 6 months ago

Also, could you confirm the chromosomes in your gene GTF file matches the TE GTF file? I can't remember if they are using the chr... nomenclature or the NC_.... nomenclature.

Thanks.

yahanlian commented 6 months ago

Hi, thank you so much for your reply. I have run the samtools quickcheck code, it did not report any result, and I checked my GTF file, my gene GTF file uses the NC_.... nomenclature, but I run the following code using my SAM file, it completed the whole process and just reported the error of DESeq2 since I did not using the duplicate data. The whole document is

/home/ylian/anaconda3/envs/TEtranscripts/bin/TEtranscripts:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('TEtranscripts==2.2.3', 'TEtranscripts')
INFO  @ Sat, 16 Dec 2023 08:03:47:
# ARGUMENTS LIST:
# name = sample_nosort_test
# treatment files = ['/home/ylian/testTEdata/SRR8703717.sam']
# control files = ['/home/ylian/testTEdata/SRR8703700.sam']
# GTF file = /home/ylian/generef/T2Tref/genegtf/ncbi_dataset/data/GCF_009914755.1/genomic.gtf
# TE file = /home/ylian/generef/T2T_TEref/T2T_CHM13_v2_rmsk_TE.gtf
# multi-mapper mode = multi
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = True

INFO  @ Sat, 16 Dec 2023 08:03:47: Processing GTF files ...

INFO  @ Sat, 16 Dec 2023 08:03:47: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
1800000 GTF lines processed.
1900000 GTF lines processed.
2000000 GTF lines processed.
2100000 GTF lines processed.
INFO  @ Sat, 16 Dec 2023 08:14:42: Done building gene index ......

INFO  @ Sat, 16 Dec 2023 08:17:03: Building TE index .......

INFO  @ Sat, 16 Dec 2023 08:23:23: Done building TE index ......

INFO  @ Sat, 16 Dec 2023 08:23:23:
Reading sample files ...

2000000 alignments processed.
8000000 alignments processed.
9000000 alignments processed.
3000000 alignments processed.
uniq te counts = 2569378
.......start iterative optimization ..........
multi-reads = 221961 total means = 911
after normalization total means0 = 1.0000000000000864
SQUAREM iteraton [1]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1135
after normalization total means = 0.999999999999997
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1160
after normalization total means = 1.000000000000029
alpha = 1.0, SQUAREM iteraton [2]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1167
after normalization total means = 1.0000000000000127
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1169
after normalization total means = 1.0000000000000235
alpha = 2.707459412804499.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1166
after normalization total means = 1.0000000000000249
alpha = 2.707459412804499, SQUAREM iteraton [3]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1166
after normalization total means = 1.000000000000019
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1166
after normalization total means = 1.0000000000000207
alpha = 3.2978241010676825.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000113
alpha = 3.2978241010676825, SQUAREM iteraton [4]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000606
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000024
alpha = 4.0.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000182
alpha = 4.0, SQUAREM iteraton [5]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000269
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000338
alpha = 5.831937823437688.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1164
after normalization total means = 1.0000000000000426
alpha = 5.831937823437688, SQUAREM iteraton [6]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000349
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.000000000000023
alpha = 2.984267704728043.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000209
alpha = 2.984267704728043, SQUAREM iteraton [7]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000047
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 0.9999999999999808
alpha = 5.553817443247494.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1164
after normalization total means = 1.0000000000000149
alpha = 5.553817443247494, SQUAREM iteraton [8]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000313
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.000000000000005
alpha = 1.877562112521852.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000286
alpha = 1.877562112521852, SQUAREM iteraton [9]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.00000000000006
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000346
alpha = 6.203417246711036.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000357
alpha = 6.203417246711036, SQUAREM iteraton [10]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000038
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000522
alpha = 6.7477468721848926.
 Performing a stabilization step.
num of multi reads = 221961
total multi counts = 205773
total multi counts = 205773
total means = 1165
after normalization total means = 1.0000000000000324
alpha = 6.7477468721848926, SQUAREM iteraton [11]
1/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.0000000000000082
2/3
num of multi reads = 221961
total multi counts = 205774
total multi counts = 205774
total means = 1165
after normalization total means = 1.000000000000012
r2Nome = OPT_TOL
converge at iteration 11
num of multi reads = 221961
total multi counts = 205774
TE counts total 2775152.0000002505
Gene counts total 0.0

In library /home/ylian/testTEdata/SRR8703717.sam:
Total annotated reads = 2775152
Total non-uniquely mapped reads = 2333935
Total unannotated reads = 8389189

uniq te counts = 2067930
.......start iterative optimization ..........
multi-reads = 194006 total means = 649
after normalization total means0 = 1.0000000000000469
SQUAREM iteraton [1]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 882
after normalization total means = 1.0000000000000062
2/3
num of multi reads = 194006
total multi counts = 183626
total multi counts = 183626
total means = 912
after normalization total means = 1.0000000000000098
alpha = 1.0, SQUAREM iteraton [2]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 921
after normalization total means = 1.0000000000000024
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 925
after normalization total means = 0.999999999999999
alpha = 1.7803855466549618.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183626
total multi counts = 183626
total means = 928
after normalization total means = 1.0000000000000189
alpha = 1.7803855466549618, SQUAREM iteraton [3]
1/3
num of multi reads = 194006
total multi counts = 183626
total multi counts = 183626
total means = 929
after normalization total means = 0.999999999999987
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 929
after normalization total means = 1.0000000000000258
alpha = 3.6910329999173674.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 1.0000000000000107
alpha = 3.6910329999173674, SQUAREM iteraton [4]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 1.0000000000000067
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 0.9999999999999822
alpha = 4.0.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 1.0000000000000129
alpha = 4.0, SQUAREM iteraton [5]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 1.0000000000000138
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 1.0000000000000029
alpha = 4.887204633029251.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.0000000000000178
alpha = 4.887204633029251, SQUAREM iteraton [6]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.000000000000005
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.000000000000005
alpha = 5.755273562883458.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 0.9999999999999959
alpha = 5.755273562883458, SQUAREM iteraton [7]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 0.9999999999999997
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.0000000000000178
alpha = 4.558690708184777.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 0.9999999999999949
alpha = 4.558690708184777, SQUAREM iteraton [8]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.000000000000015
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 0.9999999999999808
alpha = 2.771504813482115.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 0.9999999999999941
alpha = 2.771504813482115, SQUAREM iteraton [9]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 0.999999999999986
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.0000000000000016
alpha = 6.560679303952728.
 Performing a stabilization step.
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 930
after normalization total means = 1.000000000000012
alpha = 6.560679303952728, SQUAREM iteraton [10]
1/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.000000000000011
2/3
num of multi reads = 194006
total multi counts = 183625
total multi counts = 183625
total means = 931
after normalization total means = 1.0000000000000329
r2Nome = OPT_TOL
converge at iteration 10
num of multi reads = 194006
total multi counts = 183625
TE counts total 2251556.9999999017
Gene counts total 0.0

In library /home/ylian/testTEdata/SRR8703700.sam:
Total annotated reads = 2251556
Total non-uniquely mapped reads = 3583563
Total unannotated reads = 11251199

INFO  @ Sat, 16 Dec 2023 08:55:35: Finished processing sample files
INFO  @ Sat, 16 Dec 2023 08:55:35: Generating counts table
INFO  @ Sat, 16 Dec 2023 08:55:35: Performing differential analysis ...

estimating size factors
estimating dispersions
Error in checkForExperimentalReplicates(object, modelMatrix) :

  The design matrix has the same number of samples and coefficients to fit,
  so estimation of dispersion is not possible. Treating samples
  as replicates was deprecated in v1.20 and no longer supported since v1.22.

Calls: DESeq ... estimateDispersions -> .local -> checkForExperimentalReplicates
Execution halted
INFO  @ Sat, 16 Dec 2023 08:55:58: Done

Thank you so much again for your help!!!

olivertam commented 6 months ago

Hi,

Can you confirm that it's not the other BAM file (SRR8703700.bam) that has the issue. Again the quickcheck command would be useful. Also, I noted that you are getting no gene annotationin the log file (Gene counts total 0.0). As you mentioned, the gene GTF file is using the NC_... nomenclature, but I suspect that your FASTA (and thus your BAM) is using the chr... nomenclature (since that's the one that is used in the TE GTF that you have). You will have to either find or convert the gene GTF to use the chr... nomenclature, or else you will not get any genes quantified. You can confirm that your BAM file is using the chr... nomenclature by checking the BAM header and looking for the @SQ tag:

$ samtools view -H [BAM] | grep "@SQ"

Thanks.

yahanlian commented 6 months ago

Hi, Thank you so much for your help! I have checked my other BAM files using quickcheck, they also did not report this error. I also reran my BAM and SAM data file using the gene GTF( chr .... nomenclature), the result of BAM file still reported this error:

INFO  @ Sat, 16 Dec 2023 11:07:35:
Reading sample files ...

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
[E::bgzf_read] bgzf_read_block error -1 after 0 of 4 bytes
2000000 alignments processed.
8000000 alignments processed.
9000000 alignments processed.
3000000 alignments processed.
Error: truncated file
[Exception type: OSError, raised in calignmentfile.pyx:1642]

And my SAM file finally got this result:

converge at iteration 10
num of multi reads = 194006
total multi counts = 183625
TE counts total 1959873.999999762
Gene counts total 5523685.19999954

In library /home/ylian/testTEdata/SRR8703700.sam:
Total annotated reads = 7483559
Total non-uniquely mapped reads = 3583563
Total unannotated reads = 3761193

INFO  @ Sat, 16 Dec 2023 12:00:29: Finished processing sample files
INFO  @ Sat, 16 Dec 2023 12:00:29: Generating counts table
INFO  @ Sat, 16 Dec 2023 12:00:29: Performing differential analysis ...

This result which could be thought that I complete the whole process? Thanks for your time!!!

olivertam commented 6 months ago

Hi,

It is very strange that you didn't have any errors in either BAM files, yet pysam (which is the python library we use to read files in) is reporting an error. It is also very weird that you were able to run TEcount on the BAM files without issue. However, since you were able to run with the SAM file to completion, I think you should have all the result files. If you want us to troubleshoot the BAM file further, feel free to provide them to us and we can see if we could replicate your error.

Thanks.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days