Closed aleighbrown closed 4 years ago
Hi,
It looks like it's having an issue with the GENCODE GTF. I have a couple of questions if you don't mind me asking:
samtools view -H 6_27_4_18.Aligned.sorted.out.bam
TEtranscripts
package using pip? If you had installed the TEToolkit
package, please note that this has been renamed to TEtranscripts
, and the older name will no longer receive updates.Thanks.
1.
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:248956422
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:2 LN:242193529
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:MT LN:16569
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415
@SQ SN:KI270728.1 LN:1872759
@SQ SN:KI270727.1 LN:448248
@SQ SN:KI270442.1 LN:392061
@SQ SN:KI270729.1 LN:280839
@SQ SN:GL000225.1 LN:211173
@SQ SN:KI270743.1 LN:210658
@SQ SN:GL000008.2 LN:209709
@SQ SN:GL000009.2 LN:201709
@SQ SN:KI270747.1 LN:198735
@SQ SN:KI270722.1 LN:194050
@SQ SN:GL000194.1 LN:191469
@SQ SN:KI270742.1 LN:186739
@SQ SN:GL000205.2 LN:185591
@SQ SN:GL000195.1 LN:182896
@SQ SN:KI270736.1 LN:181920
@SQ SN:KI270733.1 LN:179772
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000219.1 LN:179198
@SQ SN:KI270719.1 LN:176845
@SQ SN:GL000216.2 LN:176608
@SQ SN:KI270712.1 LN:176043
@SQ SN:KI270706.1 LN:175055
@SQ SN:KI270725.1 LN:172810
@SQ SN:KI270744.1 LN:168472
@SQ SN:KI270734.1 LN:165050
@SQ SN:GL000213.1 LN:164239
@SQ SN:GL000220.1 LN:161802
@SQ SN:KI270715.1 LN:161471
@SQ SN:GL000218.1 LN:161147
@SQ SN:KI270749.1 LN:158759
@SQ SN:KI270741.1 LN:157432
@SQ SN:GL000221.1 LN:155397
@SQ SN:KI270716.1 LN:153799
@SQ SN:KI270731.1 LN:150754
@SQ SN:KI270751.1 LN:150742
@SQ SN:KI270750.1 LN:148850
@SQ SN:KI270519.1 LN:138126
@SQ SN:GL000214.1 LN:137718
@SQ SN:KI270708.1 LN:127682
@SQ SN:KI270730.1 LN:112551
@SQ SN:KI270438.1 LN:112505
@SQ SN:KI270737.1 LN:103838
@SQ SN:KI270721.1 LN:100316
@SQ SN:KI270738.1 LN:99375
@SQ SN:KI270748.1 LN:93321
@SQ SN:KI270435.1 LN:92983
@SQ SN:GL000208.1 LN:92689
@SQ SN:KI270538.1 LN:91309
@SQ SN:KI270756.1 LN:79590
@SQ SN:KI270739.1 LN:73985
@SQ SN:KI270757.1 LN:71251
@SQ SN:KI270709.1 LN:66860
@SQ SN:KI270746.1 LN:66486
@SQ SN:KI270753.1 LN:62944
@SQ SN:KI270589.1 LN:44474
@SQ SN:KI270726.1 LN:43739
@SQ SN:KI270735.1 LN:42811
@SQ SN:KI270711.1 LN:42210
@SQ SN:KI270745.1 LN:41891
@SQ SN:KI270714.1 LN:41717
@SQ SN:KI270732.1 LN:41543
@SQ SN:KI270713.1 LN:40745
@SQ SN:KI270754.1 LN:40191
@SQ SN:KI270710.1 LN:40176
@SQ SN:KI270717.1 LN:40062
@SQ SN:KI270724.1 LN:39555
@SQ SN:KI270720.1 LN:39050
@SQ SN:KI270723.1 LN:38115
@SQ SN:KI270718.1 LN:38054
@SQ SN:KI270317.1 LN:37690
@SQ SN:KI270740.1 LN:37240
@SQ SN:KI270755.1 LN:36723
@SQ SN:KI270707.1 LN:32032
@SQ SN:KI270579.1 LN:31033
@SQ SN:KI270752.1 LN:27745
@SQ SN:KI270512.1 LN:22689
@SQ SN:KI270322.1 LN:21476
@SQ SN:GL000226.1 LN:15008
@SQ SN:KI270311.1 LN:12399
@SQ SN:KI270366.1 LN:8320
@SQ SN:KI270511.1 LN:8127
@SQ SN:KI270448.1 LN:7992
@SQ SN:KI270521.1 LN:7642
@SQ SN:KI270581.1 LN:7046
@SQ SN:KI270582.1 LN:6504
@SQ SN:KI270515.1 LN:6361
@SQ SN:KI270588.1 LN:6158
@SQ SN:KI270591.1 LN:5796
@SQ SN:KI270522.1 LN:5674
@SQ SN:KI270507.1 LN:5353
@SQ SN:KI270590.1 LN:4685
@SQ SN:KI270584.1 LN:4513
@SQ SN:KI270320.1 LN:4416
@SQ SN:KI270382.1 LN:4215
@SQ SN:KI270468.1 LN:4055
@SQ SN:KI270467.1 LN:3920
@SQ SN:KI270362.1 LN:3530
@SQ SN:KI270517.1 LN:3253
@SQ SN:KI270593.1 LN:3041
@SQ SN:KI270528.1 LN:2983
@SQ SN:KI270587.1 LN:2969
@SQ SN:KI270364.1 LN:2855
@SQ SN:KI270371.1 LN:2805
@SQ SN:KI270333.1 LN:2699
@SQ SN:KI270374.1 LN:2656
@SQ SN:KI270411.1 LN:2646
@SQ SN:KI270414.1 LN:2489
@SQ SN:KI270510.1 LN:2415
@SQ SN:KI270390.1 LN:2387
@SQ SN:KI270375.1 LN:2378
@SQ SN:KI270420.1 LN:2321
@SQ SN:KI270509.1 LN:2318
@SQ SN:KI270315.1 LN:2276
@SQ SN:KI270302.1 LN:2274
@SQ SN:KI270518.1 LN:2186
@SQ SN:KI270530.1 LN:2168
@SQ SN:KI270304.1 LN:2165
@SQ SN:KI270418.1 LN:2145
@SQ SN:KI270424.1 LN:2140
@SQ SN:KI270417.1 LN:2043
@SQ SN:KI270508.1 LN:1951
@SQ SN:KI270303.1 LN:1942
@SQ SN:KI270381.1 LN:1930
@SQ SN:KI270529.1 LN:1899
@SQ SN:KI270425.1 LN:1884
@SQ SN:KI270396.1 LN:1880
@SQ SN:KI270363.1 LN:1803
@SQ SN:KI270386.1 LN:1788
@SQ SN:KI270465.1 LN:1774
@SQ SN:KI270383.1 LN:1750
@SQ SN:KI270384.1 LN:1658
@SQ SN:KI270330.1 LN:1652
@SQ SN:KI270372.1 LN:1650
@SQ SN:KI270548.1 LN:1599
@SQ SN:KI270580.1 LN:1553
@SQ SN:KI270387.1 LN:1537
@SQ SN:KI270391.1 LN:1484
@SQ SN:KI270305.1 LN:1472
@SQ SN:KI270373.1 LN:1451
@SQ SN:KI270422.1 LN:1445
@SQ SN:KI270316.1 LN:1444
@SQ SN:KI270340.1 LN:1428
@SQ SN:KI270338.1 LN:1428
@SQ SN:KI270583.1 LN:1400
@SQ SN:KI270334.1 LN:1368
@SQ SN:KI270429.1 LN:1361
@SQ SN:KI270393.1 LN:1308
@SQ SN:KI270516.1 LN:1300
@SQ SN:KI270389.1 LN:1298
@SQ SN:KI270466.1 LN:1233
@SQ SN:KI270388.1 LN:1216
@SQ SN:KI270544.1 LN:1202
@SQ SN:KI270310.1 LN:1201
@SQ SN:KI270412.1 LN:1179
@SQ SN:KI270395.1 LN:1143
@SQ SN:KI270376.1 LN:1136
@SQ SN:KI270337.1 LN:1121
@SQ SN:KI270335.1 LN:1048
@SQ SN:KI270378.1 LN:1048
@SQ SN:KI270379.1 LN:1045
@SQ SN:KI270329.1 LN:1040
@SQ SN:KI270419.1 LN:1029
@SQ SN:KI270336.1 LN:1026
@SQ SN:KI270312.1 LN:998
@SQ SN:KI270539.1 LN:993
@SQ SN:KI270385.1 LN:990
@SQ SN:KI270423.1 LN:981
@SQ SN:KI270392.1 LN:971
@SQ SN:KI270394.1 LN:970
@PG ID:STAR PN:STAR VN:2.7.0f CL:/home/annbrown/STAR/bin/Linux_x86_64_static/STAR --runThreadN 4 --genomeDir /SAN/vyplab/vyplab_reference_genomes/STAR/human/GrCh38.96/star_indices_overhang149 --readFilesIn /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_1.merged.fastq.gz /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_2.merged.fastq.gz --readFilesCommand zcat --outFileNamePrefix /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18. --outTmpDir /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18_tmpdir --outSAMtype BAM Unsorted --outSAMattributes MD NH --outSAMunmapped Within --alignEndsType EndToEnd --quantMode GeneCounts --twopassMode Basic
@CO user command line: /home/annbrown/STAR/bin/Linux_x86_64_static/STAR --genomeDir /SAN/vyplab/vyplab_reference_genomes/STAR/human/GrCh38.96/star_indices_overhang149 --readFilesIn /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_1.merged.fastq.gz /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_2.merged.fastq.gz --outFileNamePrefix /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18. --readFilesCommand zcat --runThreadN 4 --outSAMattributes MD NH --quantMode GeneCounts --outSAMtype BAM Unsorted --outSAMunmapped Within --twopassMode Basic --alignEndsType EndToEnd --outTmpDir /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18_tmpdir
With sed, I don't remember the exact command, something lke sed -e 's/chr//'
Installed with pip.
Also ran this by downloading the files locally and running on a local install. That seemed to run fine but gave no counts for the TE
header
gene/TE /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam
"ENSG00000000003.14" 0
"ENSG00000000005.6" 1
"ENSG00000000419.12" 4
"ENSG00000000457.14" 26
"ENSG00000000460.17" 62
"ENSG00000000938.13" 0
"ENSG00000000971.15" 0
"ENSG00000001036.13" 8
TEcount -b /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam \
> --sortByPos \
> --GTF /Users/annaleigh/Documents/data/italian_muscle/testing_te/gencode.v31.no_chr.annotation.gtf \
> --TE /Users/annaleigh/Documents/data/italian_muscle/testing_te/GRCh38_rmsk_TE.gtf.gz \
> --project ibm_2
INFO @ Tue, 05 Nov 2019 12:30:53:
# ARGUMENTS LIST:
# name = ibm_2
# BAM file = /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam
# GTF file = /Users/annaleigh/Documents/data/italian_muscle/testing_te/gencode.v31.no_chr.annotation.gtf
# TE file = /Users/annaleigh/Documents/data/italian_muscle/testing_te/GRCh38_rmsk_TE.gtf.gz
# multi-mapper mode = multi
# stranded = yes
# number of iteration = 100
# Alignments grouped by read ID = False
INFO @ Tue, 05 Nov 2019 12:30:53: Processing GTF files ...
INFO @ Tue, 05 Nov 2019 12:30:53: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
INFO @ Tue, 05 Nov 2019 12:41:09: Done building gene index ......
sort: mbrtowc error: Illegal byte sequence
INFO @ Tue, 05 Nov 2019 12:41:09:
Building TE index .......
INFO @ Tue, 05 Nov 2019 12:41:09: Done building TE index ......
INFO @ Tue, 05 Nov 2019 12:41:09:
Reading sample file ...
uniq te counts = 0
.......start iterative optimization ..........
TE counts total 0
Gene counts total 2637229.74841
In library /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam:
Total annotated reads = 2637229.74841
Total non-uniquely mapped reads = 7357402
Total unannotated reads = 29819205
I'm not interested in gene counts either, is there way to only get TE counts?
Hi,
I think I might have found a potential reason for the failure. It looks like the TE GTF file is still compressed (gzip-ped). You will need to decompress it before using it for TEcount. Sorry, I should have looked more closely at the command line. Hopefully that should fix the problem.
Thanks for your interest in the software, and let us know if you encounter any more issues.
Whoops.... >.<
Had just sort of assumed it would be able to read a gzipped file...
On Tue, Nov 5, 2019, 5:37 PM Oliver Tam notifications@github.com wrote:
Hi,
I think I might have found a potential reason for the failure. It looks like the TE GTF file is still compressed (gzip-ped). You will need to decompress it before using it for TEcount. Sorry, I should have looked more closely at the command line. Hopefully that should fix the problem.
Thanks for your interest in the software, and let us know if you encounter any more issues.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/TEtranscripts/issues/52?email_source=notifications&email_token=ABWUNNHHYV2ME62HY7CVFGLQSGVNVA5CNFSM4JJB5KBKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDDT6XY#issuecomment-549928799, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWUNNFLMRVMXHCAWVPYTA3QSGVNVANCNFSM4JJB5KBA .
That might not be a bad functionality to include, but for now, the GTF files are processed as plain text. Please let us know if it still doesn't work. Thanks.
Thank you, seems to be running alright for now!