Error in building gene/TE index on linux machine , installed with pip

aleighbrown commented 4 years ago

TEcount -b /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned/6_27_4_18.Aligned.sorted.out.bam --sortByPos --GTF /SAN/vyplab/vyplab_reference_genomes/annotation/human/GRCh38/gencode.v31.no_chr.annotation.gtf --TE  /SAN/vyplab/vyplab_reference_genomes/annotation/human/transposable_elements/GRCh38_rmsk_TE.gtf.gz --project ibm_2 --verbose 1
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
Error in building gene/TE index

head /SAN/vyplab/vyplab_reference_genomes/annotation/human/GRCh38/gencode.v31.no_chr.annotation.gtf 
##description: evidence-based annotation of the human genome (GRCh38), version 31 (Ensembl 97)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-06-27
1   HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
1   HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
1   HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
1   HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
1   HAVANA  exon    13221   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc

zcat /SAN/vyplab/vyplab_reference_genomes/annotation/human/transposable_elements/GRCh38_rmsk_TE.gtf.gz | head
1   hg38_rmsk   exon    100000001   100000637   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup229"; family_id "L1"; class_id "LINE";
1   hg38_rmsk   exon    10000002    10000239    1760    +   .   gene_id "AluSx3"; transcript_id "AluSx3_dup157"; family_id "Alu"; class_id "SINE";
1   hg38_rmsk   exon    100000744   100002612   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup230"; family_id "L1"; class_id "LINE";

samtools idxstats /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned/6_27_4_18.Aligned.sorted.out.bam | head
1   248956422   7263910 0
10  133797422   1699366 0
11  135086622   2447900 0
12  133275309   3601090 0
13  114364328   622140  0
14  107043718   2267774 0

olivertam commented 4 years ago

Hi,

It looks like it's having an issue with the GENCODE GTF. I have a couple of questions if you don't mind me asking:

Would you be able to provide the header for your BAM file? (samtools view -H 6_27_4_18.Aligned.sorted.out.bam
How did you process the GENCODE GTF file to remove the "chr" in the chromosome name?
Did you install the TEtranscripts package using pip? If you had installed the TEToolkit package, please note that this has been renamed to TEtranscripts, and the older name will no longer receive updates.

Thanks.

aleighbrown commented 4 years ago

1.

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:248956422
@SQ SN:10   LN:133797422
@SQ SN:11   LN:135086622
@SQ SN:12   LN:133275309
@SQ SN:13   LN:114364328
@SQ SN:14   LN:107043718
@SQ SN:15   LN:101991189
@SQ SN:16   LN:90338345
@SQ SN:17   LN:83257441
@SQ SN:18   LN:80373285
@SQ SN:19   LN:58617616
@SQ SN:2    LN:242193529
@SQ SN:20   LN:64444167
@SQ SN:21   LN:46709983
@SQ SN:22   LN:50818468
@SQ SN:3    LN:198295559
@SQ SN:4    LN:190214555
@SQ SN:5    LN:181538259
@SQ SN:6    LN:170805979
@SQ SN:7    LN:159345973
@SQ SN:8    LN:145138636
@SQ SN:9    LN:138394717
@SQ SN:MT   LN:16569
@SQ SN:X    LN:156040895
@SQ SN:Y    LN:57227415
@SQ SN:KI270728.1   LN:1872759
@SQ SN:KI270727.1   LN:448248
@SQ SN:KI270442.1   LN:392061
@SQ SN:KI270729.1   LN:280839
@SQ SN:GL000225.1   LN:211173
@SQ SN:KI270743.1   LN:210658
@SQ SN:GL000008.2   LN:209709
@SQ SN:GL000009.2   LN:201709
@SQ SN:KI270747.1   LN:198735
@SQ SN:KI270722.1   LN:194050
@SQ SN:GL000194.1   LN:191469
@SQ SN:KI270742.1   LN:186739
@SQ SN:GL000205.2   LN:185591
@SQ SN:GL000195.1   LN:182896
@SQ SN:KI270736.1   LN:181920
@SQ SN:KI270733.1   LN:179772
@SQ SN:GL000224.1   LN:179693
@SQ SN:GL000219.1   LN:179198
@SQ SN:KI270719.1   LN:176845
@SQ SN:GL000216.2   LN:176608
@SQ SN:KI270712.1   LN:176043
@SQ SN:KI270706.1   LN:175055
@SQ SN:KI270725.1   LN:172810
@SQ SN:KI270744.1   LN:168472
@SQ SN:KI270734.1   LN:165050
@SQ SN:GL000213.1   LN:164239
@SQ SN:GL000220.1   LN:161802
@SQ SN:KI270715.1   LN:161471
@SQ SN:GL000218.1   LN:161147
@SQ SN:KI270749.1   LN:158759
@SQ SN:KI270741.1   LN:157432
@SQ SN:GL000221.1   LN:155397
@SQ SN:KI270716.1   LN:153799
@SQ SN:KI270731.1   LN:150754
@SQ SN:KI270751.1   LN:150742
@SQ SN:KI270750.1   LN:148850
@SQ SN:KI270519.1   LN:138126
@SQ SN:GL000214.1   LN:137718
@SQ SN:KI270708.1   LN:127682
@SQ SN:KI270730.1   LN:112551
@SQ SN:KI270438.1   LN:112505
@SQ SN:KI270737.1   LN:103838
@SQ SN:KI270721.1   LN:100316
@SQ SN:KI270738.1   LN:99375
@SQ SN:KI270748.1   LN:93321
@SQ SN:KI270435.1   LN:92983
@SQ SN:GL000208.1   LN:92689
@SQ SN:KI270538.1   LN:91309
@SQ SN:KI270756.1   LN:79590
@SQ SN:KI270739.1   LN:73985
@SQ SN:KI270757.1   LN:71251
@SQ SN:KI270709.1   LN:66860
@SQ SN:KI270746.1   LN:66486
@SQ SN:KI270753.1   LN:62944
@SQ SN:KI270589.1   LN:44474
@SQ SN:KI270726.1   LN:43739
@SQ SN:KI270735.1   LN:42811
@SQ SN:KI270711.1   LN:42210
@SQ SN:KI270745.1   LN:41891
@SQ SN:KI270714.1   LN:41717
@SQ SN:KI270732.1   LN:41543
@SQ SN:KI270713.1   LN:40745
@SQ SN:KI270754.1   LN:40191
@SQ SN:KI270710.1   LN:40176
@SQ SN:KI270717.1   LN:40062
@SQ SN:KI270724.1   LN:39555
@SQ SN:KI270720.1   LN:39050
@SQ SN:KI270723.1   LN:38115
@SQ SN:KI270718.1   LN:38054
@SQ SN:KI270317.1   LN:37690
@SQ SN:KI270740.1   LN:37240
@SQ SN:KI270755.1   LN:36723
@SQ SN:KI270707.1   LN:32032
@SQ SN:KI270579.1   LN:31033
@SQ SN:KI270752.1   LN:27745
@SQ SN:KI270512.1   LN:22689
@SQ SN:KI270322.1   LN:21476
@SQ SN:GL000226.1   LN:15008
@SQ SN:KI270311.1   LN:12399
@SQ SN:KI270366.1   LN:8320
@SQ SN:KI270511.1   LN:8127
@SQ SN:KI270448.1   LN:7992
@SQ SN:KI270521.1   LN:7642
@SQ SN:KI270581.1   LN:7046
@SQ SN:KI270582.1   LN:6504
@SQ SN:KI270515.1   LN:6361
@SQ SN:KI270588.1   LN:6158
@SQ SN:KI270591.1   LN:5796
@SQ SN:KI270522.1   LN:5674
@SQ SN:KI270507.1   LN:5353
@SQ SN:KI270590.1   LN:4685
@SQ SN:KI270584.1   LN:4513
@SQ SN:KI270320.1   LN:4416
@SQ SN:KI270382.1   LN:4215
@SQ SN:KI270468.1   LN:4055
@SQ SN:KI270467.1   LN:3920
@SQ SN:KI270362.1   LN:3530
@SQ SN:KI270517.1   LN:3253
@SQ SN:KI270593.1   LN:3041
@SQ SN:KI270528.1   LN:2983
@SQ SN:KI270587.1   LN:2969
@SQ SN:KI270364.1   LN:2855
@SQ SN:KI270371.1   LN:2805
@SQ SN:KI270333.1   LN:2699
@SQ SN:KI270374.1   LN:2656
@SQ SN:KI270411.1   LN:2646
@SQ SN:KI270414.1   LN:2489
@SQ SN:KI270510.1   LN:2415
@SQ SN:KI270390.1   LN:2387
@SQ SN:KI270375.1   LN:2378
@SQ SN:KI270420.1   LN:2321
@SQ SN:KI270509.1   LN:2318
@SQ SN:KI270315.1   LN:2276
@SQ SN:KI270302.1   LN:2274
@SQ SN:KI270518.1   LN:2186
@SQ SN:KI270530.1   LN:2168
@SQ SN:KI270304.1   LN:2165
@SQ SN:KI270418.1   LN:2145
@SQ SN:KI270424.1   LN:2140
@SQ SN:KI270417.1   LN:2043
@SQ SN:KI270508.1   LN:1951
@SQ SN:KI270303.1   LN:1942
@SQ SN:KI270381.1   LN:1930
@SQ SN:KI270529.1   LN:1899
@SQ SN:KI270425.1   LN:1884
@SQ SN:KI270396.1   LN:1880
@SQ SN:KI270363.1   LN:1803
@SQ SN:KI270386.1   LN:1788
@SQ SN:KI270465.1   LN:1774
@SQ SN:KI270383.1   LN:1750
@SQ SN:KI270384.1   LN:1658
@SQ SN:KI270330.1   LN:1652
@SQ SN:KI270372.1   LN:1650
@SQ SN:KI270548.1   LN:1599
@SQ SN:KI270580.1   LN:1553
@SQ SN:KI270387.1   LN:1537
@SQ SN:KI270391.1   LN:1484
@SQ SN:KI270305.1   LN:1472
@SQ SN:KI270373.1   LN:1451
@SQ SN:KI270422.1   LN:1445
@SQ SN:KI270316.1   LN:1444
@SQ SN:KI270340.1   LN:1428
@SQ SN:KI270338.1   LN:1428
@SQ SN:KI270583.1   LN:1400
@SQ SN:KI270334.1   LN:1368
@SQ SN:KI270429.1   LN:1361
@SQ SN:KI270393.1   LN:1308
@SQ SN:KI270516.1   LN:1300
@SQ SN:KI270389.1   LN:1298
@SQ SN:KI270466.1   LN:1233
@SQ SN:KI270388.1   LN:1216
@SQ SN:KI270544.1   LN:1202
@SQ SN:KI270310.1   LN:1201
@SQ SN:KI270412.1   LN:1179
@SQ SN:KI270395.1   LN:1143
@SQ SN:KI270376.1   LN:1136
@SQ SN:KI270337.1   LN:1121
@SQ SN:KI270335.1   LN:1048
@SQ SN:KI270378.1   LN:1048
@SQ SN:KI270379.1   LN:1045
@SQ SN:KI270329.1   LN:1040
@SQ SN:KI270419.1   LN:1029
@SQ SN:KI270336.1   LN:1026
@SQ SN:KI270312.1   LN:998
@SQ SN:KI270539.1   LN:993
@SQ SN:KI270385.1   LN:990
@SQ SN:KI270423.1   LN:981
@SQ SN:KI270392.1   LN:971
@SQ SN:KI270394.1   LN:970
@PG ID:STAR PN:STAR VN:2.7.0f   CL:/home/annbrown/STAR/bin/Linux_x86_64_static/STAR   --runThreadN 4   --genomeDir /SAN/vyplab/vyplab_reference_genomes/STAR/human/GrCh38.96/star_indices_overhang149   --readFilesIn /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_1.merged.fastq.gz   /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_2.merged.fastq.gz      --readFilesCommand zcat      --outFileNamePrefix /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18.   --outTmpDir /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18_tmpdir   --outSAMtype BAM   Unsorted      --outSAMattributes MD   NH      --outSAMunmapped Within      --alignEndsType EndToEnd   --quantMode GeneCounts      --twopassMode Basic
@CO user command line: /home/annbrown/STAR/bin/Linux_x86_64_static/STAR --genomeDir /SAN/vyplab/vyplab_reference_genomes/STAR/human/GrCh38.96/star_indices_overhang149 --readFilesIn /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_1.merged.fastq.gz /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/6_27_4_18_2.merged.fastq.gz --outFileNamePrefix /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18. --readFilesCommand zcat --runThreadN 4 --outSAMattributes MD NH --quantMode GeneCounts --outSAMtype BAM Unsorted --outSAMunmapped Within --twopassMode Basic --alignEndsType EndToEnd --outTmpDir /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned_newstar/6_27_4_18_tmpdir

aleighbrown commented 4 years ago

With sed, I don't remember the exact command, something lke sed -e 's/chr//'
Installed with pip.
Also ran this by downloading the files locally and running on a local install. That seemed to run fine but gave no counts for the TE

header

gene/TE /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam
"ENSG00000000003.14"    0
"ENSG00000000005.6" 1
"ENSG00000000419.12"    4
"ENSG00000000457.14"    26
"ENSG00000000460.17"    62
"ENSG00000000938.13"    0
"ENSG00000000971.15"    0
"ENSG00000001036.13"    8

TEcount -b /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam \
> --sortByPos \
> --GTF /Users/annaleigh/Documents/data/italian_muscle/testing_te/gencode.v31.no_chr.annotation.gtf \
> --TE /Users/annaleigh/Documents/data/italian_muscle/testing_te/GRCh38_rmsk_TE.gtf.gz \
> --project ibm_2
INFO  @ Tue, 05 Nov 2019 12:30:53: 
# ARGUMENTS LIST:
# name = ibm_2
# BAM file = /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam
# GTF file = /Users/annaleigh/Documents/data/italian_muscle/testing_te/gencode.v31.no_chr.annotation.gtf 
# TE file = /Users/annaleigh/Documents/data/italian_muscle/testing_te/GRCh38_rmsk_TE.gtf.gz 
# multi-mapper mode = multi 
# stranded = yes 
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Tue, 05 Nov 2019 12:30:53: Processing GTF files ...

INFO  @ Tue, 05 Nov 2019 12:30:53: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
INFO  @ Tue, 05 Nov 2019 12:41:09: Done building gene index ......

sort: mbrtowc error: Illegal byte sequence
INFO  @ Tue, 05 Nov 2019 12:41:09: 
Building TE index .......

INFO  @ Tue, 05 Nov 2019 12:41:09: Done building TE index ......

INFO  @ Tue, 05 Nov 2019 12:41:09: 
Reading sample file ...

uniq te counts = 0 
.......start iterative optimization ..........
TE counts total 0
Gene counts total 2637229.74841

In library /Users/annaleigh/Documents/data/italian_muscle/testing_te/6_27_4_18.Aligned.sorted.out.bam:
Total annotated reads = 2637229.74841
Total non-uniquely mapped reads = 7357402
Total unannotated reads = 29819205

aleighbrown commented 4 years ago

I'm not interested in gene counts either, is there way to only get TE counts?

olivertam commented 4 years ago

Hi,

I think I might have found a potential reason for the failure. It looks like the TE GTF file is still compressed (gzip-ped). You will need to decompress it before using it for TEcount. Sorry, I should have looked more closely at the command line. Hopefully that should fix the problem.

Thanks for your interest in the software, and let us know if you encounter any more issues.

aleighbrown commented 4 years ago

Whoops.... >.<

Had just sort of assumed it would be able to read a gzipped file...

On Tue, Nov 5, 2019, 5:37 PM Oliver Tam notifications@github.com wrote:

Hi,

I think I might have found a potential reason for the failure. It looks like the TE GTF file is still compressed (gzip-ped). You will need to decompress it before using it for TEcount. Sorry, I should have looked more closely at the command line. Hopefully that should fix the problem.

Thanks for your interest in the software, and let us know if you encounter any more issues.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/TEtranscripts/issues/52?email_source=notifications&email_token=ABWUNNHHYV2ME62HY7CVFGLQSGVNVA5CNFSM4JJB5KBKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDDT6XY#issuecomment-549928799, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWUNNFLMRVMXHCAWVPYTA3QSGVNVANCNFSM4JJB5KBA .

olivertam commented 4 years ago

That might not be a bad functionality to include, but for now, the GTF files are processed as plain text. Please let us know if it still doesn't work. Thanks.

aleighbrown commented 4 years ago

Thank you, seems to be running alright for now!

mhammell-laboratory / TEtranscripts

Error in building gene/TE index on linux machine , installed with pip #52