mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Error in building gene/TE index #192

Closed 77CXiao closed 3 days ago

77CXiao commented 5 days ago

Hi, I read all the error related to reply, but I still couldn't solve the problem, want to have a look at the trouble you for help.
GRCh38_gene.gtf:

chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

GRCh38_rmsk.gtf:

chr1    hg38_rmsk       exon    67108754        67109046        1892.000000     +       .       gene_id "L1P5"; transcript_id "L1P5"; 
chr1    hg38_rmsk       exon    8388316 8388618 2582.000000     -       .       gene_id "AluY"; transcript_id "AluY"; 
chr1    hg38_rmsk       exon    25165804        25166380        4085.000000     +       .       gene_id "L1MB5"; transcript_id "L1MB5"; 
chr1    hg38_rmsk       exon    33554186        33554483        2285.000000     -       .       gene_id "AluSc"; transcript_id "AluSc";

BAM file:

$ samtools idxstats TSC_KO_rep2.bam | head:

chr1    248956422       3479217 0
chr2    242193529       1845877 0
chr3    198295559       1567627 0
chr4    190214555       894602  0
...

$ samtools view TSC_KO_rep2.bam | head -n 10
A00877:941:HYJM3DSX2:3:1416:23411:31735 99      chr1    10001   3       32S113M5S       =       10007   231     CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAAC       FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:FFFFF,FFFFF:,FFF:,FFF,::FFFF:,:FFF:,FF,FF,:F,F:F:F::F,FF,:::FF:F:,:F:F:F,FF       NH:i:2  HI:i:1  AS:i:255        nM:i:2  NM:i:2  MD:Z:72T37C2    jM:B:c,-1       jI:B:i,-1       MC:Z:95M77N53M
A00877:941:HYJM3DSX2:3:2664:16116:15812 99      chr1    10001   0       84M     =       10007   87      TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:F:F,FF,F:FFF:,F,F,:,,,F,FFF NH:i:10 HI:i:1  AS:i:157        nM:i:3  NM:i:0  MD:Z:84 jM:B:c,-1       jI:B:i,-1       MC:Z:69S81M
A00877:941:HYJM3DSX2:3:1678:15365:26819 99      chr1    10004   255     103M    =       10023   103     CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCAAACCCTAACCCAAAC      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFF:F,FFF,F,FF:FF,FF:FF,,,,F,,FF:,:,F:: NH:i:1  HI:i:1  AS:i:179        nM:i:3  NM:i:3  MD:Z:69T17T11T3 jM:B:c,-1       jI:B:i,-1   MC:Z:66S84M

I run the code in Python 3.10.9; TEtranscripts 2.2.3

INFO  @ Wed, 26 Jun 2024 23:17:03: 
# ARGUMENTS LIST:
# name = TSC
# BAM file = ../bam/TSC_KO_rep2.bam
# GTF file = /home/chencx/software/reference/GRCh38/GRCh38_gene.gtf 
# TE file = /home/chencx/software/reference/GRCh38/hg38_rmsk.gtf 
# multi-mapper mode = multi 
# stranded = no 
# number of iteration = 100
# Alignments grouped by read ID = False
INFO  @ Wed, 26 Jun 2024 23:17:03: Processing GTF files ... 
INFO  @ Wed, 26 Jun 2024 23:17:03: Building gene index ....... 
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
INFO  @ Wed, 26 Jun 2024 23:33:21: Done building gene index ...... 
INFO  @ Wed, 26 Jun 2024 23:41:31: Building TE index ....... 
Error in building TE index

The error occurred during the TE index step, Can you help me? Thanks!

77CXiao commented 5 days ago

I install by this codes:

$ git clone https://github.com/mhammell-laboratory/TEtranscripts
$ cd ~/software/TEtranscripts
$ python setup.py install --user
olivertam commented 5 days ago

Hi,

Thank you for your interest in the software. Did you open the GTF file in Excel or something along that line? I noticed that the score values now have decimal places. While I don't think this should be an issue on its own, I do worry that somewhere further down, the genomic positions were converted to scientific notations, which has been known to cause errors (see bottom of #167). Also, I noticed that the family_id and class_id fields are missing, but I can't tell if it's because you didn't paste the full output. Just something else to check.

Please feel free to send the current version of your TE GTF to me, or try re-downloading the TE GTF again. Let me know if neither options work.

Thanks.

77CXiao commented 4 days ago

Thank you for your prompt reply. I have re-downloaded the hg38_rmsk_TE_trans.gtf and checked to ensure that the fourth and fifth columns are in numeric format. However, I am still encountering an error. Could you please take another look for me?

Thank you very much.

$ head hg38_rmsk_TE_trans.gtf:
chr1    UCSC_rmsk       exon    67108754        67109046        1892    +       .       gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE"; gene_name "L1P5:TE";
chr1    UCSC_rmsk       exon    8388316 8388618 2582    -       .       gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE"; gene_name "AluY:TE";
chr1    UCSC_rmsk       exon    25165804        25166380        4085    +       .       gene_id "L1MB5"; transcript_id "L1MB5"; family_id "L1"; class_id "LINE"; gene_name "L1MB5:TE";
chr1    UCSC_rmsk       exon    33554186        33554483        2285    -       .       gene_id "AluSc"; transcript_id "AluSc"; family_id "Alu"; class_id "SINE"; gene_name "AluSc:TE";

when I run grep -e "e+" hg38_rmsk_TE_trans.gtf no output,so I dont kown what's made this error.

(base) chencx@sunlab:~/project/TE/test/DETE$ /home/chencx/software/TEtranscripts/bin/TEtranscripts \
>               --mode multi \
>               --format BAM \
>               -t ../bam/TSC_KO_rep1.bam ../bam/TSC_KO_rep2.bam \
>               -c ../bam/TSC_WT_rep1.bam ../bam/TSC_WT_rep2.bam \
>               --GTF ~/software/reference/GRCh38/GRCh38_gene.gtf \
>               --TE ~/software/reference/GRCh38/hg38_rmsk_TE_trans.gtf \
>               --foldchange 1 --sortByPos \
>               --project TSC
INFO  @ Thu, 27 Jun 2024 08:48:41: 
# ARGUMENTS LIST:
# name = TSC
# treatment files = ['../bam/TSC_KO_rep1.bam', '../bam/TSC_KO_rep2.bam']
# control files = ['../bam/TSC_WT_rep1.bam', '../bam/TSC_WT_rep2.bam']
# GTF file = /home/chencx/software/reference/GRCh38/GRCh38_gene.gtf 
# TE file = /home/chencx/software/reference/GRCh38/hg38_rmsk_TE_trans.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Thu, 27 Jun 2024 08:48:41: Processing GTF files ...

INFO  @ Thu, 27 Jun 2024 08:48:41: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
INFO  @ Thu, 27 Jun 2024 09:06:56: Done building gene index ...... 

INFO  @ Thu, 27 Jun 2024 09:14:17: Building TE index ....... 

Error in building TE index 

(The GTF file are too big 95MB, so I can't upload.)

$ shuf -n 500000 hg38_rmsk_TE_trans.gtf > hg38_rmsk_TE_sample.gtf

I randomly selected some and uploaded them to the attachment. hg38_rmsk_TE_sample.gtf.gz

77CXiao commented 4 days ago

when I open hg38_rmsk_TE_trans.gtf by Excel, I found that the first column contains a lot of uncommon chromosome names, such as chr16_GL383556v1_alt, chr1_KZ559100v1_fix, chr1_MU273330v1_alt, I used this code awk '$1 ~ /_alt/' hg38_rmsk_TE_trans.gtf | wc -l The proportion of this type of computation is about 4%. I don't know if that's the cause of the error, do you have any suggestions? Thanks

olivertam commented 4 days ago

Hi,

I am unable to replicate your issue with either the hg38 rmsk GTF or your sample GTF. Could I confirm that other than gunzip, you didn't make any modifications to the TE GTF file? This includes opening it in Excel (which I am surprised that you managed, as there are more lines than what Excel could handle).

Regarding the _alt and _fix entries, they are the alternate haplotypes and patches. Depending on what you're using from GENCODE (e.g. are you using the reference, primary assembly or all chromosomes?) you may or may not have those. These usually don't cause issues with TEtranscripts as if they are not present in your alignment file, they are ignored. However, since you're using GENCODE, it might not be a bad idea to download the GRCh38_GENCODE_rmsk_TE.gtf instead, as that would ensure that if you're working on the primary assembly, those annotations will be quantified.

Apologies if this is not helpful.

77CXiao commented 4 days ago

Thank you for your prompt reply.

  1. I did not modify the downloaded GRCh38_gene.gtf file; to ensure that the fourth and fifth columns are numeric, I used the script below to verify their numeric format again. input_file = 'hg38_rmsk_TE.gtf' output_file = 'hg38_rmsk_TE_trans.gtf' with open(input_file, 'r') as infile, open(output_file, 'w') as outfile: for line in infile: if line.startswith('#'): outfile.write(line) continue fields = line.strip().split('\t') fields[3] = str(int(fields[3])) fields[4] = str(int(fields[4])) outfile.write('\t'.join(fields) + '\n')
  2. I used Excel to open the hg38_rmsk_TE_sample.gtf.gz file that I sent to you, not the entire file.
  3. Despite removing these lines, I still got the same error.
  4. I will try yor your GRCh38_GENCODE_rmsk_TE.gtf. Thanks
77CXiao commented 4 days ago

Unfortunately, I used the file you uploaded, decompressed it without making any modifications, but still received the same error.

(base) chencx@sunlab:~/project/TE/test/DETE$ /home/chencx/software/TEtranscripts/bin/TEtranscripts \
>               --mode multi \
>               --format BAM \
>               -t ../bam/TSC_KO_rep1.bam ../bam/TSC_KO_rep2.bam \
>               -c ../bam/TSC_WT_rep1.bam ../bam/TSC_WT_rep2.bam \
>               --GTF ~/software/reference/GRCh38/GRCh38_gene.gtf \
>               --TE ~/software/reference/GRCh38/GRCh38_GENCODE_rmsk_TE.gtf \
>               --foldchange 1 --sortByPos \
>               --project TSC
INFO  @ Thu, 27 Jun 2024 11:09:02: 
# ARGUMENTS LIST:
# name = TSC
# treatment files = ['../bam/TSC_KO_rep1.bam', '../bam/TSC_KO_rep2.bam']
# control files = ['../bam/TSC_WT_rep1.bam', '../bam/TSC_WT_rep2.bam']
# GTF file = /home/chencx/software/reference/GRCh38/GRCh38_gene.gtf 
# TE file = /home/chencx/software/reference/GRCh38/GRCh38_GENCODE_rmsk_TE.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Thu, 27 Jun 2024 11:09:02: Processing GTF files ...

INFO  @ Thu, 27 Jun 2024 11:09:02: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
INFO  @ Thu, 27 Jun 2024 11:25:21: Done building gene index ...... 

INFO  @ Thu, 27 Jun 2024 11:25:26: Building TE index ....... 

Error in building TE index 
(base) chencx@sunlab:~/project/TE/test/DETE$ 
77CXiao commented 4 days ago

I solved the issue by creating a virtual environment. I think it might have been caused by having too many different software versions under my account, which could have resulted in some bugs.

Thank you for your suggestion. If anyone continue to encounter strange errors despite ensuring file correctness, I suggest creating a virtual environment to see if that helps. Hope everything goes smoothly.

cd ~/software/tools
conda create -n TE_Transcripts python=3.10
conda activate TE_Transcripts
git clone https://github.com/mhammell-laboratory/TEtranscripts
cd ./TEtranscripts
python setup.py install --user
~/software/tools/TEtranscripts/bin/TEtranscripts --version
~/software/tools/TEtranscripts/bin/TEtranscripts -h

Best wish~

olivertam commented 4 days ago

Hi,

Thank you for your suggestion. That is a certainly likely scenario if there are conflicting software versions either with TEtranscripts or its dependencies.

Thanks.