Closed 77CXiao closed 3 days ago
I install by this codes:
$ git clone https://github.com/mhammell-laboratory/TEtranscripts
$ cd ~/software/TEtranscripts
$ python setup.py install --user
Hi,
Thank you for your interest in the software.
Did you open the GTF file in Excel or something along that line? I noticed that the score values now have decimal places. While I don't think this should be an issue on its own, I do worry that somewhere further down, the genomic positions were converted to scientific notations, which has been known to cause errors (see bottom of #167).
Also, I noticed that the family_id
and class_id
fields are missing, but I can't tell if it's because you didn't paste the full output. Just something else to check.
Please feel free to send the current version of your TE GTF to me, or try re-downloading the TE GTF again. Let me know if neither options work.
Thanks.
Thank you for your prompt reply. I have re-downloaded the hg38_rmsk_TE_trans.gtf and checked to ensure that the fourth and fifth columns are in numeric format. However, I am still encountering an error. Could you please take another look for me?
Thank you very much.
$ head hg38_rmsk_TE_trans.gtf:
chr1 UCSC_rmsk exon 67108754 67109046 1892 + . gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE"; gene_name "L1P5:TE";
chr1 UCSC_rmsk exon 8388316 8388618 2582 - . gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE"; gene_name "AluY:TE";
chr1 UCSC_rmsk exon 25165804 25166380 4085 + . gene_id "L1MB5"; transcript_id "L1MB5"; family_id "L1"; class_id "LINE"; gene_name "L1MB5:TE";
chr1 UCSC_rmsk exon 33554186 33554483 2285 - . gene_id "AluSc"; transcript_id "AluSc"; family_id "Alu"; class_id "SINE"; gene_name "AluSc:TE";
when I run grep -e "e+" hg38_rmsk_TE_trans.gtf
no output,so I dont kown what's made this error.
(base) chencx@sunlab:~/project/TE/test/DETE$ /home/chencx/software/TEtranscripts/bin/TEtranscripts \
> --mode multi \
> --format BAM \
> -t ../bam/TSC_KO_rep1.bam ../bam/TSC_KO_rep2.bam \
> -c ../bam/TSC_WT_rep1.bam ../bam/TSC_WT_rep2.bam \
> --GTF ~/software/reference/GRCh38/GRCh38_gene.gtf \
> --TE ~/software/reference/GRCh38/hg38_rmsk_TE_trans.gtf \
> --foldchange 1 --sortByPos \
> --project TSC
INFO @ Thu, 27 Jun 2024 08:48:41:
# ARGUMENTS LIST:
# name = TSC
# treatment files = ['../bam/TSC_KO_rep1.bam', '../bam/TSC_KO_rep2.bam']
# control files = ['../bam/TSC_WT_rep1.bam', '../bam/TSC_WT_rep2.bam']
# GTF file = /home/chencx/software/reference/GRCh38/GRCh38_gene.gtf
# TE file = /home/chencx/software/reference/GRCh38/hg38_rmsk_TE_trans.gtf
# multi-mapper mode = multi
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff = 1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False
INFO @ Thu, 27 Jun 2024 08:48:41: Processing GTF files ...
INFO @ Thu, 27 Jun 2024 08:48:41: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
INFO @ Thu, 27 Jun 2024 09:06:56: Done building gene index ......
INFO @ Thu, 27 Jun 2024 09:14:17: Building TE index .......
Error in building TE index
(The GTF file are too big 95MB, so I can't upload.)
$ shuf -n 500000 hg38_rmsk_TE_trans.gtf > hg38_rmsk_TE_sample.gtf
I randomly selected some and uploaded them to the attachment. hg38_rmsk_TE_sample.gtf.gz
when I open hg38_rmsk_TE_trans.gtf by Excel, I found that the first column contains a lot of uncommon chromosome names, such as chr16_GL383556v1_alt, chr1_KZ559100v1_fix, chr1_MU273330v1_alt
, I used this code awk '$1 ~ /_alt/' hg38_rmsk_TE_trans.gtf | wc -l
The proportion of this type of computation is about 4%. I don't know if that's the cause of the error, do you have any suggestions? Thanks
Hi,
I am unable to replicate your issue with either the hg38 rmsk GTF or your sample GTF.
Could I confirm that other than gunzip
, you didn't make any modifications to the TE GTF file? This includes opening it in Excel (which I am surprised that you managed, as there are more lines than what Excel could handle).
Regarding the _alt
and _fix
entries, they are the alternate haplotypes and patches. Depending on what you're using from GENCODE (e.g. are you using the reference, primary assembly or all chromosomes?) you may or may not have those.
These usually don't cause issues with TEtranscripts as if they are not present in your alignment file, they are ignored.
However, since you're using GENCODE, it might not be a bad idea to download the GRCh38_GENCODE_rmsk_TE.gtf instead, as that would ensure that if you're working on the primary assembly, those annotations will be quantified.
Apologies if this is not helpful.
Thank you for your prompt reply.
input_file = 'hg38_rmsk_TE.gtf' output_file = 'hg38_rmsk_TE_trans.gtf' with open(input_file, 'r') as infile, open(output_file, 'w') as outfile: for line in infile: if line.startswith('#'): outfile.write(line) continue fields = line.strip().split('\t') fields[3] = str(int(fields[3])) fields[4] = str(int(fields[4])) outfile.write('\t'.join(fields) + '\n')
Unfortunately, I used the file you uploaded, decompressed it without making any modifications, but still received the same error.
(base) chencx@sunlab:~/project/TE/test/DETE$ /home/chencx/software/TEtranscripts/bin/TEtranscripts \
> --mode multi \
> --format BAM \
> -t ../bam/TSC_KO_rep1.bam ../bam/TSC_KO_rep2.bam \
> -c ../bam/TSC_WT_rep1.bam ../bam/TSC_WT_rep2.bam \
> --GTF ~/software/reference/GRCh38/GRCh38_gene.gtf \
> --TE ~/software/reference/GRCh38/GRCh38_GENCODE_rmsk_TE.gtf \
> --foldchange 1 --sortByPos \
> --project TSC
INFO @ Thu, 27 Jun 2024 11:09:02:
# ARGUMENTS LIST:
# name = TSC
# treatment files = ['../bam/TSC_KO_rep1.bam', '../bam/TSC_KO_rep2.bam']
# control files = ['../bam/TSC_WT_rep1.bam', '../bam/TSC_WT_rep2.bam']
# GTF file = /home/chencx/software/reference/GRCh38/GRCh38_gene.gtf
# TE file = /home/chencx/software/reference/GRCh38/GRCh38_GENCODE_rmsk_TE.gtf
# multi-mapper mode = multi
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff = 1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False
INFO @ Thu, 27 Jun 2024 11:09:02: Processing GTF files ...
INFO @ Thu, 27 Jun 2024 11:09:02: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
INFO @ Thu, 27 Jun 2024 11:25:21: Done building gene index ......
INFO @ Thu, 27 Jun 2024 11:25:26: Building TE index .......
Error in building TE index
(base) chencx@sunlab:~/project/TE/test/DETE$
I solved the issue by creating a virtual environment. I think it might have been caused by having too many different software versions under my account, which could have resulted in some bugs.
Thank you for your suggestion. If anyone continue to encounter strange errors despite ensuring file correctness, I suggest creating a virtual environment to see if that helps. Hope everything goes smoothly.
cd ~/software/tools
conda create -n TE_Transcripts python=3.10
conda activate TE_Transcripts
git clone https://github.com/mhammell-laboratory/TEtranscripts
cd ./TEtranscripts
python setup.py install --user
~/software/tools/TEtranscripts/bin/TEtranscripts --version
~/software/tools/TEtranscripts/bin/TEtranscripts -h
Best wish~
Hi,
Thank you for your suggestion. That is a certainly likely scenario if there are conflicting software versions either with TEtranscripts or its dependencies.
Thanks.
Hi, I read all the error related to reply, but I still couldn't solve the problem, want to have a look at the trouble you for help.
GRCh38_gene.gtf:
GRCh38_rmsk.gtf:
BAM file:
I run the code in Python 3.10.9; TEtranscripts 2.2.3
The error occurred during the TE index step, Can you help me? Thanks!