mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Reading the bam files #143

Closed MehdiM98 closed 10 months ago

MehdiM98 commented 1 year ago

Hello, I am trying to run the TEtranscripts on 3 treatment and 3 control files (each is approximately 4 to 6gb). The gtf files are indexed but the scripts take forever to run. How much memory is needed to run the program on mouse genome.

Thank you

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software. We recommend at least 10Gb of memory to run most analyses, though we prefer to allow up to 20Gb as a safety buffer. How long has it been running? If you could provide the command line and perhaps the log output of the tool, we might be able to see what stage it is at.

Thanks.

MehdiM98 commented 1 year ago

Hello Oliver, Thanks for the quick response. Here is where the software gets stuck:

I am using it in an HPC

INFO  @ Thu, 29 Jun 2023 02:54:01:                                                                                              
# ARGUMENTS LIST:                                                                                                               
# name = CHD3                                                                                                                   
# treatment files = ['HT1_AlignedSorted.bam', 'HT2_AlignedSorted.bam', 'HT3_AlignedSorted.bam']                                 
# control files = ['WT1_AlignedSorted.bam', 'WT2_AlignedSorted.bam', 'WT3_AlignedSorted.bam']                                   
# GTF file = mm10.refGene.gtf                                                                                                   
# TE file = mm10_rmsk_TE.gtf                                                                                                    
# multi-mapper mode = multi                                                                                                     
# stranded = no                                                                                                                 
# differential analysis using DESeq2                                                                                            
# normalization = DESeq2_default                                                                                                
# FDR cutoff = 5.00e-02                                                                                                         
# fold-change cutoff =  1.00                                                                                                    
# read count cutoff = 1                                                                                                         
# number of iteration = 100                                                                                                     
# Alignments grouped by read ID = False                                                                                                                                                                                                                                                                                                                                                         
INFO  @ Thu, 29 Jun 2023 02:54:01: Processing GTF files ...                                                                                                                                                                                                     
INFO  @ Thu, 29 Jun 2023 02:54:01: Building gene index .......                                                                                                                                                                                                  100000 GTF lines processed.                                                                                                     
200000 GTF lines processed.                                                                                                     
300000 GTF lines processed.                                                                                                     
400000 GTF lines processed.                                                                                                     

INFO  @ Thu, 29 Jun 2023 02:55:10: Done building gene index ......                                                                                                                                                                                              
INFO  @ Thu, 29 Jun 2023 02:55:13: Building TE index .......                                                                                                                                                                                                    
INFO  @ Thu, 29 Jun 2023 02:58:43: Done building TE index ...... 

Reading sample files
olivertam commented 1 year ago

Hi,

Sorry for my slow response. It looks like it is reading in the file(s). Is it still stuck? If so, could you give me an excerpt of your BAM files (including the header), and I will see if it also hangs on our system.

Thanks, and my apologies.

Manjoura98 commented 12 months ago

Thank you for your cooperation. ill paste it here even if its binary: but also i can see a difference between my samples and the test samples shared on your lab web page. This is what I get with head -1 sample.bam

&ӚDc%03�d03��CҨH���S,7BEP��"]���E�tgA��g��=gTp�J��;��q��S
                                                         C�e:s��y�8I�<�t~r��hV��r;ɞYz2�T�,N�0�H�W�yxg�*��KDJ�q�tyi1Iz$�D��:I�$�I�/j)�F�>�R1��p��eL�(���q4:��65QB`�b޸h�Q\0�+�y�%h�i�Za!\LF

from one of your samples I get:

]*.�%nY��f������\@jZF�5U��Qt]׌RNZ�aH���%���ߒk�`��0�� ��6�g��o�n��^����,�V{�&L�7�
              /8���_ǐ$��׋�g0�!�Ťc�h��TJ�lK�&��n=�z�5$5��quض��{{�+�w[ְ=��s�>�JM�)'�V���j5������,q��0�]/�bW�X<�#ѝڢ�()

Thank you

olivertam commented 12 months ago

Hi,

Thank you. Unfortunately, the binary portion is not that informative. You can get the text output by doing the following:

$ samtools view -h sample.bam | head -n 10000 > sample_excerpt.sam

Samtools can be installed here.

Thanks.

Manjoura98 commented 12 months ago

Thank you for the response, I attach the file right here. THank you sample_excerpt.zip

olivertam commented 12 months ago

Hi,

There are a couple of things that I noticed: 1) Your chromosome names are not using the UCSC nomenclature (e.g. chr1) that is typical for the mm10/mm39 genome builds. Instead, they are using NCBI Refseq nomenclature (e.g. NC_000067.7) 2) Based on the chromosome nomenclature (e.g. NC_000067.7), the genome build corresponds to the GRCm39 build. Thus, mm10 would not be the correct genome build for the gene or TE annotation.

Thus, there are two options: 1) Realign your samples to a genome FASTA sequence that uses either the UCSC (mm39), Ensembl (GRCm39) or GENCODE nomenclature (GRCm39). You should then use the gene and TE GTF corresponding to that genome build (we provide TE GTF for mm39, GRCm39 Ensembl and GRCm39 GENCODE) 2) You can use the TE GTF that has the GRCm39 NCBI chromosome names. You will also need to use a gene annotation that uses the NCBI chromosome names, which you can check by looking at the first column of the GTF file.

Please let me know if you have any questions.

Thanks.

Manjoura98 commented 12 months ago

Thank you sir, I was doubting this too. That's very helpful, this will solve the problem. Your help is very much appreciated

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days