Issue with t3e.py script

MatteoT23 commented 1 year ago

Dear Michelle,

Thank you for developing this tool, it looks quite interesting! I'm very curious to try it out on our ChIP-seq datasets from mESCs, and I have started looking into it. Everything seems to work fine up to the calculation of the input-based background probabilities. However, the analysis seems to get stuck when I run the t3e.py script: a "sample_background.txt" file is created but remains empty, and the script has now been running for more than 2 days without displaying any error message.

Is this normal, or is there something I should optimise? I'm running the script with the example parameters (i.e., --iter 100, with samples downsampled to 20M reads; --readlen 84) on an HPC cluster with 4 nodes, 64 cores and 500 GB dedicated to the job.

Let me know if I'm missing something here, I'd be very curious to apply T3E to our data! Looking forward to your feedback.

Cheers, Matteo

michelleapaz commented 1 year ago

Dear Matteo,

Thank you for your comment and for your interest in using T3E!

Since you are working with ChIP-seq data from mouse, I would strongly recommend setting the filter parameter to 1 (in the same file where you set the number of iterations).

The reason is that the mouse genome contains regions of extremely high signal that do not affect the quantification of transposable elements. Those regions (mainly centromeric and telomeric) are rich in major satellites and simple repeats, and processing them can be very computationally expensive. It is therefore better to filter them out (see Supplementary Fig. S6 of the T3E paper for further details).

Therefore, I would recommend the following:

Check whether the Python script t3e.py is actually running, using the command "top", and how much RAM it is using.
Use the filter parameter (set it to 1); the default is 0.
Test it with only 1 iteration (it will be faster and will give you an idea of whether the tool is working for your dataset).
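
For the first check, a non-interactive alternative to "top" can be sketched as follows. It assumes a Linux system with the procps tools pgrep and ps installed; the function name report_matching is only illustrative and is not part of T3E.

```shell
# Print PID, elapsed time, resident memory (RSS, in KB) and command line for
# every process whose command line matches the given pattern.
# report_matching is a hypothetical helper name, not part of T3E.
report_matching() {
    pattern="$1"
    # pgrep -f matches against the full command line; join the PIDs with commas.
    pids=$(pgrep -f -- "$pattern" | paste -sd, -)
    if [ -z "$pids" ]; then
        echo "no process matching '$pattern' is running"
        return 1
    fi
    ps -o pid,etime,rss,args -p "$pids"
}

# Look for the T3E script; "|| true" keeps a missing process from being an error.
report_matching "t3e.py" || true
```

A growing elapsed time together with a stable or slowly rising RSS suggests the script is still working rather than having crashed.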

Best,

Michelle

MatteoT23 commented 1 year ago

Dear Michelle,

Thanks a lot for your quick reply! I'm now running the analysis as you suggested, and everything seems to work fine up to the "Annotate the high signal regions in regards of repeat sequences" step (line 184 in main.sh). Here, I get the following error:

Error: Error: stat() failed on: /MY_PATH/repeats/rmsk_hg38.bed_filtered_grouped.bed

Is it somehow failing to create this annotation? Any idea why this could be the case?

Best, Matteo

P.S.: After getting this error with my own data, I went back to the "T3E_examples" BAM files to test the run with the parameters you suggested (see below), but I get the same error.

species hg38
iterations 1
alpha 0.05
enrichment 1.0
filter 1

P.P.S.: I'm attaching a list of the packages installed in the environment I'm using for T3E. Thanks!

my_T3Eenv.txt

michelleapaz commented 1 year ago

Dear Matteo,

Thank you for your reply.

Looking at your list of parameters, I just noticed that you are using the human genome (hg38) as the reference instead of mouse (mm10).

> species hg38 => mm10
> iterations 1
> alpha 0.05
> enrichment 1.0
> filter 1

The problem seems to be the annotation's file name. Thank you for reporting it. You can create your own annotation by merging adjacent and overlapping TE copies of the same family/subfamily, or use the one I provided for mouse in T3E_examples (rmsk_mm10.bed).
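
As a rough illustration of the merging step, the sketch below collapses overlapping or directly adjacent copies of the same family on the same chromosome. The input layout (tab-separated chrom, start, end, family) is an assumption made for this example, not the documented format of the T3E annotation, and merge_te_copies is a hypothetical helper name.

```shell
# merge_te_copies BED_FILE
# Assumed input (an assumption, not a documented T3E format): tab-separated
# columns chrom, start, end, family. Output uses the same four columns.
merge_te_copies() {
    tab=$(printf '\t')
    # Sort by family, chromosome and numeric start so that copies which can be
    # merged end up on consecutive lines.
    LC_ALL=C sort -t "$tab" -k4,4 -k1,1 -k2,2n "$1" |
    awk -F '\t' 'BEGIN { OFS = "\t" }
        {
            if (NR > 1 && $4 == fam && $1 == chr && $2 + 0 <= end) {
                # Same family and chromosome, overlapping or adjacent: extend.
                if ($3 + 0 > end) end = $3 + 0
            } else {
                # Start a new merged interval after printing the previous one.
                if (NR > 1) print chr, start, end, fam
                chr = $1; start = $2; end = $3 + 0; fam = $4
            }
        }
        END { if (NR > 0) print chr, start, end, fam }' |
    # Restore coordinate order for the merged intervals.
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n
}
```

Calling merge_te_copies my_repeats.bed > my_repeats_merged.bed (both file names are placeholders) writes the merged intervals to a new file. Strand and any other extra columns are dropped in this sketch.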

I have changed the annotation's name in the source code (modifications in L180-183):

#input1="$REPEATS";
#input2="${input1##*/}";
#input3="${input2%.txt}";
ANNOTATION="$REPEATS";
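
For reference, here is a small sketch of what the commented-out expansions do, using the placeholder path from the error message. With a file name ending in .bed, the %.txt expansion removes nothing, so the .bed extension stays in the name. How main.sh went on to use these values is not shown in this thread, so that part is left out.

```shell
# Placeholder path taken from the error message; not a real location.
REPEATS="/MY_PATH/repeats/rmsk_hg38.bed"

input1="$REPEATS"
input2="${input1##*/}"   # remove the longest prefix ending in "/" (the directory)
input3="${input2%.txt}"  # remove a trailing ".txt"; a ".bed" name is left unchanged

echo "file name only:      $input2"
echo "after removing .txt: $input3"

# Updated assignment: the annotation path is used exactly as given.
ANNOTATION="$REPEATS"
echo "annotation path:     $ANNOTATION"
```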

I have tested the code for the reported issue and it worked fine. Please let me know whether the filtering works for you after this change to the annotation's name!

Best regards,

Michelle

P.S. You can check the status of t3e.py by looking at the log file it creates (log_SAMPLENAME.txt) in the same folder.
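
A minimal way to do this from the command line is sketched below. Both file paths are passed in as arguments, so nothing about T3E's file naming is assumed; check_progress is a hypothetical helper name.

```shell
# check_progress LOG_FILE OUTPUT_FILE
# Report the size of each file and show the last lines of the log.
check_progress() {
    log_file="$1"
    output_file="$2"
    for f in "$log_file" "$output_file"; do
        if [ -e "$f" ]; then
            # wc -c counts bytes; tr strips the padding some systems add.
            printf '%s: %s bytes\n' "$f" "$(wc -c < "$f" | tr -d ' ')"
        else
            printf '%s: missing\n' "$f"
        fi
    done
    if [ -f "$log_file" ]; then
        echo "last lines of the log:"
        tail -n 5 "$log_file"
    fi
    return 0
}
```

Running it a few minutes apart, for example as check_progress log_SAMPLENAME.txt sample_background.txt, shows whether the log is still being written while the output file is empty.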

michelleapaz / T3E

Issue with t3e.py script #1