loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

Motif enrichment analysis on footprint #29

Closed czfoo closed 4 years ago

czfoo commented 4 years ago

I would like to do a genome-wide search for enriched motifs after performing footprinting on ATAC-seq data. I am aware that there is a tool called HOMER that can do this; however, HOMER only accepts a bed or peaks file as input, while the output of ScoreBigwig is a bigWig file. Is it possible to convert the bigWig file to a bed file so this analysis can be performed? Thanks.

msbentsen commented 4 years ago

Hi, unfortunately there is no direct way to convert the footprint bigwigs to bed-files. If you have ATAC-seq data for two conditions, I would recommend using TOBIAS BINDetect, as this also runs a variation of motif enrichment based on footprint scores. However, if you are comfortable with other bioinformatics tools and a bit of hacking, you might also be able to extract high-scoring footprints in .bed format (a rough sketch follows below) by:

  1. Converting the bigwig-file to a .bedgraph using UCSC's bigWigToBedGraph
  2. Filtering the positions with high footprint scores per peak or genome-wide with awk/bash
  3. Converting the single-bp .bedgraph to .bed using 'bedtools merge'
  4. Possibly applying further filtering, such as a minimum width for each footprint region

However, I have not tried this, and I don't know how well it works as an input for other tools. But I hope it helps you out nonetheless!
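To make those steps a bit more concrete, here is a rough, untested sketch - the score cutoff of 1, the 10 bp minimum width and all file names are arbitrary placeholders you would need to adjust to your own data:

  # 1) footprint bigwig -> per-bp bedgraph (UCSC's bigWigToBedGraph)
  bigWigToBedGraph sample_footprints.bw sample_footprints.bedgraph

  # 2) keep only positions above some footprint-score cutoff (here: 1)
  awk '$4 > 1' sample_footprints.bedgraph > high_scoring.bedgraph

  # 3) merge neighbouring high-scoring positions into footprint regions
  bedtools merge -i high_scoring.bedgraph > footprints.bed

  # 4) optionally require a minimum footprint width (here: 10 bp)
  awk '($3 - $2) >= 10' footprints.bed > footprints_filtered.bed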

czfoo commented 4 years ago

Thanks for your reply. I have two conditions for ATAC-seq, so I'm guessing I can use BINDetect. But my understanding is that I need to know the specific transcription factors whose motifs I am interested in searching for, and that I cannot do a global search for all motifs, right?

Also, I have another technical question. I'm currently running ATACorrect with 28 cores and 1 TB of RAM (because it ran out of memory on a regular 120 GB RAM node), and while your documentation says it should take about 3 minutes to complete, mine has been running for over a day now and is still not finished. Is that normal, or is there something wrong with my input?

msbentsen commented 4 years ago

For TOBIAS BINDetect, you provide the option --motifs, which is a file of transcription factor motifs of interest - but this can also be a large collection of TFs (such as the full JASPAR database). So in that sense, the analysis does run a global search across all motifs.
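As a rough example (the motif/genome/peak file names are placeholders, and the full set of options is best checked against the TOBIAS documentation), a run against a full JASPAR collection could look something like:

  TOBIAS BINDetect --motifs JASPAR_vertebrates_pfms.jaspar \
                   --signals cond1_footprints.bw cond2_footprints.bw \
                   --genome mm10.fa --peaks merged_peaks.bed \
                   --outdir bindetect_output --cond_names cond1 cond2 --cores 8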

For your ATACorrect run... ugh, yeah, that doesn't sound right. The 3 minutes refers only to the example command; a real run on a full genome will take longer (30 minutes to a few hours). However, it should not use that much RAM! Can you give me a little more information about your run? Do you have a lot of peaks? Are you running with multiprocessing? The stdout log should also tell you at which point the tool is stuck, which would help to see what is going on.

czfoo commented 4 years ago

OK, I think I figured out what the problem is. Thanks for your help. It seems that the bed file I used as input was not the peak file, but the reads/fragments/intervals output from Genrich. I used Genrich instead of MACS because it has a mode that is optimised for ATAC-seq, while MACS is only optimised for ChIP-seq. The run was previously stuck on the "Processing input/output regions" step. I have since re-run it using the narrowPeak output file instead, and it appears to be working properly this time.

However, the RAM usage is still high (247.41 GB). The run used the mm10 mouse genome; according to the stats generated by ATACorrect, there were a total of 80,604 peaks and 439,037,812 reads in my input files. The input BAM file was quite large, at 15.54 GB. However, this time it took only 1.5 hours to complete using 28 cores, instead of still being stuck after 2 days as happened previously.

czfoo commented 4 years ago

One more question by the way. Currently, I have separate peak files for each of my experimental conditions. How would you recommend merging them into a single peak file so I can run BINDetect? Thank you.

msbentsen commented 4 years ago

With regards to the peaks, the algorithm is optimized to work on peak regions (meaning regions with significant signal in comparison to a background), so I cannot promise what the effects would be when running it with a whole-genome peak-file. I think it should generally be avoided, as footprints are also only found in open chromatin (peaks); any signal outside of peaks should therefore be treated as 0.

For merging the peak-files, I would suggest using bedtools:

  cat mypeaks1.bed mypeaks2.bed | bedtools sort -i stdin | bedtools merge -i stdin > merged_peaks.bed

This merged file should also be the peak-file that you use for ATACorrect and ScoreBigwig, to make sure that all condition bigwig-files have signal from the same regions.
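For example (file names here are just placeholders), the merged file is then reused for every condition when scoring footprints:

  TOBIAS ScoreBigwig --signal cond1_corrected.bw --regions merged_peaks.bed --output cond1_footprints.bw
  TOBIAS ScoreBigwig --signal cond2_corrected.bw --regions merged_peaks.bed --output cond2_footprints.bw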

For the RAM, I am not sure why it is that high, actually. Can you tell me a bit about the system you are on? MacOS/Linux? Also, can you see whether the RAM usage rises continuously or spikes at some point? Thank you for the report!

czfoo commented 4 years ago

Unfortunately, I don't have data on how the RAM usage varies over time, since I submitted the job to the university's computing cluster. Jobs like these can't be run on my own computer because I simply don't have enough RAM and it would crash. The cluster runs on Linux, and I submit jobs to it using the Slurm scheduler. According to the job stats, when running on 28 cores the CPU efficiency was 26.64%. I wonder if the high RAM usage is due to the fact that I have 3 biological replicates, one with a depth of 100 million reads and the other two with 50 million reads each. The data is paired-end, so that's 400 million reads in total.

msbentsen commented 4 years ago

It would make sense for the RAM usage to be high with large amounts of data, but there might also be some system-specific settings in how the cluster handles multiprocessing and memory. If you can't get it running at all, I would try setting the --split option higher (e.g. 1000+), as this decreases the computational load per job - so this might help. Unfortunately, I don't have any other "solution" right now, but I will have a look at which parts of the algorithm are RAM-critical. Thank you for making me aware of it!
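As an illustration of the --split suggestion (file names are placeholders, and 28 cores is just taken from your setup), the call could look something like:

  TOBIAS ATACorrect --bam sample.bam --genome mm10.fa --peaks merged_peaks.bed \
                    --outdir atacorrect_output --cores 28 --split 1000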

czfoo commented 4 years ago

I managed to get it to run successfully. It's just that I had to use a special high-memory node (1 TB RAM, 28 cores) instead of the regular compute nodes (120 GB RAM, 24 cores). It's a good thing that the university has high-memory nodes I can use when needed, but of course, someone at another institution without access to such hardware could run into problems.

msbentsen commented 4 years ago

I agree, that is not common hardware, so I hope to solve this in the future. I will close this issue but let me know if anything else pops up!