jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License

About ATACseq mode (Large files, Warnings and p/q values) #9

Closed · POTsnake closed this issue 5 years ago

POTsnake commented 5 years ago

First of all, I appreciate your developing and releasing this very elaborate and useful peak-calling code. I am trying to use Genrich to analyze my ATAC-seq data sets on my MacBook Pro. I executed the following command:

`$ Genrich -t A -o B -j -y -r -e chrM -v`

Here A was a "samtools sort -n" BAM file (originally a CRAM file). When A was relatively small (4.45GB or 8.84GB), Genrich completed the process, though it gave many warning messages. However, when I used bigger files like 17GB, the process was terminated by "Killed: 9" after showing lots of warnings for 2.5hrs. I saw two kinds of warnings:

1) `Read xxx prevented from extending below 0 on chr yyy`
2) `Read xxx prevented from extending past ... on chr yyy`

Most of the problematic chromosomes were chrUn_'something'.
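For context, the preprocessing and the Genrich call can be sketched roughly as follows (filenames and the reference FASTA are placeholders, not from the thread; the Genrich flags match the command above):

```shell
# Placeholder filenames throughout.
# Convert CRAM to BAM (requires the reference the CRAM was made with).
samtools view -b -T ref.fa -o sample.bam sample.cram
# Genrich requires input sorted by queryname.
samtools sort -n -o sample_nsort.bam sample.bam
# -j ATAC-seq mode, -y keep unpaired alignments, -r remove PCR duplicates,
# -e chrM exclude mitochondrial reads, -v verbose
Genrich -t sample_nsort.bam -o sample_peaks.narrowPeak -j -y -r -e chrM -v
```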

Do you have any ideas for dealing with the bigger files? Should I remove the reads aligned to chrUn_'something' from the input files? Also, completed analyses with the default settings identified fewer peaks than expected. What is a reasonable range of values for the -p and -q options in ATAC-seq analyses?

Again, thank you so much for your devotion to the field, and I would be even more grateful if you could help me with these issues.

jsh58 commented 5 years ago

Thanks for the questions. Here are the answers/suggestions:

weiweivi commented 5 years ago

I had the same problem. I tried running without -r, and it worked! I wonder whether I should remove the PCR duplicates before Genrich (with Picard, maybe?). Thank you very much for the helpful package and advice! -Weiwei
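For anyone who does want to try deduplicating upstream, a rough sketch (hypothetical filenames, not a recommendation from this thread; MarkDuplicates works on coordinate-sorted input, and Genrich still needs name-sorted input afterwards):

```shell
# Hypothetical filenames; re-sorting twice because of the tools' sort-order needs.
samtools sort -o sample_csort.bam sample_nsort.bam           # coordinate-sort for Picard
java -jar picard.jar MarkDuplicates I=sample_csort.bam \
    O=sample_dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true
samtools sort -n -o sample_dedup_nsort.bam sample_dedup.bam  # back to name-sorted for Genrich
```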

jsh58 commented 5 years ago

@weiweivi,

Again, running without -r is not recommended. You can certainly try Picard, although I do not think you will get far if memory is your limitation. In my experience, MarkDuplicates uses about twice as much memory as Genrich (and it doesn't even call peaks!).

POTsnake commented 5 years ago

Hi both,

Thank you so much for your comments. I had a discussion with my colleague. For the memory problem (related to my first question), instead of removing -r I am going to try:

  1. Freeing as much memory as possible by stopping all other user applications, such as Evernote, when processing big files with Genrich. FYI, my MBP has 16GB of RAM.
  2. Splitting big files into parts before submitting them to Genrich, possibly by chromosome, then merging the results of the parts for the downstream analyses.
  3. Downsizing big files by random sampling. This should be done after checking the read number that provides enough depth.
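Option 3 can be done with `samtools view -s`, which takes a SEED.FRACTION argument. A minimal sketch, with read counts assumed purely for illustration:

```shell
# Assumed numbers for illustration; get the real total from `samtools flagstat`.
TOTAL=120000000    # reads in the big BAM
TARGET=60000000    # read count judged to give enough depth
FRAC=$(awk -v t="$TARGET" -v n="$TOTAL" 'BEGIN { printf "%.3f", t/n }')
echo "$FRAC"    # 0.500
# Seed 42 is arbitrary; -s 42.500 keeps ~50% of read pairs.
# samtools view -b -s "42.${FRAC#0.}" big.bam > subsampled.bam
```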

Do you think the sum of the results obtained by splitting and analysing the parts could differ from the result obtained by analysing the original file? Let's say that 10,000 peaks were identified on each of chr1-10 and chr11-22+XY from the original file. Is it possible to run Genrich on the split files with chr1-10 and chr11-22+XY separately, and get the same 10,000 peaks from each? Are there better ways of splitting than by chromosome?

jsh58 commented 5 years ago

The results from splitting the BAM and analyzing the parts separately will certainly be different from doing it all at once. The reason is that the background level(s) and genome length(s) will be different, even if you are very careful to modify the BAM headers for each part artificially. If your BAM has secondary alignments that span multiple chromosomes, it would be quite impossible.

I don't see any good way of splitting the BAM. I recommend finding a machine with more memory. Maybe try the cloud?

jsh58 commented 5 years ago

Just FYI, I updated the code to limit the `Read xxx prevented...` warning messages to 128 per input file. This should allow the program to run faster and not clog up the console, but it does not improve the memory issue.

POTsnake commented 5 years ago

Hi, jsh58. Thank you for your messages, and sorry for my late reply. I switched off other memory-hungry applications when running Genrich, and it could process files as large as 10GB on my 16GB-RAM MBP. Also, inputting 4 files at the same time (-t A,B,C,D) was no problem and increased the number of identified peaks.
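The multi-input invocation above looks like this (hypothetical name-sorted BAM filenames; flags as used earlier in the thread):

```shell
# Hypothetical files for one condition, passed comma-separated to -t.
Genrich -t rep1_lane1.bam,rep1_lane2.bam,rep2_lane1.bam,rep2_lane2.bam \
    -o condition1_peaks.narrowPeak -j -y -r -e chrM -v
```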

I have 4 data sets (replicate × sequencing lane) for each condition, like replica1_lane1, replica1_lane2, replica2_lane1, replica2_lane2 for condition1. Is it better to input these 4 files to Genrich together, or the replica1 and replica2 files separately? In the end, I would like to compare the different conditions.

jsh58 commented 5 years ago

Glad to hear it worked! With multiple input files, PCR duplicate removal is done separately for each, so the memory increase should be minimal.

If that concludes the memory discussion, I would appreciate it if you would close this issue, and then open a new issue for your question about replicates. Thanks!

POTsnake commented 5 years ago

Sure. I will also try to find other threads related to replicates. Thank you so much for your help.