spundhir / PARE

PARE: a computational method to Predict Active Regulatory Elements
http://spundhir.github.io/PARE
GNU General Public License v3.0

NFR analysis for randomly distributed nfr regions failed using 100,000 regions, #2

Closed by crazyhottommy 8 years ago

crazyhottommy commented 8 years ago

Hi there,

I ran into an error when running PARE: RESULTS.TXT only has one line. It is also taking a really long time to run, ~6 hours for a 1.5 GB BAM. Any way to speed it up? EDIT: I saw a -p flag, but is there a way to specify how many CPUs the program will use? I am running PARE on a cluster.

Thanks for your help.

Ming

....
[main_samview] region "chrX:153237367-10020489" specifies an unknown reference name. Continue anyway.
Error: Only a single file was specified. Nothing to combine, exiting.
gzip: PARE-NFR/analysis/my.bam.All.nfr.gz: No such file or directory
PARE-NFR/analysis/myy.sorted.bam.All.nfr: No such file or directory
my.sorted.bedGraph is not case-sensitive sorted at line 29707282.  Please use "sort -k1,1 -k2,2n
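As an aside for readers hitting the same "not case-sensitive sorted" warning above: the message is asking for a byte-order sort of the bedGraph. A minimal sketch of that fix (demo.bedGraph is a fabricated stand-in, not one of the files from this thread; assumes GNU sort):

```shell
# Re-sort a bedGraph byte-wise by chromosome name (LC_ALL=C), then
# numerically by start coordinate, exactly as the warning requests.
# demo.bedGraph is a fabricated three-line stand-in.
printf 'chr2\t10\t20\t1.0\nchr1\t30\t40\t2.0\nchr1\t5\t9\t0.5\n' > demo.bedGraph
LC_ALL=C sort -k1,1 -k2,2n demo.bedGraph > demo.sorted.bedGraph
cat demo.sorted.bedGraph
```

Setting LC_ALL=C matters because locale-aware collation can interleave upper- and lower-case chromosome names differently from the byte order that downstream tools expect.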
Check, if all required parameters and files are provided (Wed Feb 17 19:24:06 CST 2016).. done
Determine number of input bam files (Wed Feb 17 19:24:06 CST 2016).. done
Create directory structure (Wed Feb 17 19:24:06 CST 2016).. done
Populating files based on input genome, hg19 (Wed Feb 17 19:24:06 CST 2016).. done
Determine number of bases by which to extend the 3' end of reads (Wed Feb 17 19:24:06 CST 2016).. done
Optimize the threshold for max length and min number of reads in a block group (Wed Feb 17 19:30:19 CST 2016).. done
Create index of input BAM files (Wed Feb 17 20:02:19 CST 2016).. done
Compute size factor for each replicate (Wed Feb 17 20:02:19 CST 2016).. done
Retrieve size factors to normalize the expression of reads... done
Convert input bam file into bed (Wed Feb 17 20:02:19 CST 2016).. convert input bam file into bed
done
Check, if BED files are created properly.. (Wed Feb 17 20:28:16 CST 2016)... done
Predict nucleosome free regions (NFR) for each replicate (Wed Feb 17 20:29:23 CST 2016).. done
Determine common NFR between replicates (Wed Feb 17 23:40:07 CST 2016).. done
Check if size factor files already exist (Wed Feb 17 23:40:07 CST 2016).. done
Create file containing genomic coordinates within which to randomly shuffle the NFRs (Wed Feb 17 23:40:07 CST 2016).. done
NFR analysis for randomly distributed nfr regions (Wed Feb 17 23:40:18 CST 2016).. (failed using 100,000 regions, 
Convert input bam to bigWig format to visualize in UCSC browser (Wed Feb 17 23:40:18 CST 2016).. All done. Bye
spundhir commented 8 years ago

Thanks, Ming, for bringing this up. There was a bug in the script invoked when only one replicate is provided as input (I was somehow under the impression that it had already been taken care of). I have fixed it, and PARE should work fine now.

To speed it up, I suggest using the -p parameter (in case you are not already using it). To save time, I strongly suggest running PARE on the test data first (http://servers.binf.ku.dk/pare/download/test_run/).

HTH

crazyhottommy commented 8 years ago

Thanks. I am running PARE on a cluster, so I have to reserve a number of CPUs for my job. How can I specify the number of CPUs? If I just use -p, will PARE use all the CPUs available on the node (24 CPUs)? If I only reserve 6 CPUs and the program tries to use all of them, my job will get killed.

I also specified a peak file, and a control(Input) bam file.

pare -i my.sorted.bam  -k my-K27Ac_peaks.bed -c my-Input-T.sorted.bam -m hg19 -o my-PARE-NFR
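A common pattern for matching the command above to a cluster reservation, sketched here as a hypothetical (SLURM and its SLURM_CPUS_PER_TASK variable are assumptions on my part; only the -p flag itself comes from PARE):

```shell
# Read the CPU count granted by the scheduler (fall back to 6 outside a
# SLURM job) and pass exactly that number to PARE via -p, so the job
# never tries to use more CPUs than were reserved.
NCPU=${SLURM_CPUS_PER_TASK:-6}
echo pare -i my.sorted.bam -k my-K27Ac_peaks.bed -c my-Input-T.sorted.bam -m hg19 -o my-PARE-NFR -p "$NCPU"
```

The `echo` is only there so the sketch runs without PARE installed; drop it in a real job script.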

I will use the test data set to see if it works. Thanks! Ming

spundhir commented 8 years ago

Hi Ming,

PARE now allows specifying the number of processors to use with the -p parameter.

HTH

crazyhottommy commented 8 years ago

Hi HTH,

I got some other errors, and it is hard for me to spot the cause.

Check, if all required parameters and files are provided (Thu Feb 18 11:13:35 CST 2016).. done
Determine number of input bam files (Thu Feb 18 11:13:35 CST 2016).. done
Create directory structure (Thu Feb 18 11:13:35 CST 2016).. done
Populating files based on input genome, hg19 (Thu Feb 18 11:13:35 CST 2016).. done
Determine number of bases by which to extend the 3' end of reads (Thu Feb 18 11:13:35 CST 2016).. done
Optimize the threshold for max length and min number of reads in a block group (Thu Feb 18 11:20:03 CST 2016).. done
Create index of input BAM files (Thu Feb 18 11:51:19 CST 2016).. done
Compute size factor for each replicate (Thu Feb 18 11:51:19 CST 2016).. done
Retrieve size factors to normalize the expression of reads... done
Convert input bam file into bed (Thu Feb 18 11:51:19 CST 2016).. convert input bam file into bed
done
Check, if BED files are created properly.. (Thu Feb 18 12:17:06 CST 2016)... done
Split summit file(s) into multiple smaller files (Thu Feb 18 12:18:11 CST 2016).. done
Predict nucleosome free regions (NFR) (Thu Feb 18 12:20:48 CST 2016).. my-PARE-NFR/analysis/rep0/my-K27Ac-T.sorted.bam.nfr.uniqxch
my-PARE-NFR/analysis/rep0/my-K27Ac-T.sorted.bam.nfr.uniqxchrX
......
done
Determine common NFR between replicates (Thu Feb 18 14:23:07 CST 2016).. done
Concatenate all result files into one file (Thu Feb 18 14:23:07 CST 2016).. done
Create UCSC browser file for all predicted NFRs (Thu Feb 18 14:23:07 CST 2016).. 
Error: file my-PARE-NFR/analysis/my-T.sorted.bam.All.nfr not created properly

Something didn't worked properly. Please check the log files under analysis/logs
gzip: Tmy-PARE-NFR/analysis/common/my.sorted.bam.All.nfrx*.gz: No such file or directory
my-PARE-NFR/analysis/common/my-T.sorted.bam.All.nfrx*: No such file or directory
command: blockbuster_threshold_nfr -i my-T.sorted.bam -k ,my-K27Ac_peaks.bed -o my-PARE
Determine number of input bam files (Thu Feb 18 11:20:03 CST 2016).. done
Create directory structure.. done
Populating files based on input genome (hg19)... done
Create index of input BAM files.. done
Compute size factor for each replicate.. Retrieve size factors to normalize the expression of reads... done
Compute mean length of histone peaks... done
Initialize histone peak region file based on whether input is region or summit file.. (region).. done
Determine normalized read count for histone peaks ... done
Determine normalized read count for shuffled histone peaks... done
24590
24747
Determine normalized read count for histone, TF and suffled histone peaks in each replicate
Make final output files (reads.countStat)... done
Plot distribution of normalized read counts for histone, tf and background shuffled peaks... All done. Bye

Thanks for debugging.

EDIT: I used the test data, and it finished with the expected results. Not sure why my BAM file failed.

Ming

spundhir commented 8 years ago

Hi Ming,

I suggest you reformat the input file names according to https://github.com/spundhir/PARE#input
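One low-friction way to follow that suggestion without renaming the originals is a symlink; a hypothetical sketch with invented file names (the _Rep1/_Rep2 suffix convention is the one discussed later in this thread):

```shell
# Symlink the BAM to a name ending in _Rep1 rather than renaming it.
# my.sorted.bam is a stand-in for the real input file.
touch my.sorted.bam
ln -sf my.sorted.bam my.sorted_Rep1.bam
ls my.sorted_Rep1.bam
```

The original file stays in place, so other pipelines that reference it by its old name keep working.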

crazyhottommy commented 8 years ago

Hi HTH,

After changing the file names, it indeed worked. Although I can change the file names easily, requiring it is not good practice. Could you please adapt the code to accept any file names, rather than requiring _Rep1 and _Rep2 at the end?

Besides, I found only ~2,400 NFR regions in my H3K27ac data, which I think is a relatively small number. If I call peaks with MACS14 at a p-value of 1e-8, I get ~30,000 peaks, and they look like real peaks in a browser.

What's your experience with H3K27ac data? Or how many NFRs do you find with H3K4me data?

Thanks, Ming

spundhir commented 8 years ago

Done, it should work now for any file name... Regarding the number of NFRs, it mainly depends on sequencing depth... I have seen it range from 3,000 to 15,000... The idea behind PARE is to be high on specificity in predicting active regulatory elements. High sensitivity can be achieved with already available methods, such as peaks from MACS2, or chromHMM, Segway, etc. You may use the file analysis/*nfrP, which contains all predicted NFRs, since it does not filter based on FDR.
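Since the nfrP file is plain line-per-region text, a line count gives the number of unfiltered predictions; a sketch using a fabricated stand-in for analysis/*nfrP (the real file name depends on your run):

```shell
# Fabricate a three-region stand-in for the unfiltered analysis/*nfrP
# file, then count its lines to get the number of predicted NFRs.
mkdir -p analysis
printf 'chr1\t100\t200\nchr1\t500\t650\nchr2\t40\t90\n' > analysis/demo.nfrP
wc -l < analysis/demo.nfrP    # one region per line, so this counts NFRs
```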

crazyhottommy commented 8 years ago

Great! Thanks for the effort and for sharing your experience. My files have around 20 million reads.

spundhir commented 8 years ago

Thanks to you too, for using PARE and contributing to its improvement.