maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io
MIT License
374 stars 85 forks source link

CSAW pipeline #946

Closed sunta3iouxos closed 5 months ago

sunta3iouxos commented 9 months ago

Hi Katarzyna, I got 2 favors to ask: A. is it possible to comment what the CSAW pipeline does exactly? When I run the CSAW on my computer with the bam files from the DNA-mapping output I could not replicate the results. There are quite a few steps that could be different. I see for example: CSAW all calc_matrix_cov_CSAW get_nearest_gene
get_nearest_transcript plot_heatmap_cov_CSAW

B. Does the final output for example reports only peaks that are associated with TSS sites? This might not be relevant for Transcription Factors or Histone Modifications that are not located at the TSS. For example H3K36me3 likes to be on the TES.

katsikora commented 9 months ago

Hi Sunta3louxos,

CSAW runs the CSAW.R and DB.R Rscripts. Briefly, genome is segmented into windows of defined size, windows overlapping the union of all peaks are further retained, read counts are collected per bam file, and window-level differential binding analysis is run. Then, a step combines adjacent windows into differentially bound regions given the associated values, and emits final "aggregated" p values and log-fold changes. FDR threshold is applied. The list of differentially bound regions is then split into "UP", "DOWN" and "MIXED" based on the aggregated log2FC.

get_nearest_transcript takes those "UP" and "DOWN" DB regions and uses bedtools to annotated each region with the nearest transcript - both ends of a transcript are treated equally, i.e. TSS and TES. get_nearest_gene annotates the above result with gene ID and gene symbol.

calc_matrix_cov_CSAW computes deepTools matrix on coverage bigwigs overlapping the "UP" and "DOWN" regions. This is unrelated to annotation with nearest transcript and gene. plot_heatmap_cov_CSAW plots the heatmap of coverage matrix calculated in the previous step, for "UP" and "DOWN" regions, respectively.

Rule "all" is just "rule all" in snakemake, once all the required targets are generated by the preceding rules, it's marked as completed.

The peaks in the report are coming from the FDR-filtered and direction-split (UP/DOWN) bed files coming out of CSAW, again irrespectively of the annotation with nearest transcript and gene.

Hope this helps, Best wishes,

Katarzyna

sunta3iouxos commented 9 months ago

I noticed that in the CSAW.R there is a sub sampling of the total using the MACS2 peaks. I could not understand if this happens prior to differential peak calling. In my opinion it would be nice to also output and the total identified differential bound sites. We could then use other peakcallers to further filter the differential bind sites. I hope this makes sense.

katsikora commented 9 months ago

If I understand correctly, this is about whitelisting random genomic intervals for those intersecting the union of MACS2 peaks. This helps reduce the number of mutliple tests in the end, such that you run the statistics only on windows that had evidence of binding in at least 1 sample. CSAW is executed after peak calling and doesn't influence MACS2 results.

For every peak caller that you choose, there will be an instance of CSAW run, taking the respective peaks as input. You can still do any intersections you wish manually.

sunta3iouxos commented 8 months ago

For every peak caller that you choose, there will be an instance of CSAW run, taking the respective peaks as input. You can still do any intersections you wish manually.

Thank you, So, you sub-sample differential peaks found by CSAW with MACS, or any other peak caller, is this correct?