jmschrei / bpnet-lite

This repository hosts a minimal version of a Python API for BPNet.
MIT License
32 stars 14 forks source link

Calling hits with FiNeMo / marginalize? #12

Open gregorydonahue opened 4 months ago

gregorydonahue commented 4 months ago

Hi Jacob,

Thanks for all your help so far - I was able to generate negatives for my ATAC-seq data using bpnet, and then ran the following successfully:

chrombpnet fit -p JSON
chrombpnet predict -p JSON
chrombpnet attribute -p JSON

This creates the expected SHAP scores and sequence encodings in attr.npz and ohe.npz, respectively. At this point, I ran TF-MoDISco on those files, successfully generating enriched seqlets / motifs and a report:

modisco motifs -s ohe.npz -a attr.npz -n 2000 -o modisco.results.h5
modisco report -i modisco.results.h5 -o report -s report

So far, so good. All that remains is to take the five enriched motifs that TF-MoDISco discovered and call hits in the original peaks. Previously, when treating chIP-seq data, I achieved that with FiNeMo. However, in that case I had passed as input to FiNeMo a set of "valid peaks" (non-negatives) which were the output of bpnet-shap (the file is called "peaks_valid_scores.bed"). You need to supply this file for FiNeMo to give you actual genome coordinates for the motif hits (you can't give it the original peak set, it's expecting a number of peaks equal to what's in the *.npz files). Additionally, bpnet-shap gives another super-useful output: a bigWig describing the enrichment for each motif, for visualization on the browser.

Is there any way to get these out of the chrombpnet pipeline? I guess from the attribute step specifically. I had thought the answer might be to run chrombpnet marginalize, but I'm not really sure what that does - also, you need to provide it one or more motifs in MEME format, and I'm not sure how to get those for the motifs that TF-MoDISco discovered. Can you shed some light on this?

Thanks, Greg

jmschrei commented 4 months ago

Hi Greg

I don't know the internals of how FiNeMo works, maybe @austintwang can provide some insight. But if you need an aligned peak file and ohe.npz and attr.npz files, you can just change the JSON to calculate attributions genome-wide, right? Alternatively, you can generate your own .bed file by taking the original one and removing rows that aren't from the provided chromosomes. I'm sure there's some sort of awk command for that.

Are you asking for a bigwig that contains the contribution scores for peaks in bigwig format? I currently don't provide that but I guess it should be easy to add, if you think it'd be useful.

marginalize takes in a set of PWMs and yields the average prediction for when that PWM is substituted into background sequences. It's supposed to just be the marginalized effect that each individual PWM has on model predictions.

austintwang commented 4 months ago

Hi Greg,

I also recommend configuring chrombpnet attribute to generate attributions for all peaks provided. This additionally ensures you're not excluding any regions with potential signal.

Best, Austin