quadbio / Pando

Multiome GRN inference.
https://quadbio.github.io/Pando/
MIT License

Question on custom motif sets #21

Closed DanielYuhangLi closed 1 year ago

DanielYuhangLi commented 1 year ago

Hello! Been having fun formatting some independent data with this package. I wanted to ask about how to properly create a custom motif set for input.

As an example, I've downloaded PWMs from other databases such as Vierstra and cisdb, generated motif matrices, and created motif objects. I wasn't sure how best to integrate these with the JASPAR core dataset, or how best to build a custom motif-to-TF map.

After reading a bit about TF naming schemes, I was wondering how you go about assigning names to TFs. From a brief look at the regions code for find_motif(), it seems the input TF name is matched against the gene names from the RNA assay. So how do we deal with dimers/trimers, i.e. motifs where more than one TF is implicated?

Similarly, do you have any recommendations for combining datasets? In your paper, you used a pretty sophisticated (at least to a beginner like me) method of selecting motifs from the core + unvalidated JASPAR collections as well as other sources, and I was hoping you could share some of your experience in putting this together.

Thanks!

Dan

joschif commented 1 year ago

Heyhey, those are pretty good questions. Currently we assign the motif to each member of the dimer/trimer individually, and don't really consider them a complex in the model. This is for sure too simplified and we have been thinking about other approaches, but it's also not entirely straightforward to deal with it properly. With good annotations, one could e.g. average the expression over all members of a complex... I'm very much open to suggestions :)
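To make the two options above concrete, here is a minimal, language-agnostic sketch in Python. It is not Pando code. It assumes complex motifs are named in the JASPAR style, with members joined by `::` (e.g. `MAX::MYC`). The motif IDs, the expression values, and the function names `motif_to_tf_rows` and `complex_expression` are all hypothetical and only serve as illustration.

```python
from statistics import mean

def motif_to_tf_rows(motif_names, sep="::"):
    """Expand each motif into one (motif_id, tf) row per complex member.

    This mirrors the 'assign the motif to each member individually' approach.
    """
    rows = []
    for motif_id, name in motif_names.items():
        for tf in name.split(sep):
            rows.append((motif_id, tf.strip()))
    return rows

def complex_expression(name, expression, sep="::"):
    """Average the expression of all complex members found in `expression`.

    Returns None when no member is present in the expression table.
    """
    members = [tf.strip() for tf in name.split(sep)]
    values = [expression[tf] for tf in members if tf in expression]
    return mean(values) if values else None

# Made-up motif IDs and expression values, for illustration only.
motifs = {"M1": "MAX::MYC", "M2": "CTCF"}
expression = {"MAX": 2.0, "MYC": 4.0, "CTCF": 1.5}

rows = motif_to_tf_rows(motifs)
avg = complex_expression("MAX::MYC", expression)
```

The rows from the first function could be turned into a motif-to-TF table, while the second function is one way the averaging idea might look if complexes were modeled as a single unit.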

As for building your own database, I think there are two approaches and it depends on what your general goal is. Either you try to get as many motifs for as many TFs as possible, or you try to find very confident ones. In the paper we really tried to get as many TFs as possible into the GRN, and therefore consulted a number of different databases and methods. Nowadays I think I would potentially go a different route and stick with the confidently annotated ones... If you want to combine multiple datasets, I would recommend defining a hierarchy of which one you trust more and choosing motifs based on this hierarchy. In principle, however, you could also just combine all datasets, since Pando supports having multiple motifs per TF.
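The hierarchy idea can be sketched roughly as follows, again in plain Python rather than with any Pando function. The source names (`curated`, `inferred`), the motif IDs, and the helper names are made up for illustration. The sketch assumes each source has already been reduced to a mapping from TF name to a list of motif IDs.

```python
def select_by_hierarchy(sources, priority):
    """For each TF, keep the motifs from the most trusted source that has any.

    `priority` lists source names from most to least trusted.
    """
    selected = {}
    for source in priority:
        for tf, motif_ids in sources.get(source, {}).items():
            if tf not in selected:
                selected[tf] = (source, list(motif_ids))
    return selected

def combine_all(sources):
    """Alternative: keep every motif from every source (multiple motifs per TF)."""
    combined = {}
    for tf_map in sources.values():
        for tf, motif_ids in tf_map.items():
            combined.setdefault(tf, []).extend(motif_ids)
    return combined

# Hypothetical sources and motif IDs.
sources = {
    "curated": {"SOX2": ["c1"]},
    "inferred": {"SOX2": ["i1", "i2"], "PAX6": ["i3"]},
}

by_hierarchy = select_by_hierarchy(sources, ["curated", "inferred"])
everything = combine_all(sources)
```

With the hierarchy, `SOX2` keeps only its curated motif and `PAX6` falls back to the less trusted source. With the union, `SOX2` carries all three motifs.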

Hope this helps, cheers, J

DanielYuhangLi commented 1 year ago

Cool, that's helpful! I have been able to piece together a slightly wider motif set, but I'm running into some trouble customizing the nested lists that the PWM databases come in. This is less a Pando question than a general R question, but are there any helpful resources that make this process a bit easier? I've been googling ways to manipulate these lists but haven't come up with anything concrete so far.

Edit: to add some detail, the formats that PFMs and PWMs come in from TFBSTools are difficult to edit (e.g. when I try to run an apply function on them, it reports that the objects are not subsettable). I've been looking at some other packages that offer conversions, but I'm still reading through things to see whether the motif matrices change during the 'conversions'.