[Question] Parsing Motif Names

monovich commented 4 years ago

First off, thanks for this! Your packages are always really helpful and simple to get working.

This isn't so much a question about your package as about its downstream usage. Do you have a preferred way of parsing the motif_name assigned by HOMER to convert it back to a gene name? The leading gene name doesn't consistently match a conventional name. If you just regex-parse the first chunk of the string, you often end up with things like "Stat3+il21" or "AP-2alpha", which aren't easily batch-converted to standard Ensembl gene symbols.
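For concreteness, here is a minimal Python sketch of the "regex parse the first chunk" approach, plus a small hand-curated override table for names that don't resolve on their own. The function names (`leading_name`, `to_symbol`) and the `MANUAL_OVERRIDES` table are hypothetical, and the two example strings assume HOMER's usual `name(family)/source/Homer` layout.

```python
import re

# Hand-curated fixes for leading names that are not valid gene symbols.
# This table is illustrative; in practice it grows as you curate your output.
MANUAL_OVERRIDES = {
    "Stat3+il21": "STAT3",   # "+il21" describes the stimulation, not the factor
    "AP-2alpha": "TFAP2A",   # older alias of TFAP2A
}

def leading_name(motif_name):
    """Return the text before the first '(' or '/' in a motif name."""
    return re.split(r"[(/]", motif_name, maxsplit=1)[0].strip()

def to_symbol(motif_name):
    """Map a motif name to a gene symbol, preferring the manual overrides."""
    name = leading_name(motif_name)
    return MANUAL_OVERRIDES.get(name, name)

examples = [
    "Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer",
    "AP-2alpha(AP2)/Hela-AP2alpha-ChIP-Seq(GSE31477)/Homer",
]
for motif_name in examples:
    print(leading_name(motif_name), "->", to_symbol(motif_name))
```

Names missing from the override table pass through unchanged, so they still need a manual check.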

I'd like to check to see if any of the enriched motifs assigned to specific factors correspond with any expression change in those factors in my corresponding RNA-seq dataset, which would require intersection of the output tables. Do you know if there exists a simpler way to achieve this? Thanks!
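The intersection I have in mind would look roughly like the sketch below, assuming each motif row has already been given a curated gene symbol. The column names (`symbol`, `motif_p`, `log2fc`, `padj`) and all values are made up for illustration and are not HOMER's or any RNA-seq tool's actual output.

```python
def intersect_by_symbol(motifs, expression):
    """Join motif rows to expression rows on gene symbol, ignoring case."""
    expr_by_symbol = {row["symbol"].upper(): row for row in expression}
    matched = []
    for motif in motifs:
        hit = expr_by_symbol.get(motif["symbol"].upper())
        if hit is not None:
            # Carry the expression statistics over onto the motif row.
            matched.append({**motif, "log2fc": hit["log2fc"], "padj": hit["padj"]})
    return matched

# Toy rows with invented values, only to show the shape of the join.
motifs = [
    {"motif": "Stat3+il21", "symbol": "Stat3", "motif_p": 1e-12},
    {"motif": "AP-2alpha", "symbol": "Tfap2a", "motif_p": 1e-5},
]
expression = [
    {"symbol": "STAT3", "log2fc": 1.8, "padj": 0.001},
    {"symbol": "Gapdh", "log2fc": 0.0, "padj": 0.9},
]

for row in intersect_by_symbol(motifs, expression):
    print(row["motif"], row["symbol"], row["log2fc"], row["padj"])
```

Matching case-insensitively papers over the mouse/human capitalization difference (Stat3 vs. STAT3), but it will not fix genes whose orthologs have different names.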

slowkow commented 4 years ago

Thanks for the comment!

Unfortunately, I think you may need to manually check that you get the right genes, no matter what strategy you take.

Here are some tips that might be helpful:

You can paste names into the text box on this silly website I made to try to ease the pain of this task: https://quickgene.net/

You could also try using the mygene Bioconductor package with queries like "Stat3+IL21" to see if it gives you back anything useful.
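If you'd rather script this outside R, the mygene package wraps the MyGene.info web service, so a query can be sketched as a plain URL. This is a rough sketch that assumes the v3 `query` endpoint and its `q`, `species`, and `fields` parameters; the helper name `mygene_query_url` is made up, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

MYGENE_QUERY_ENDPOINT = "https://mygene.info/v3/query"

def mygene_query_url(term, species="mouse", fields="symbol,name,ensembl.gene"):
    """Build a MyGene.info query URL for one search term (no request is sent)."""
    params = {"q": term, "species": species, "fields": fields}
    # urlencode percent-escapes characters like '+' so the term survives intact.
    return MYGENE_QUERY_ENDPOINT + "?" + urlencode(params)

for term in ["Stat3+IL21", "Stat3", "AP-2alpha"]:
    print(mygene_query_url(term))
```

The hits are still worth checking by eye, since a loose text query can return a plausible but wrong gene.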

I'm not really happy with the HOMER output files, and one day I would like to try using other packages instead of HOMER:

monovich commented 4 years ago

@slowkow Thanks for the quick response and excellent tips!

After spending the last couple of hours manually curating my HOMER output, I can safely say I am also quite unhappy with HOMER's output files (and HOMER generally as a tool). I'll definitely check out the package alternatives you linked for future projects and will absolutely be using your site to fetch proper gene symbols. I think you're correct that manual curation is inescapable for these output files.

I ended up realizing after posting this that the motifs are actually annotated in the HTML files with links to GeneCards, which definitely helps resolve many issues for motifs named after old or ambiguous gene symbols. One can manually investigate these links, but this doesn't seem to be currently handled or utilized by any HOMER helper tool I've discovered (homerkit, marge, etc.). If integrated into the output parsed by homerkit (i.e. to get gene symbols/names as a column in the output table), this would probably capture a good chunk of my described use case. Annoyingly, this still requires cross-referencing another database to fetch the species-specific symbols for the HOMER output. One would think that if the mouse genome is provided to a tool, the tool's gene annotation would provide mouse annotations and link out to some mouse-specific gene page, like those hosted by Ensembl, as opposed to those from a human-specific database, but I digress.
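A rough sketch of pulling those symbols out of the HTML is below. It assumes the links point at a genecards.org URL carrying the symbol in a `gene=` query parameter. The sample markup is hand-written for illustration rather than copied from a real HOMER report, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

class GeneCardsLinkParser(HTMLParser):
    """Collect the 'gene' query parameter from every GeneCards link."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if "genecards" not in parsed.netloc.lower():
            return
        gene = parse_qs(parsed.query).get("gene")
        if gene:
            self.symbols.append(gene[0])

def extract_genecards_symbols(html_text):
    """Return GeneCards gene symbols in the order their links appear."""
    parser = GeneCardsLinkParser()
    parser.feed(html_text)
    return parser.symbols

# Hand-written stand-in for a results table; the real markup may differ.
sample_html = """
<table>
<tr><td>Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer</td>
    <td><a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=STAT3">GeneCards</a></td></tr>
<tr><td>AP-2alpha(AP2)/Hela-AP2alpha-ChIP-Seq(GSE31477)/Homer</td>
    <td><a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TFAP2A">GeneCards</a></td></tr>
<tr><td><a href="http://example.org/other">unrelated link</a></td></tr>
</table>
"""

print(extract_genecards_symbols(sample_html))
```

The symbols recovered this way are human ones, so the mouse lookup mentioned above would still be a separate step.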

slowkow / homerkit

[Question] Parsing Motif Names #4