Open j-andrews7 opened 7 years ago
I now expect to use mygene's `query` command for this, which should simplify things.
The best approach is probably to query the IDs in the expression file and add columns for Ensembl, Entrez, and gene symbol IDs. Then, as the motif file is iterated through, check whether any of those three IDs matches the motif and filter on that. This will likely simplify things and take care of some of the issues with the expression handling in general.
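A rough sketch of the matching step, assuming the expression rows have already been annotated with all three ID types (for example from a mygene lookup). The function names, the row layout, and the expression values here are hypothetical and only meant to illustrate the idea, not the module's actual code:

```python
def build_id_index(expression_rows):
    """Map every known ID (Ensembl, Entrez, symbol) to its expression row."""
    index = {}
    for row in expression_rows:
        for key in ("ensembl", "entrez", "symbol"):
            value = row.get(key)
            if value:
                # Upper-case so symbol matching is case-insensitive.
                index[str(value).upper()] = row
    return index


def find_expression(motif_ids, index):
    """Return the first expression row matching any ID listed for a motif."""
    for motif_id in motif_ids:
        row = index.get(str(motif_id).upper())
        if row is not None:
            return row
    return None


# Illustrative rows; the expression values are made up.
expression_rows = [
    {"symbol": "SPI1", "entrez": "6688", "ensembl": "ENSG00000066336", "expression": 42.0},
    {"symbol": "CTCF", "entrez": "10664", "ensembl": "ENSG00000102974", "expression": 0.2},
]
index = build_id_index(expression_rows)

# A motif labelled only by its TF name finds nothing; adding any one of the
# three IDs is enough to recover the expression row.
no_hit = find_expression(["PU.1"], index)
hit = find_expression(["PU.1", "spi1"], index)
```

Building the index once keeps each motif lookup cheap while the motif file is iterated through.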
Many motif databases use the TF name rather than the gene symbol (e.g. PU.1 rather than SPI1). Some of them provide tables that allow these to be easily converted to the actual gene symbol (HOMER, for example), but others may not. Need to manually check how prevalent this is and make sure users are aware of this potential problem. The gene symbol is what's used to cross-reference the gene expression data, so it's necessary.
For the HOCOMOCO set, I used such a table provided by the database to swap the UniProt protein IDs for the gene symbols via a script, and then manually checked each name individually to ensure it was the same as the gene symbol in our expression file.
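Something along these lines could automate part of that check. This is only a sketch: the function name, the mapping dictionary, and the example entries are hypothetical stand-ins for whatever conversion table a given database provides, and it is not the script that was actually used:

```python
def rename_motifs(motif_names, id_to_symbol, expression_symbols):
    """Swap motif names for gene symbols and collect the ones needing review.

    A name is flagged if it has no entry in the conversion table or if the
    converted symbol is absent from the expression file.
    """
    renamed = {}
    unmatched = []
    for name in motif_names:
        symbol = id_to_symbol.get(name)
        if symbol is None or symbol not in expression_symbols:
            unmatched.append(name)
        else:
            renamed[name] = symbol
    return renamed, unmatched


# Illustrative inputs only.
id_to_symbol = {"SPI1_HUMAN": "SPI1", "CTCF_HUMAN": "CTCF"}
expression_symbols = {"SPI1", "GATA1"}

renamed, unmatched = rename_motifs(
    ["SPI1_HUMAN", "CTCF_HUMAN", "PU.1"], id_to_symbol, expression_symbols
)
# renamed holds the motifs that can be cross-referenced directly;
# unmatched lists the names that still need a manual look.
```

Anything that lands in `unmatched` still has to be checked by hand, but the list should be much shorter than the full motif set.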
Potential ways around this:
`tf_expression` module with any amount of reliability unless they manually curated their gene expression and motif files themselves to match up.