j-andrews7 / VAMPIRE

Variant and Epigenetic anNotation for Underlying Significance and Regulation
MIT License

TF name/gene symbol discordance #41

Open j-andrews7 opened 7 years ago

j-andrews7 commented 7 years ago

Many motif databases label motifs with the TF name rather than the gene symbol (e.g. PU.1 rather than SPI1). Some of them (HOMER, for example) provide tables that make it easy to convert these names to the actual gene symbol, but others may not. Need to manually check how prevalent this is and make sure users are aware of the potential problem. The gene symbol is what's used to cross-reference the gene expression data, so the motif names have to match it.

For the HOCOMOCO set, I used such a table provided by the database to swap the UniProt protein IDs for gene symbols via a script, then manually checked each name to ensure it matched the gene symbols in our expression file.
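
For reference, the swap described above can be sketched roughly as follows. This is not the actual script used; the column names (`uniprot_id`, `gene_symbol`), the function names, and the example rows are all assumptions made for illustration. It also flags any motif whose converted symbol is missing from the expression file, which is the part that was checked by hand.

```python
import csv
import io


def load_id_to_symbol(table_text, id_col, symbol_col):
    """Parse a tab-delimited annotation table into {motif ID: gene symbol}.

    The column names are passed in because they differ between databases.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row[id_col]: row[symbol_col] for row in reader}


def rename_motifs(motif_ids, id_to_symbol, expression_symbols):
    """Swap motif IDs for gene symbols and collect anything needing review.

    Returns (renamed, unresolved): `renamed` maps motif ID -> gene symbol,
    `unresolved` lists IDs with no table entry or whose symbol is absent
    from the expression data.
    """
    renamed = {}
    unresolved = []
    for motif_id in motif_ids:
        symbol = id_to_symbol.get(motif_id)
        if symbol is None or symbol not in expression_symbols:
            unresolved.append(motif_id)
        else:
            renamed[motif_id] = symbol
    return renamed, unresolved


# Hypothetical table contents, for illustration only.
example_table = (
    "uniprot_id\tgene_symbol\n"
    "SPI1_HUMAN\tSPI1\n"
    "P53_HUMAN\tTP53\n"
)
```

Anything that lands in `unresolved` would still need the manual check.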

Potential ways around this:

j-andrews7 commented 7 years ago

I now expect to use mygene's query command for this, which should simplify things.

The best approach is likely to query the expression file and add columns for Ensembl, Entrez, and gene symbol IDs, then query each motif as the motif file is iterated through, check whether any of those three IDs match, and filter on that basis. This will likely simplify things and take care of some of the general issues with the expression data.
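
A rough sketch of the matching step, assuming the expression rows and the motifs have already been annotated with the three IDs (for example from the mygene query results, which are not fetched here). The field names `symbol`, `entrez`, and `ensembl`, the function names, and the example values are placeholders, not anything VAMPIRE currently defines.

```python
# ID types to try, in order of preference. These field names are assumptions.
ID_FIELDS = ("symbol", "entrez", "ensembl")


def build_expression_index(expression_rows):
    """Index expression rows by every ID they carry.

    Keys are (field, upper-cased value) so a symbol can never collide
    with an identical-looking ID of another type.
    """
    index = {}
    for row in expression_rows:
        for field in ID_FIELDS:
            value = row.get(field)
            if value:
                index[(field, str(value).upper())] = row
    return index


def match_motif(motif_ids, index):
    """Return the expression row matching any of the motif's IDs, or None."""
    for field in ID_FIELDS:
        value = motif_ids.get(field)
        if value:
            row = index.get((field, str(value).upper()))
            if row is not None:
                return row
    return None


# Illustrative data only: one expression row and two motifs.
expression_rows = [
    {"symbol": "SPI1", "entrez": "6688", "ensembl": "ENSG00000066336", "expr": 12.5},
]
index = build_expression_index(expression_rows)

# The motif is named PU.1, so the symbol lookup misses, but the Entrez ID hits.
hit = match_motif({"symbol": "PU.1", "entrez": "6688"}, index)
miss = match_motif({"symbol": "NOT_A_GENE"}, index)
```

Motifs that return `None` could then be filtered out or reported to the user rather than silently dropped.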