TF name/gene symbol discordance

Many motif databases label motifs with the TF name rather than the gene symbol (e.g. PU.1 rather than SPI1). Some of them (HOMER, for example) provide tables that make it easy to convert these names to the official gene symbol, but others may not. I need to manually check how prevalent this is and make sure users are aware of the potential problem. The gene symbol is what's used to cross-reference the gene expression data, so the motif names have to match it.

For the HOCOMOCO set, I used such a table provided by the database to swap the UniProt protein IDs for gene symbols via a script, then manually checked each name to ensure it matched the gene symbol in our expression file.
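A minimal sketch of what this kind of swap could look like, not the actual script used. The column names (`uniprot_id`, `gene_symbol`), the function names, and the example rows are all assumptions for illustration; a real annotation table will need its own column names passed in. Anything that can't be mapped, or that maps to a symbol missing from the expression file, is returned separately so it can be checked by hand.

```python
import csv
import io


def load_id_to_symbol(table_text, id_col="uniprot_id", symbol_col="gene_symbol"):
    """Parse a tab-delimited annotation table into an {ID: gene symbol} dict.

    The column names are hypothetical; adjust them to the real table header.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row[id_col]: row[symbol_col] for row in reader}


def rename_motifs(motif_ids, id_to_symbol, expression_symbols):
    """Map motif IDs to gene symbols and flag anything needing manual review.

    Returns (renamed, unresolved): a dict of ID -> symbol for IDs whose symbol
    is present in the expression data, and a list of every other ID.
    """
    renamed = {}
    unresolved = []
    for motif_id in motif_ids:
        symbol = id_to_symbol.get(motif_id)
        if symbol is not None and symbol in expression_symbols:
            renamed[motif_id] = symbol
        else:
            unresolved.append(motif_id)
    return renamed, unresolved


# Illustrative rows only, not copied from any database file.
EXAMPLE_TABLE = (
    "uniprot_id\tgene_symbol\n"
    "SPI1_HUMAN\tSPI1\n"
    "CTCF_HUMAN\tCTCF\n"
)
```

The `unresolved` list is the part that still needs the manual check: it catches both IDs absent from the table and symbols that don't appear in the expression file.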

Potential ways around this:

1. Curate the datasets and provide them as package data. This is the most straightforward method. Users could still provide their own motif lists, but they likely couldn't use the tf_expression module reliably unless they manually curated their gene expression and motif files to match.
2. Try to use the HGNC REST API to guess the gene symbol from whatever TF name is given. This is more involved and makes more assumptions. I think the first approach is probably better.
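For reference, a rough sketch of what the HGNC lookup might look like. It assumes the `rest.genenames.org` search endpoints for the `symbol`, `prev_symbol`, and `alias_symbol` fields and a JSON response with hits under `response.docs`; the function names and the "exactly one hit or give up" rule are my own assumptions, not anything in the package. The fetch function is passed in so the logic can be tried without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint layout of the HGNC REST search service.
HGNC_SEARCH_URL = "https://rest.genenames.org/search/{field}/{term}"
# Tried in order: current symbol first, then previous symbols, then aliases.
SEARCH_FIELDS = ("symbol", "prev_symbol", "alias_symbol")


def fetch_json(url):
    """Fetch a URL and decode the JSON body (HGNC returns XML unless asked)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def guess_symbol(tf_name, fetch=fetch_json):
    """Guess the approved gene symbol for a TF name, or return None.

    Returns a symbol only when a search field yields exactly one distinct
    symbol; an ambiguous or empty result returns None so it can be curated
    by hand instead of silently guessed.
    """
    term = urllib.parse.quote(tf_name, safe="")
    for field in SEARCH_FIELDS:
        payload = fetch(HGNC_SEARCH_URL.format(field=field, term=term))
        docs = payload.get("response", {}).get("docs", [])
        symbols = {doc["symbol"] for doc in docs if "symbol" in doc}
        if len(symbols) == 1:
            return symbols.pop()
        if len(symbols) > 1:
            return None  # ambiguous: more than one candidate gene
    return None
```

Even in this small form the extra assumptions are visible: the search order, what counts as ambiguous, and how names with punctuation behave in the query all need decisions, which is part of why curated package data looks simpler.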

j-andrews7 / VAMPIRE
