Some of the motif databases (ENCODE, HOMER) provide position frequency matrices (PFMs) rather than count matrices. This isn't a big issue in itself, but we will need to handle it in the modules that expect count matrices, which they convert into PFMs and then into position weight matrices / position-specific scoring matrices (PWMs/PSSMs; the two terms mean the same thing).
In particular, summarize.py and motifs.py do this. Options include:
Change our motif format to use PFMs only, and convert count matrices to PFMs during the motif file conversion process (not difficult, and it reduces the need for a pseudocount option in certain programs).
Add checks to summarize.py and motifs.py that inspect the first motif in the file: if all elements of a line start with '0.', it's a PFM; if all elements of a line are integers, it's a count matrix.
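For reference, the check in the second option could look roughly like the sketch below. It is only an illustration: `classify_matrix_row` is a hypothetical helper, not existing code in summarize.py or motifs.py, and it assumes whitespace-delimited rows with one motif position per line. It also swaps the literal "starts with '0.'" test for a "values lie in [0, 1] and sum to about 1" test, since a PFM row can legitimately contain `1.000`.

```python
def classify_matrix_row(line, tol=0.01):
    """Guess whether one matrix row holds counts or frequencies.

    Hypothetical helper; assumes whitespace-delimited values.
    Returns "counts", "pfm", or "unknown".
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty matrix row")

    # All-integer rows are treated as counts.
    if all(t.isdigit() for t in tokens):
        return "counts"

    try:
        values = [float(t) for t in tokens]
    except ValueError:
        raise ValueError(f"non-numeric matrix row: {line!r}")

    # Frequencies lie in [0, 1] and sum to roughly 1 (files are often rounded).
    if all(0.0 <= v <= 1.0 for v in values) and abs(sum(values) - 1.0) <= tol:
        return "pfm"
    return "unknown"


print(classify_matrix_row("12 0 3 5"))
print(classify_matrix_row("1.000 0.000 0.000 0.000"))
```

One caveat with any row-level check: a row such as `1 0 0 0` is ambiguous (a single observed sequence, or a frequency row written without decimals), so inspecting only the first line of the first motif may misclassify some files.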
I'm leaning towards the former currently, as it's likely a cleaner solution.
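A minimal sketch of what the first option implies: count matrices are normalized to PFMs once, at conversion time, and downstream modules only ever go from PFM to PWM. The function names, the row-per-position layout with columns in A, C, G, T order, and the uniform background are all assumptions for illustration, not the current code.

```python
import math


def counts_to_pfm(counts, pseudocount=0.0):
    """Normalize each position (row) of a count matrix to frequencies."""
    pfm = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        pfm.append([(c + pseudocount) / total for c in row])
    return pfm


def pfm_to_pwm(pfm, background=(0.25, 0.25, 0.25, 0.25)):
    """Log2-odds of each frequency against a background distribution.

    Raises ValueError if any frequency is exactly 0, because log2(0)
    is undefined.
    """
    return [[math.log2(f / b) for f, b in zip(row, background)]
            for row in pfm]


# Hypothetical two-position motif; columns assumed to be A, C, G, T.
counts = [[8, 0, 0, 0], [2, 2, 2, 2]]
pfm = counts_to_pfm(counts, pseudocount=1.0)
pwm = pfm_to_pwm(pfm)
```

Note that PFMs supplied directly by a database may already contain zero frequencies, so the PFM-to-PWM step would still need some guard (for example a small frequency floor) even though count-level pseudocounts are no longer needed.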