(WIP) Implement MEME output subcommand `meme`

bytewife commented 1 year ago

Fixes #23
This PR implements several ways of writing MEME files. Not all of them are canonical under the MEME spec (the CWM variant, for example, is not). The options are as follows:
- 'PFM': The position-frequency matrix.
- 'CWM': The contribution-weight matrix.
- 'hCWM': The hypothetical contribution-weight matrix. Hypothetical contribution scores are the contributions of the nucleotides that are not present in the one-hot encoded sequence.
- 'CWM-PFM': The softmax of the contribution-weight matrix.
- 'hCWM-PFM': The softmax of the hypothetical contribution-weight matrix.
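
For reference, here is a minimal sketch of what "the softmax of the contribution-weight matrix" could look like. It assumes the softmax is taken independently at each position, across the four nucleotides, which the description above does not state explicitly. The name `softmax_rows` and the toy matrix are hypothetical and are not taken from this PR.

```python
import math

def softmax_rows(matrix):
    """Apply a softmax across the nucleotides of each position (row).

    Assumption: one row per motif position, one column per letter (A, C, G, T).
    """
    result = []
    for row in matrix:
        # Subtract the row maximum before exponentiating for numerical stability.
        row_max = max(row)
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

# Toy CWM: an uninformative position and a position favouring 'A'.
cwm = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
pfm_like = softmax_rows(cwm)
```

Each output row is non-negative and sums to 1, so it can be written under a `letter-probability matrix` header without violating what that header implies.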

bytewife commented 1 year ago

@jmschrei It appears that tfmodisco implements subclustering of seqlet clusters. How should we handle these w.r.t. writing motifs to the MEME files?

bytewife commented 1 year ago

Example (modisco_results.CWM.meme):

MEME version 5

ALPHABET= ACGT

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF pattern_0
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.000966 0.001605 0.001199 0.000856
0.000863 0.001477 0.000871 0.001402
0.001114 0.000545 0.001213 0.000839
0.001036 0.001990 0.001227 0.000744
0.002566 0.000326 0.000082 0.000209
-0.000404 0.000548 -0.000335 0.003904
0.000359 0.000382 0.001446 0.000812
0.000310 0.002486 0.000013 0.000915
-0.001650 0.000611 -0.000274 -0.000931
-0.002431 -0.001015 -0.001942 -0.000258
-0.001104 -0.000323 -0.000156 -0.001906
-0.000923 0.000257 0.000545 -0.001862
-0.002585 -0.001050 0.000504 -0.002165
-0.006278 -0.002906 -0.001575 -0.004529
-0.006525 -0.005682 0.000230 -0.003893
0.010032 -0.002446 0.000890 -0.004159
-0.000209 -0.000327 -0.000078 0.061095
-0.000307 0.000133 0.038049 -0.002317
0.046923 -0.000388 -0.000595 -0.000854
-0.006760 -0.001430 0.002318 -0.004597
-0.000112 0.000047 -0.000221 0.058326
-0.000075 0.061763 0.000000 0.000003
0.059826 -0.000072 -0.000006 -0.000103
-0.002650 -0.001453 -0.006538 0.008228
-0.003757 -0.000018 -0.004384 -0.007134
-0.000943 0.003998 -0.000030 -0.000349
-0.003023 0.001721 -0.001096 -0.001723
-0.001865 0.001419 0.000822 -0.001262
-0.002786 -0.001106 -0.000543 0.000054
0.000729 0.000193 0.001019 -0.000507

MOTIF pattern_1
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.001049 0.001619 0.000427 0.000511
-0.000200 0.001497 -0.000001 0.001158
-0.000498 0.000407 0.000163 -0.000873
0.000389 0.000782 0.001914 0.000335
-0.004773 -0.001553 -0.003430 -0.004258
-0.004601 -0.001412 -0.000612 -0.003126
-0.005268 -0.002778 -0.003182 -0.001171
-0.002875 -0.001393 -0.001984 0.028612
0.027435 -0.002664 0.001364 -0.002392
0.065485 -0.000058 -0.000566 -0.000410
-0.003637 -0.002039 -0.003270 0.037624
-0.005678 -0.004789 -0.007113 0.000488
0.006520 -0.002398 -0.000639 -0.002607
0.001827 -0.003912 -0.004117 -0.003664
0.010983 -0.004025 -0.004508 -0.001473
0.007936 -0.003902 -0.003507 -0.001718
-0.001166 0.009161 -0.004876 -0.001414
-0.001216 0.089566 0.000095 0.000006
0.063222 0.000122 0.000036 -0.000758
-0.001174 -0.003339 0.028237 0.000797
0.024157 -0.000106 -0.001504 -0.000790
-0.002020 -0.000069 -0.000733 0.048505
-0.000895 -0.001108 0.059248 -0.003064
-0.001797 -0.002617 0.000276 0.005361
-0.005389 -0.001219 -0.004015 -0.004279
-0.002335 0.000955 -0.000330 -0.001823
-0.002292 0.000080 -0.000658 -0.000764
-0.002291 0.000086 -0.000120 -0.001799
-0.001238 0.002040 0.000819 -0.000516
0.000059 0.000320 -0.000112 -0.000813

MOTIF pattern_2
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.000898 0.002598 0.000942 0.000500
0.001459 0.002100 0.000154 0.000675
0.000583 0.001926 0.001299 0.000154
0.000785 0.000994 0.001907 -0.000990
-0.000691 0.000540 0.000071 -0.001234
0.000047 0.000679 -0.000569 -0.001471
-0.000454 -0.000009 0.000828 -0.000837
-0.001950 0.000279 0.003098 -0.002683
-0.001444 -0.001653 -0.000006 -0.002776
-0.004442 0.002268 -0.002209 -0.005088
0.004961 -0.002721 0.008938 -0.001852
-0.002452 -0.002147 -0.001695 0.011936
0.000027 0.045844 -0.000629 0.000031
0.000057 0.000000 0.000000 0.043926
0.000000 0.000000 0.076969 -0.000147
-0.000493 -0.003102 0.009632 -0.001143
-0.000913 -0.003879 -0.002797 0.001645
-0.000913 -0.002719 -0.003337 0.005109
0.000368 -0.002550 -0.001215 0.007616
0.000958 -0.003784 -0.001084 0.000913
-0.000987 -0.001050 -0.001777 0.005295
0.006877 -0.004218 0.003535 -0.004070
-0.000120 -0.000042 0.000193 0.056402
-0.000554 0.041489 0.000058 0.000563
0.023777 -0.000540 -0.000619 -0.001303
0.000133 -0.001554 0.000041 -0.001371
-0.002286 -0.001094 -0.002247 -0.002995
-0.002717 -0.000872 -0.001313 -0.002747
-0.000462 0.001027 0.000507 0.000987
0.000069 0.000760 0.001237 -0.000024
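
A rough sketch of how a file with the layout shown above could be produced. It only mirrors the header lines and the six-decimal formatting visible in the example; `format_meme` and its arguments are hypothetical names, not the functions added in this PR, and `nsites= 1` is copied from the example rather than computed.

```python
def format_meme(motifs, background=(0.25, 0.25, 0.25, 0.25), alphabet="ACGT"):
    """Return MEME-style text for a dict mapping motif name -> list of rows.

    Assumption: each row holds one value per letter of `alphabet`.
    """
    lines = ["MEME version 5", "", f"ALPHABET= {alphabet}", ""]
    lines.append("Background letter frequencies")
    lines.append(" ".join(f"{letter} {freq:g}" for letter, freq in zip(alphabet, background)))
    lines.append("")
    for name, matrix in motifs.items():
        lines.append(f"MOTIF {name}")
        lines.append(
            f"letter-probability matrix: alength= {len(alphabet)} "
            f"w= {len(matrix)} nsites= 1"
        )
        for row in matrix:
            # Six decimal places, matching the example output.
            lines.append(" ".join(f"{value:.6f}" for value in row))
        lines.append("")
    return "\n".join(lines)

# Two rows taken from pattern_0 in the example above.
text = format_meme({
    "pattern_0": [
        [0.000966, 0.001605, 0.001199, 0.000856],
        [-0.000404, 0.000548, -0.000335, 0.003904],
    ]
})
```

Returning a string rather than writing to disk keeps the sketch easy to test; a caller would write `text` to a path such as `modisco_results.CWM.meme`.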

bytewife commented 1 year ago

@jmschrei Could you also verify that the probability matrices calculated for each datatype are as expected?
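
One way to sanity-check the output, sketched under the assumption that a `letter-probability matrix` should have non-negative rows that each sum to 1. The function name `is_probability_matrix` and the tolerance are hypothetical choices for illustration.

```python
import math

def is_probability_matrix(matrix, tol=1e-4):
    """Return True if every row is non-negative and sums to 1 within `tol`."""
    for row in matrix:
        if any(value < 0 for value in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True

# A uniform position passes the check.
uniform_ok = is_probability_matrix([[0.25, 0.25, 0.25, 0.25]])

# A raw CWM row from the example (pattern_0) contains negative values and fails.
cwm_ok = is_probability_matrix([[-0.000404, 0.000548, -0.000335, 0.003904]])
```

Under this check the PFM and softmax variants would be expected to pass, while the raw CWM and hCWM outputs would not, which is consistent with those variants not being canonical MEME.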

jmschrei commented 1 year ago

Let's ignore subpatterns for now. We can add support for them afterwards. It would probably just mean writing the patterns out first and then the subpatterns.
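
A small sketch of that ordering, for when subpattern support is added. Everything here is an assumption made for illustration: the function name `ordered_motifs`, the input shapes, and the `<parent>.subpattern_<i>` naming are not taken from tfmodisco-lite.

```python
def ordered_motifs(patterns, subpatterns):
    """Return (name, matrix) pairs with all patterns first, then all subpatterns.

    patterns:    dict mapping pattern name -> matrix
    subpatterns: dict mapping parent pattern name -> list of matrices
    """
    ordered = list(patterns.items())
    for parent, matrices in subpatterns.items():
        for index, matrix in enumerate(matrices):
            # Hypothetical naming scheme for subpattern motifs.
            ordered.append((f"{parent}.subpattern_{index}", matrix))
    return ordered

# Toy input: two patterns, one of which has two subpatterns.
row = [[0.25, 0.25, 0.25, 0.25]]
names = [name for name, _ in ordered_motifs(
    {"pattern_0": row, "pattern_1": row},
    {"pattern_0": [row, row]},
)]
```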

jmschrei commented 1 year ago

🥳

jmschrei / tfmodisco-lite

(WIP) Implement MEME output subcommand `meme` #27