Closed bytewife closed 1 year ago
@jmschrei It appears that tfmodisco implements subclustering of seqlet clusters. How should we handle these w.r.t. writing motifs to the MEME files?
Example (modisco_results.CWM.meme):
MEME version 5
ALPHABET= ACGT
Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF pattern_0
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.000966 0.001605 0.001199 0.000856
0.000863 0.001477 0.000871 0.001402
0.001114 0.000545 0.001213 0.000839
0.001036 0.001990 0.001227 0.000744
0.002566 0.000326 0.000082 0.000209
-0.000404 0.000548 -0.000335 0.003904
0.000359 0.000382 0.001446 0.000812
0.000310 0.002486 0.000013 0.000915
-0.001650 0.000611 -0.000274 -0.000931
-0.002431 -0.001015 -0.001942 -0.000258
-0.001104 -0.000323 -0.000156 -0.001906
-0.000923 0.000257 0.000545 -0.001862
-0.002585 -0.001050 0.000504 -0.002165
-0.006278 -0.002906 -0.001575 -0.004529
-0.006525 -0.005682 0.000230 -0.003893
0.010032 -0.002446 0.000890 -0.004159
-0.000209 -0.000327 -0.000078 0.061095
-0.000307 0.000133 0.038049 -0.002317
0.046923 -0.000388 -0.000595 -0.000854
-0.006760 -0.001430 0.002318 -0.004597
-0.000112 0.000047 -0.000221 0.058326
-0.000075 0.061763 0.000000 0.000003
0.059826 -0.000072 -0.000006 -0.000103
-0.002650 -0.001453 -0.006538 0.008228
-0.003757 -0.000018 -0.004384 -0.007134
-0.000943 0.003998 -0.000030 -0.000349
-0.003023 0.001721 -0.001096 -0.001723
-0.001865 0.001419 0.000822 -0.001262
-0.002786 -0.001106 -0.000543 0.000054
0.000729 0.000193 0.001019 -0.000507
MOTIF pattern_1
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.001049 0.001619 0.000427 0.000511
-0.000200 0.001497 -0.000001 0.001158
-0.000498 0.000407 0.000163 -0.000873
0.000389 0.000782 0.001914 0.000335
-0.004773 -0.001553 -0.003430 -0.004258
-0.004601 -0.001412 -0.000612 -0.003126
-0.005268 -0.002778 -0.003182 -0.001171
-0.002875 -0.001393 -0.001984 0.028612
0.027435 -0.002664 0.001364 -0.002392
0.065485 -0.000058 -0.000566 -0.000410
-0.003637 -0.002039 -0.003270 0.037624
-0.005678 -0.004789 -0.007113 0.000488
0.006520 -0.002398 -0.000639 -0.002607
0.001827 -0.003912 -0.004117 -0.003664
0.010983 -0.004025 -0.004508 -0.001473
0.007936 -0.003902 -0.003507 -0.001718
-0.001166 0.009161 -0.004876 -0.001414
-0.001216 0.089566 0.000095 0.000006
0.063222 0.000122 0.000036 -0.000758
-0.001174 -0.003339 0.028237 0.000797
0.024157 -0.000106 -0.001504 -0.000790
-0.002020 -0.000069 -0.000733 0.048505
-0.000895 -0.001108 0.059248 -0.003064
-0.001797 -0.002617 0.000276 0.005361
-0.005389 -0.001219 -0.004015 -0.004279
-0.002335 0.000955 -0.000330 -0.001823
-0.002292 0.000080 -0.000658 -0.000764
-0.002291 0.000086 -0.000120 -0.001799
-0.001238 0.002040 0.000819 -0.000516
0.000059 0.000320 -0.000112 -0.000813
MOTIF pattern_2
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.000898 0.002598 0.000942 0.000500
0.001459 0.002100 0.000154 0.000675
0.000583 0.001926 0.001299 0.000154
0.000785 0.000994 0.001907 -0.000990
-0.000691 0.000540 0.000071 -0.001234
0.000047 0.000679 -0.000569 -0.001471
-0.000454 -0.000009 0.000828 -0.000837
-0.001950 0.000279 0.003098 -0.002683
-0.001444 -0.001653 -0.000006 -0.002776
-0.004442 0.002268 -0.002209 -0.005088
0.004961 -0.002721 0.008938 -0.001852
-0.002452 -0.002147 -0.001695 0.011936
0.000027 0.045844 -0.000629 0.000031
0.000057 0.000000 0.000000 0.043926
0.000000 0.000000 0.076969 -0.000147
-0.000493 -0.003102 0.009632 -0.001143
-0.000913 -0.003879 -0.002797 0.001645
-0.000913 -0.002719 -0.003337 0.005109
0.000368 -0.002550 -0.001215 0.007616
0.000958 -0.003784 -0.001084 0.000913
-0.000987 -0.001050 -0.001777 0.005295
0.006877 -0.004218 0.003535 -0.004070
-0.000120 -0.000042 0.000193 0.056402
-0.000554 0.041489 0.000058 0.000563
0.023777 -0.000540 -0.000619 -0.001303
0.000133 -0.001554 0.000041 -0.001371
-0.002286 -0.001094 -0.002247 -0.002995
-0.002717 -0.000872 -0.001313 -0.002747
-0.000462 0.001027 0.000507 0.000987
0.000069 0.000760 0.001237 -0.000024
@jmschrei Could you also verify that the probability matrices calculated for each datatype are as expected?
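One possible reading of "as expected" for a `letter-probability matrix` is that every row is non-negative and sums to 1. A minimal sketch of such a check is below; `is_probability_matrix` is a hypothetical helper written for this thread, not a tfmodisco function, and the tolerance is an arbitrary choice.

```python
import math

def is_probability_matrix(rows, tol=1e-4):
    """Return True if every row is non-negative and sums to 1 within tol.

    Hypothetical helper for illustration only; not part of tfmodisco.
    """
    for row in rows:
        if any(v < 0 for v in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True

# Two rows copied from pattern_0 in the CWM example above.
cwm_rows = [
    [0.000966, 0.001605, 0.001199, 0.000856],
    [-0.000404, 0.000548, -0.000335, 0.003904],
]

# The uniform background used in the example header, as a sanity case.
uniform_rows = [[0.25, 0.25, 0.25, 0.25]]

print(is_probability_matrix(cwm_rows))
print(is_probability_matrix(uniform_rows))
```

Under this reading, the CWM rows in the example would not pass, since they contain negative values and do not sum to 1. Whether that is intended for the CWM datatype is the open question here.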
Let's ignore subpatterns for now. We can add support for them afterwards. Most likely that would just mean writing all the patterns out first and then the subpatterns after them.
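A rough sketch of that ordering, in case it helps the later discussion. Everything here is an assumption made for illustration: `format_motif` and `write_meme` are hypothetical helpers, the inputs are plain dicts rather than tfmodisco objects, and the `pattern_0.subpattern_0` naming scheme is invented. The header lines simply mirror the example above.

```python
def format_motif(name, rows, nsites=1):
    """Format one motif block in the layout used in the example above."""
    lines = [
        f"MOTIF {name}",
        f"letter-probability matrix: alength= {len(rows[0])} "
        f"w= {len(rows)} nsites= {nsites}",
    ]
    lines += [" ".join(f"{v:.6f}" for v in row) for row in rows]
    return lines

def write_meme(patterns, subpatterns=None):
    """Return MEME text with all patterns first, then all subpatterns.

    patterns:    dict mapping pattern name -> list of rows (hypothetical input)
    subpatterns: dict mapping parent pattern name -> list of row-lists
    """
    out = [
        "MEME version 5",
        "ALPHABET= ACGT",
        "Background letter frequencies",
        "A 0.25 C 0.25 G 0.25 T 0.25",
    ]
    # Top-level patterns are written out first.
    for name, rows in patterns.items():
        out += format_motif(name, rows)
    # Subpatterns follow, named after their parent (assumed naming scheme).
    for parent, subs in (subpatterns or {}).items():
        for i, rows in enumerate(subs):
            out += format_motif(f"{parent}.subpattern_{i}", rows)
    return "\n".join(out) + "\n"

patterns = {
    "pattern_0": [[0.25, 0.25, 0.25, 0.25]],
    "pattern_1": [[0.1, 0.2, 0.3, 0.4]],
}
subpatterns = {"pattern_0": [[[0.7, 0.1, 0.1, 0.1]]]}
print(write_meme(patterns, subpatterns))
```

Keeping the subpatterns at the end means a file written without them stays a prefix of a file written with them, so existing consumers of the pattern-only output should see the same leading motifs.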