loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

BINDetect not giving out error when the motif file is "deformed" #248

Closed: johannesnicolaus closed this issue 1 month ago

johannesnicolaus commented 8 months ago

This might be a continuation of issue #78. When I run BINDetect with a "pfm" motif file created by gimmemotifs, the motifs are not read correctly.

The pfm file looks something like:

>GM.5.0.Sox.0001
0.7213  0.0793  0.1103  0.0891
0.9259  0.0072  0.0062  0.0607
0.0048  0.9203  0.0077  0.0672
0.9859  0.0030  0.0030  0.0081
0.9778  0.0043  0.0128  0.0051
0.1484  0.0050  0.0168  0.8299
>GM.5.0.Homeodomain.0001
0.8870  0.0000  0.0178  0.0951
0.1156  0.2033  0.6629  0.0181
0.0017  0.7452  0.0809  0.1722
0.0011  0.0003  0.0003  0.9983
0.0026  0.0141  0.9721  0.0111
0.0000  0.0189  0.0054  0.9758
0.0006  0.9983  0.0006  0.0006
0.9170  0.0140  0.0046  0.0644
0.2228  0.2421  0.3300  0.2051
0.3621  0.1054  0.2208  0.3116
0.5727  0.0104  0.1741  0.2428

For example, I have 1796 motifs in the pfm file, but BINDetect reports reading 5531 motifs and gives the following warning:

2023-12-16 10:23:46 (1569572) [INFO]    Reading motifs from file
2023-12-16 10:23:47 (1569572) [INFO]    - Read 5531 motifs
2023-12-16 10:23:47 (1569572) [WARNING] The motif output names (as given by --naming) are not unique.
2023-12-16 10:23:47 (1569572) [WARNING] The following names occur more than once: ['_']
2023-12-16 10:23:47 (1569572) [WARNING] These motifs will be renamed with '_1', '_2' etc. To prevent this renaming, please make the names of the input --motifs unique

The output directories are then named like this:

__1     __1413  __1829  __2243  __2659  __3073  __3489  __541  __957

or

GM.5.0.Sox.0001_GM.5.0.Sox.0001
GM.5.0.Sox.0002_GM.5.0.Sox.0002
GM.5.0.Sox.0003_GM.5.0.Sox.0003
GM.5.0.Sox.0004_GM.5.0.Sox.0004
GM.5.0.Sox.0005_GM.5.0.Sox.0005
GM.5.0.Sox.0006_GM.5.0.Sox.0006
GM.5.0.Sox.0007_GM.5.0.Sox.0007
GM.5.0.Sox.0008_GM.5.0.Sox.0008
GM.5.0.Sox.0009_GM.5.0.Sox.0009

This pfm file may not be in a standard pfm format, but it would be nice if BINDetect raised an error when the motif file is not in a format it can read, instead of continuing with misnamed motifs.

My current workaround is to convert the file with chen2meme, since it appears to be in the chen motif format. With the converted file, BINDetect seems to work fine.
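A quick way to confirm this kind of mismatch before running BINDetect is to count the header lines in the motif file and compare that number with the "Read N motifs" log line. The snippet below is a minimal sketch using only the Python standard library; `summarize_motif_file` is a hypothetical helper, not part of TOBIAS or gimmemotifs. It also collects the number of columns per matrix row: if every row has exactly 4 columns, that suggests the file stores one position per row, as in the excerpt above.

```python
def summarize_motif_file(text):
    """Return (number of '>' header lines, set of column counts seen in matrix rows).

    Hypothetical helper for a quick sanity check; not part of TOBIAS.
    """
    n_headers = 0
    column_counts = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_headers += 1
        else:
            column_counts.add(len(line.split()))
    return n_headers, column_counts


# Shortened excerpt of the file shown above
example = """>GM.5.0.Sox.0001
0.7213  0.0793  0.1103  0.0891
0.9259  0.0072  0.0062  0.0607
>GM.5.0.Homeodomain.0001
0.8870  0.0000  0.0178  0.0951
"""

n_headers, column_counts = summarize_motif_file(example)
print(f"{n_headers} headers, columns per row: {sorted(column_counts)}")
```

If the header count differs from the number of motifs BINDetect reports, the file is probably being interpreted in a different layout than the one it was written in.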

msbentsen commented 8 months ago

Hi @johannesnicolaus

Thank you for this issue - indeed it looks related to #78. There seems to be a bug in the reading of these files using biopython, which creates additional "empty" motifs with "_"-names. We have now changed the code to parse the file manually and check the length of each motif, so that an error is written if a deformed motif is found (screenshot of the error not shown).
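To illustrate what "manually parse and check the length" could look like, here is a minimal sketch. It is not the actual TOBIAS implementation: the function name `parse_four_column_motifs` is hypothetical, and it assumes the layout shown in the issue, with a `>name` header followed by one row per position and four values per row (column order assumed to be A, C, G, T).

```python
def parse_four_column_motifs(text):
    """Parse '>name' headers followed by rows of four numbers.

    Illustrative sketch only, not the TOBIAS parser. Raises ValueError
    as soon as a malformed ("deformed") motif is encountered.
    """
    motifs = {}
    name = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            if not name:
                raise ValueError(f"line {lineno}: motif header has no name")
            if name in motifs:
                raise ValueError(f"line {lineno}: duplicate motif name '{name}'")
            motifs[name] = []
            continue
        if name is None:
            raise ValueError(f"line {lineno}: matrix row found before any header")
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(
                f"line {lineno}: expected 4 columns for motif '{name}', "
                f"found {len(fields)}"
            )
        try:
            motifs[name].append([float(f) for f in fields])
        except ValueError:
            raise ValueError(
                f"line {lineno}: non-numeric value in motif '{name}'"
            ) from None
    # A header without any rows is the kind of "empty" motif described above
    empty = [n for n, rows in motifs.items() if not rows]
    if empty:
        raise ValueError(f"motifs without any rows: {empty}")
    return motifs


# Usage: a well-formed motif parses, a row with a missing value is rejected
well_formed = ">GM.5.0.Sox.0001\n0.7213 0.0793 0.1103 0.0891\n"
print(len(parse_four_column_motifs(well_formed)["GM.5.0.Sox.0001"]))

try:
    parse_four_column_motifs(">broken\n0.7213 0.0793 0.1103\n")
except ValueError as err:
    print(f"Rejected: {err}")
```

Failing early with the offending line number makes it clear that the input file is the problem, rather than silently producing motifs named "_".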

The code is not thoroughly tested yet, but you can already try it by installing the version directly from the dev branch: pip install git+https://github.com/loosolab/TOBIAS@dev

After testing, the functionality will be included in the next version of TOBIAS. Hope that helps 🙏

johannesnicolaus commented 8 months ago

Perfect, thanks so much!

github-actions[bot] commented 1 month ago

No activity for at least 30 days. Marking issue as stale. Stale issues are closed after one week.