jbkinney / logomaker

Software for the visualization of sequence-function relationships
MIT License

How to make it run for k-mers ('AA', 'AT', etc.) instead of single letters ('A', 'T')? #13

Closed PartheshSoni closed 3 years ago

PartheshSoni commented 3 years ago

I basically want to find a sequence motif that takes k-mers into account (with k anywhere from 1 to about 5 or 6) instead of single characters. That is, I want the frequency of each k-mer at each position. Any idea how I can do that using logomaker or any other library?

atareen commented 3 years ago

Hi,

There isn't a way to do this directly in logomaker, but here is what I can suggest: modify the function alignment_to_matrix in logomaker so that it counts multi-character strings at each position, as your task requires. I've pasted a link to the method below. If you do it this way, I think you'll have to update the loop starting on line 582. Once you have an updated dataframe, you can draw it with logomaker.

https://github.com/jbkinney/logomaker/blob/76aae02e03af9a5abc0880054d7d0c9a6f571c07/logomaker/src/matrix.py#L467
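As an alternative to patching the library, the per-position k-mer counts can also be computed outside logomaker. The following is only a minimal sketch, assuming all sequences are aligned and of equal length and that overlapping k-mers are counted at every start position. The function names `count_kmers_by_position` and `to_dataframe` are hypothetical and not part of logomaker.

```python
from collections import Counter


def count_kmers_by_position(sequences, k):
    """Count the k-mers that start at each position of equal-length sequences.

    Returns (kmers, rows): kmers is the sorted list of observed k-mers, and
    rows[i][j] is how many sequences have kmers[j] starting at position i.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    if not 1 <= k <= length:
        raise ValueError("k must be between 1 and the sequence length")

    # One Counter per start position; overlapping windows are assumed.
    n_positions = length - k + 1
    counters = [
        Counter(seq[i:i + k] for seq in sequences) for i in range(n_positions)
    ]

    # Columns are every k-mer seen anywhere; a missing k-mer counts as 0.
    kmers = sorted(set().union(*counters))
    rows = [[counter[kmer] for kmer in kmers] for counter in counters]
    return kmers, rows


def to_dataframe(kmers, rows):
    """Wrap the counts in a pandas DataFrame (positions x k-mers)."""
    import pandas as pd  # imported lazily so the counting code needs only the stdlib

    df = pd.DataFrame(rows, columns=kmers)
    df.index.name = "pos"
    return df


# Small example with dinucleotides (k = 2).
kmers, rows = count_kmers_by_position(["ACGT", "ACGA", "TCGT"], 2)
for pos, row in enumerate(rows):
    print(pos, dict(zip(kmers, row)))
```

Dividing each row by its sum would turn the counts into per-position frequencies. Note that, as discussed further down in this thread, the stock Logo class rejects multi-letter column names, so drawing such a dataframe still requires changes on the logomaker side.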

During development we wrote methods for drawing dinucleotide logos (e.g., see below) but these currently aren't implemented. I hope this helps. I am not sure about other libraries.

[Screenshot: example dinucleotide logo]

PartheshSoni commented 3 years ago

Thanks, that makes sense. However, I am passing dataframes whose columns are multi-letter strings to the Logo class, and I get an exception saying that multi-letter characters are not supported. After removing that exception clause and a few related ones, I can plot as required, but the letters are drawn in grayscale. To fix this, I pass a color scheme as a dict (a letter -> color mapping) and remove the call to _expand_color_dict(); with that, the issue is solved.
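For anyone following the same route, a k-mer -> color mapping of this kind could be built along the following lines. This is only a sketch under assumptions: the helper name `kmer_color_dict` is hypothetical, the evenly spaced hues are an arbitrary choice, and whether a patched logomaker accepts hex strings keyed by multi-letter names depends on the local modifications described above.

```python
import colorsys
from itertools import product


def kmer_color_dict(alphabet, k):
    """Map every k-mer over `alphabet` to a hex color with evenly spaced hues."""
    kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    colors = {}
    for i, kmer in enumerate(kmers):
        # Spread hues around the color wheel; saturation and value are fixed.
        r, g, b = colorsys.hsv_to_rgb(i / len(kmers), 0.8, 0.8)
        colors[kmer] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


# All 16 dinucleotides over the DNA alphabet.
dinucleotide_colors = kmer_color_dict("ACGT", 2)
print(dinucleotide_colors["AA"], dinucleotide_colors["TT"])
```

The number of k-mers grows as the alphabet size to the power k, so for k around 5 or 6 it may be more practical to color only the k-mers that actually occur in the dataframe.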

atareen commented 3 years ago

Great, I'll close the issue.