glam2 for variable length motif discovery

seb-mueller commented 8 months ago

First of all thanks for this fantastic package! In addition to streme and meme, I was wondering if it is possible to somehow also Glam2? I'd imagine this could be rather straightforward to achieve since it is part of the meme install (glam2) which this package relies on. However I haven't found the function to do so. Is it possible to use it out of the box or get it to work by tweaking some of the available code? Thanks!

snystrom commented 8 months ago

Hey Seb,

There's no builtin way to do this in memes today. However, it could be added. In general, my philosophy has been to wait until someone asks before adding a tool if it's one I don't use.

I don't use this tool much, so if you could share some example inputs & outputs from the cli that would help a lot to make sure the function can parse all the relevant outputs. Taking a look at the meme website, I see the html and txt file outputs. It looks like the meat of the data lives in the rather gnarly .txt format. This will unfortunately need a custom parser, so will be more challenging to implement. If you run this tool locally does it return an XML file or anything else in the output directory? If so, mind attaching one for reference? I have had luck avoiding writing txt parsers by reverse engineering the xml outputs instead.

Full disclosure, I no longer maintain this as part of my day job, so progress has been much slower than I'd like, but I have built up a batch of work that needs doing on the repo, so can tackle this one, too.

snystrom commented 8 months ago

If you were interested in implementing this feature yourself, I would happily provide advice & consider any PR's if you wanted to make them. Just let me know!

seb-mueller commented 8 months ago

Thanks a lot for your feedback. I didn't realize it was an entirely different output compared to meme/streme, in that case it is not as easy as I naively hoped. I actually didn't use the glam2 cli just yet, since I hoped the package could save me dealing with it. However, I've just dabbled with glam2 v5.5.4 on the following fasta file (just the first few lines):

>sly-miR156d-5p
UGACAGAAGAGAGUGAGCAC
>sly-miR5303
UUUUUGAAGAGUUCGAGCAAC
>sly-miR5300
UCCCCAGUCCAGGCAUUCCAAC
>sly-miR5304
UCAAUGCUACAUACUCAUCCC
>sly-miR9469-3p
AUUCGGUCUUCUUAUGUGGAC

The command glam2 n tmp.fa produced a couple of eps (the motifs) as well as glam2.txt with the following content (not complete since the motifs are all the same format):

GLAM2: Gapped Local Alignment of Motifs
Version 1056

glam2 n tmp.fa
Sequences: 5
Greatest sequence length: 22
Residue counts: a=28 c=28 g=21 t=28 N=0

Score: 10.9989  Columns: 2  Sequences: 5

                **
sly-miR156d- 19 ac 20 + 3.25
sly-miR5303  20 ac 21 + 3.25
sly-miR5300  21 ac 22 + 3.25
sly-miR5304  13 ac 14 + 3.25
sly-miR9469- 20 ac 21 + 3.25

                ac

 a  c  g  t Del Ins Score
 5  0  0  0   0      5.53
                  0 -0.0655
 0  5  0  0   0      5.53

Score: 10.9989  Columns: 2  Sequences: 5

                **
sly-miR156d-  3 ac  4 + 3.25
sly-miR5303  20 ac 21 + 3.25
sly-miR5300  21 ac 22 + 3.25
sly-miR5304   9 ac 10 + 3.25
sly-miR9469- 20 ac 21 + 3.25

Which is indeed the gnarlying format you were talking about. However, glam2 does ship with a python parser (glam2psfm) which can convert it into a meme format, which might be of help: python glam2psfm glam2.txt > glam2.meme:

MEME version 4.0

ALPHABET= ACGT

strands: + -

Background letter frequencies (from dataset without add-one prior applied):
A 0.267 C 0.267 G 0.200 T 0.267

MOTIF  1

BL   MOTIF 1 width=2 seqs=5
letter-probability matrix: alength= 4 w= 2 nsites= 5 E= 1
 1.000000  0.000000  0.000000  0.000000
 0.000000  1.000000  0.000000  0.000000

MOTIF  2

BL   MOTIF 2 width=2 seqs=5
letter-probability matrix: alength= 4 w= 2 nsites= 5 E= 1
 1.000000  0.000000  0.000000  0.000000
 0.000000  1.000000  0.000000  0.000000

MOTIF  3

BL   MOTIF 3 width=2 seqs=5
letter-probability matrix: alength= 4 w= 2 nsites= 5 E= 1
 1.000000  0.000000  0.000000  0.000000
 0.000000  1.000000  0.000000  0.000000

Would this pose a suitable format? If yes, I could try to create a PR to implement this as a function with some handholding (since I don't know how to best tackle this yet).

snystrom commented 8 months ago

Nice find! It looks like there is quite a bit of information loss on glam2psfm unfortunately unless all you cared about were the motifs themselves (without the associated score data, etc.). I think ultimately, we would want to extract all relevant information.

A good example implementation of a function wrapper is over in R/streme.R. I use cmdfun to make a nice cli wrapper, parse the cli help text, & catch warnings, then write separate ingest methods for the output files, ultimately returning a universalmotif_df where possible.

I think I can tackle this soon though, but if you wanna take a swing, go for it.

seb-mueller commented 8 months ago

Indeed score data etc. would be important. I'll have a look at your linked sources, I hope I can get my head around it. Regardless, maybe one path forward could be to tweak glam2psfm to include the scores etc. I couldn't find the source code only, hence I copied it below (python script):

#!/opt/conda/bin/python

# Read glam2 output, and write it in MEME's PSFM format.  This can be
# used as input to Tomtom.

import fileinput, string

def do_letter_counts(line):
    words = line.split()[2:-1]
    letters = [i.split('=')[0] for i in words]
    counts = [i.split('=')[1] for i in words]
    counts = list(map((lambda x: int(x) + 1e-100), counts))    # avoid divide by zero
    tot = sum(counts)
    probs = [i * 1.0 / tot for i in counts]
    probs = ['%.3f' % i for i in probs]
    print()
    print('ALPHABET= ' + (''.join(letters)).upper())
    print()
    print('strands: + -')
    print()
    print('Background letter frequencies (from dataset without add-one prior applied):')
    # avoid "generator expression", for stupid old versions of python:
    print(' '.join([ i.upper() + ' ' + j for i, j in zip(letters, probs)]))
    print()

def write_matrix(count_matrix, motif_number, tot_seq_num):
    width = len(count_matrix)
    nsites = sum(count_matrix[0])
    alength = len(count_matrix[0]) - 1
    print('MOTIF  ' + str(motif_number))
    print()
    print('BL   MOTIF ' + str(motif_number), end=' ')
    print('width=' + str(width), 'seqs=' + str(tot_seq_num))
    print('letter-probability matrix:', end=' ')
    print('alength=', str(alength), 'w=', str(width), 'nsites=', str(nsites), 'E= 1')
    for counts in count_matrix:
        counts.pop()  # remove the deletion count
        counts = list(map((lambda x: x + 1e-100), counts))    # avoid divide by zero
        tot = sum(counts)
        probs = [i * 1.0 / tot for i in counts]
        probs = ['%f' % i for i in probs]
        print(' ' + '  '.join(probs))
    print()

matrices = []
state = 0

for line in fileinput.input():
    if state == 0:
        if line.startswith('Version'):
            print('MEME version 4.0')
            #print 'GLAM2 version ' + line.split()[1] + '\n'
        elif line.startswith('Sequences'):
            tot_seq_num = line.split()[1]
        elif line.startswith('Residue counts'):
            do_letter_counts(line)
        elif line.rstrip().endswith('Del Ins Score'):
            matrices.append([])
            state = 1
    else:
        words = line.split()
        if len(words) > 2:
            words.pop()  # remove the score
            counts = list(map(int, words))
            matrices[-1].append(counts)
        elif not words:
            state = 0

for i, m in enumerate(matrices):
    write_matrix(m, i+1, tot_seq_num)

snystrom / memes

glam2 for variable length motif discovery #112