Open seb-mueller opened 8 months ago
Hey Seb,
There's no builtin way to do this in memes
today. However, it could be added. In general, my philosophy has been to wait until someone asks before adding a tool if it's one I don't use.
I don't use this tool much, so if you could share some example inputs & outputs from the cli that would help a lot to make sure the function can parse all the relevant outputs. Taking a look at the meme website, I see the html and txt file outputs. It looks like the meat of the data lives in the rather gnarly .txt format. This will unfortunately need a custom parser, so will be more challenging to implement. If you run this tool locally does it return an XML file or anything else in the output directory? If so, mind attaching one for reference? I have had luck avoiding writing txt parsers by reverse engineering the xml outputs instead.
Full disclosure, I no longer maintain this as part of my day job, so progress has been much slower than I'd like, but I have built up a batch of work that needs doing on the repo, so can tackle this one, too.
If you were interested in implementing this feature yourself, I would happily provide advice & consider any PR's if you wanted to make them. Just let me know!
Thanks a lot for your feedback. I didn't realize it was an entirely different output compared to meme/streme, in that case it is not as easy as I naively hoped. I actually didn't use the glam2 cli just yet, since I hoped the package could save me dealing with it. However, I've just dabbled with glam2 v5.5.4 on the following fasta file (just the first few lines):
>sly-miR156d-5p
UGACAGAAGAGAGUGAGCAC
>sly-miR5303
UUUUUGAAGAGUUCGAGCAAC
>sly-miR5300
UCCCCAGUCCAGGCAUUCCAAC
>sly-miR5304
UCAAUGCUACAUACUCAUCCC
>sly-miR9469-3p
AUUCGGUCUUCUUAUGUGGAC
The command glam2 n tmp.fa
produced a couple of eps (the motifs) as well as glam2.txt
with the following content (not complete since the motifs are all the same format):
GLAM2: Gapped Local Alignment of Motifs
Version 1056
glam2 n tmp.fa
Sequences: 5
Greatest sequence length: 22
Residue counts: a=28 c=28 g=21 t=28 N=0
Score: 10.9989 Columns: 2 Sequences: 5
**
sly-miR156d- 19 ac 20 + 3.25
sly-miR5303 20 ac 21 + 3.25
sly-miR5300 21 ac 22 + 3.25
sly-miR5304 13 ac 14 + 3.25
sly-miR9469- 20 ac 21 + 3.25
ac
a c g t Del Ins Score
5 0 0 0 0 5.53
0 -0.0655
0 5 0 0 0 5.53
Score: 10.9989 Columns: 2 Sequences: 5
**
sly-miR156d- 3 ac 4 + 3.25
sly-miR5303 20 ac 21 + 3.25
sly-miR5300 21 ac 22 + 3.25
sly-miR5304 9 ac 10 + 3.25
sly-miR9469- 20 ac 21 + 3.25
Which is indeed the gnarlying format you were talking about.
However, glam2 does ship with a python parser (glam2psfm) which can convert it into a meme format, which might be of help: python glam2psfm glam2.txt > glam2.meme
:
MEME version 4.0
ALPHABET= ACGT
strands: + -
Background letter frequencies (from dataset without add-one prior applied):
A 0.267 C 0.267 G 0.200 T 0.267
MOTIF 1
BL MOTIF 1 width=2 seqs=5
letter-probability matrix: alength= 4 w= 2 nsites= 5 E= 1
1.000000 0.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
MOTIF 2
BL MOTIF 2 width=2 seqs=5
letter-probability matrix: alength= 4 w= 2 nsites= 5 E= 1
1.000000 0.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
MOTIF 3
BL MOTIF 3 width=2 seqs=5
letter-probability matrix: alength= 4 w= 2 nsites= 5 E= 1
1.000000 0.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
Would this pose a suitable format? If yes, I could try to create a PR to implement this as a function with some handholding (since I don't know how to best tackle this yet).
Nice find! It looks like there is quite a bit of information loss on glam2psfm
unfortunately unless all you cared about were the motifs themselves (without the associated score data, etc.). I think ultimately, we would want to extract all relevant information.
A good example implementation of a function wrapper is over in R/streme.R. I use cmdfun to make a nice cli wrapper, parse the cli help text, & catch warnings, then write separate ingest methods for the output files, ultimately returning a universalmotif_df
where possible.
I think I can tackle this soon though, but if you wanna take a swing, go for it.
Indeed score data etc. would be important. I'll have a look at your linked sources, I hope I can get my head around it.
Regardless, maybe one path forward could be to tweak glam2psfm
to include the scores etc. I couldn't find the source code only, hence I copied it below (python script):
#!/opt/conda/bin/python
# Read glam2 output, and write it in MEME's PSFM format. This can be
# used as input to Tomtom.
import fileinput, string
def do_letter_counts(line):
words = line.split()[2:-1]
letters = [i.split('=')[0] for i in words]
counts = [i.split('=')[1] for i in words]
counts = list(map((lambda x: int(x) + 1e-100), counts)) # avoid divide by zero
tot = sum(counts)
probs = [i * 1.0 / tot for i in counts]
probs = ['%.3f' % i for i in probs]
print()
print('ALPHABET= ' + (''.join(letters)).upper())
print()
print('strands: + -')
print()
print('Background letter frequencies (from dataset without add-one prior applied):')
# avoid "generator expression", for stupid old versions of python:
print(' '.join([ i.upper() + ' ' + j for i, j in zip(letters, probs)]))
print()
def write_matrix(count_matrix, motif_number, tot_seq_num):
width = len(count_matrix)
nsites = sum(count_matrix[0])
alength = len(count_matrix[0]) - 1
print('MOTIF ' + str(motif_number))
print()
print('BL MOTIF ' + str(motif_number), end=' ')
print('width=' + str(width), 'seqs=' + str(tot_seq_num))
print('letter-probability matrix:', end=' ')
print('alength=', str(alength), 'w=', str(width), 'nsites=', str(nsites), 'E= 1')
for counts in count_matrix:
counts.pop() # remove the deletion count
counts = list(map((lambda x: x + 1e-100), counts)) # avoid divide by zero
tot = sum(counts)
probs = [i * 1.0 / tot for i in counts]
probs = ['%f' % i for i in probs]
print(' ' + ' '.join(probs))
print()
matrices = []
state = 0
for line in fileinput.input():
if state == 0:
if line.startswith('Version'):
print('MEME version 4.0')
#print 'GLAM2 version ' + line.split()[1] + '\n'
elif line.startswith('Sequences'):
tot_seq_num = line.split()[1]
elif line.startswith('Residue counts'):
do_letter_counts(line)
elif line.rstrip().endswith('Del Ins Score'):
matrices.append([])
state = 1
else:
words = line.split()
if len(words) > 2:
words.pop() # remove the score
counts = list(map(int, words))
matrices[-1].append(counts)
elif not words:
state = 0
for i, m in enumerate(matrices):
write_matrix(m, i+1, tot_seq_num)
First of all thanks for this fantastic package! In addition to streme and meme, I was wondering if it is possible to somehow also Glam2? I'd imagine this could be rather straightforward to achieve since it is part of the meme install (
glam2
) which this package relies on. However I haven't found the function to do so. Is it possible to use it out of the box or get it to work by tweaking some of the available code? Thanks!