I would like to identify all unique kmers of a certain length in a set of input fasta sequences and then report the origin of the kmer as well as the kmer motif.
To do this, I can use the following jellyfish commands manually:
k=9 # for example
jellyfish count -m${k} -s100M -C reference.fasta
jellyfish dump -L 1 -U 1 -c mer_counts.jf
# do some sort of grep to get the headers of the original fasta sequences based on the last output
How can this be translated to Python code? It looks like the dump command can be approximated, but the example you give requires a pre-constructed database file. The Python approximation of the count command, on the other hand, stores things inside a HashCounter, not a Jellyfish database file...
Is it easiest to approach this problem with a bash script (i.e. using grep), or is there a straightforward way in Python?
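To make the goal concrete, here is a minimal pure-Python sketch of the result I am after. It does not use the Jellyfish bindings at all, so it is only a stand-in for illustration. The function names (`canonical`, `read_fasta`, `unique_kmers`) are hypothetical. I am assuming that canonical counting (as with `-C`) and skipping k-mers containing non-ACGT characters is the desired behaviour. Everything is held in a plain dict, so this would likely be far too memory-hungry for large genomes.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def unique_kmers(records, k):
    """Map each canonical k-mer seen exactly once to the header it came from."""
    counts = defaultdict(int)
    origin = {}
    for header, seq in records:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip k-mers with ambiguous bases such as N.
            if set(kmer) - set("ACGT"):
                continue
            key = canonical(kmer)
            counts[key] += 1
            # Remember the first sequence in which this k-mer occurred.
            origin.setdefault(key, header)
    return {kmer: origin[kmer] for kmer, n in counts.items() if n == 1}


if __name__ == "__main__":
    # Tiny in-memory example; a real run would pass an open FASTA file instead.
    demo = [">seq1", "ACGTA", ">seq2", "ACGTT"]
    for kmer, header in unique_kmers(read_fasta(demo), k=3).items():
        print(kmer, header)
```

For a real input, something like `unique_kmers(read_fasta(open("reference.fasta")), k=9)` would replace the demo data. What I am hoping for is an equivalent that lets Jellyfish do the counting.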
Thanks a bunch! ~Lina