gmarcais / Jellyfish

A fast multi-threaded k-mer counter

More complicated queries from within python code #76

Open lfaller opened 7 years ago

lfaller commented 7 years ago

Hi all,

I would like to identify all unique k-mers (those occurring exactly once) of a given length in a set of input FASTA sequences and then report, for each one, the k-mer itself and the sequence it came from.

To do this, I can use the following jellyfish commands manually:

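k=9  # k-mer length (example value)
# count canonical (-C) k-mers; the output goes to mer_counts.jf by default
jellyfish count -m "${k}" -s 100M -C reference.fasta
# dump only k-mers whose count is exactly 1 (-L 1 -U 1), one "k-mer count" pair per line (-c)
jellyfish dump -L 1 -U 1 -c mer_counts.jf

# then grep the original FASTA file for each dumped k-mer to recover the header of its source sequence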

How can this be translated to Python code? It looks like the dump command can be approximated, but the example you give requires a pre-built database file. The Python counterpart of the count command, on the other hand, stores its results in a HashCounter rather than in a Jellyfish database file, so I don't see how to chain the two steps.

Is it easiest to approach this problem with a bash script (i.e. using grep), or is there a straightforward way in Python?
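
To make the goal concrete, here is a minimal plain-Python sketch of the result I am after. It does not use the Jellyfish bindings at all and keeps every k-mer in memory, so it is only practical for small inputs. The helper names (`canonical`, `read_fasta`, `unique_kmers_with_origin`) are made up for illustration, and the canonical form is my assumption about what `-C` does: the lexicographically smaller of a k-mer and its reverse complement.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement.

    Assumption: this approximates what `jellyfish count -C` treats as one k-mer.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def unique_kmers_with_origin(records, k):
    """Map each canonical k-mer seen exactly once to the header of its source."""
    counts = defaultdict(int)
    origin = {}
    for header, seq in records:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip k-mers containing N or other ambiguity codes
            key = canonical(kmer)
            counts[key] += 1
            origin[key] = header  # only meaningful when the final count is 1
    return {kmer: origin[kmer] for kmer, n in counts.items() if n == 1}
```

With the file from the commands above, this would be called as `unique_kmers_with_origin(read_fasta(open("reference.fasta")), 9)`, returning a dictionary from each unique k-mer to the header of the sequence containing it.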

Thanks a bunch! ~Lina