gmarcais / Jellyfish

A fast multi-threaded k-mer counter
Other
463 stars 136 forks source link

[request] Example of string to map of kmer and counts #23

Closed danpolanco closed 9 years ago

danpolanco commented 9 years ago

Before I dive into the source, I wanted to see how hard it would be to use Jellyfish as a library to go from sequence to kmer and count.

My function interface would look something like:

#include <string>
#include <map>
std::map <std::string, unsigned int> count_kmers( std::string &sequence );

And then in the definition, perhaps something like:

#include ....
#include "ext/jellyfish/whole_sequence_parser.hpp"
#include "ext/jellyfish/jellyfish.hpp"
#include ...

using std::map;
using std::string;

map <string, unsigned int> count_kmers( string &sequence ) {
    map <string, unsigned int> counted_kmers;
    // open file
    // jellyfish code?
    return counted_kmers;
}

Right now I use system() but I would love to include it as a native library instead.

Btw, great work! I looked at a few other k-mer counters and came back to Jellyfish shortly after :+1:

Edit: removed some unnecessary code.

gmarcais commented 9 years ago

Hi Dan,

Jellyfish does not count k-mers in a standard map. So using Jellyfish to parse a sequence is not going to have that function signature. Second, if your sequence is not very big (likely if coming from a std::string), then using Jellyfish is probably not going to get you much.

Now, if you are parsing sequence from reasonably large file, I can provide an example on how to parse it directly with Jellyfish in C++. The result will be in a Jellyfish hash array which is not a drop in replacement for a std::map.

If this is what you want, I'll add such an example in the examples directory.

danpolanco commented 9 years ago

As it stands, I import the genome in a std::string and then split each contig into its own string. Then, I process each contig by itself including using Jellyfish to count k-mers. I'm working with the human genome by the way.

In the future I'll also process the whole genome and the contigs sequentially. I'm new to C++ however and I don't know what problems I'll run into once I process the whole genome along side each contig. For now though each contig in a string works well.

In any case, I should be able to read out all of the k-mers in the JF hash array and their counts, correct? If so, I think an example would help greatly if you can spare the time. If I can get those two things (k-mer and associated count) then I don't need a map.

FYI: I tried creating my own counter and it was horrifically slow even with a single contig haha. So Jellyfish is definitely a life line for my work.

danpolanco commented 9 years ago

I was able to figure it out using one of your examples. Thanks!