refresh-bio / KMC

Fast and frugal disk based k-mer counter
easy way to dump kmers with certain prefixes? #112

Open chunlinxiao opened 5 years ago

chunlinxiao commented 5 years ago

Hi,

Is there a fast way for kmc to dump only the k-mers with certain prefixes, without scanning the whole kmc database?

thanks

chunlin

marekkokot commented 5 years ago

Hi,

It would be possible, but our software does not currently support such an option. You could implement it yourself, although it is not trivial. K-mers in a kmc database (the .kmc_pre and .kmc_suf files) are distributed among a number of bins (512 in default mode). Within each bin the k-mers are sorted (they are stored with an additional lookup table (LUT) to save space and speed up searching), so a binary search can find the boundaries of the k-mers with a given prefix in each bin. In some cases, however, k-mers are not distributed among bins: when the kmc database is the output of most kmc_tools operations, or when kmc was run with a small k (about 14 or less). A value stored in the database indicates which of the two cases applies.
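To illustrate the LUT idea mentioned above, here is a minimal in-memory sketch. It is not KMC's on-disk layout (API.pdf describes that); the names PrefixLut, build_lut and lut_range, and the 2-bit packing with the first symbol in the highest bits, are assumptions made only for this illustration. The table stores, for every possible prefix of lut_len symbols, the index at which that prefix's run of sorted k-mers begins, so a prefix range can be read off without searching.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified in-memory model (an assumption, not KMC's file format):
// k-mers are 2-bit packed integers (k <= 31), sorted in ascending order.
struct PrefixLut {
    unsigned k;                      // k-mer length
    unsigned lut_len;                // number of leading symbols indexed by the table
    std::vector<std::size_t> start;  // 4^lut_len + 1 entries; start.back() == number of k-mers
};

// Build the table from an already sorted list of packed k-mers.
inline PrefixLut build_lut(const std::vector<std::uint64_t>& sorted,
                           unsigned k, unsigned lut_len) {
    PrefixLut lut{k, lut_len,
                  std::vector<std::size_t>((std::size_t{1} << (2 * lut_len)) + 1, 0)};
    const unsigned shift = 2 * (k - lut_len);
    // Count the k-mers of each prefix, stored one slot to the right...
    for (std::uint64_t v : sorted) ++lut.start[(v >> shift) + 1];
    // ...then prefix-sum, so start[p] is the index of the first k-mer with prefix p.
    for (std::size_t i = 1; i < lut.start.size(); ++i) lut.start[i] += lut.start[i - 1];
    return lut;
}

// Half-open index range [first, last) of the k-mers whose leading lut_len
// symbols pack to prefix_code.
inline std::pair<std::size_t, std::size_t>
lut_range(const PrefixLut& lut, std::uint64_t prefix_code) {
    return {lut.start[prefix_code], lut.start[prefix_code + 1]};
}
```

For a query prefix longer than lut_len, the range returned by the table would still have to be narrowed, for example with a binary search inside it.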

You can always convert the more complex case, in which k-mers are distributed among bins, to the simpler one, in which all k-mers form a single sorted sequence, using the kmc_tools sort operation.
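Once all k-mers are in one sorted sequence, the k-mers sharing a prefix form a contiguous run, and its boundaries can be found with two binary searches. The sketch below shows this on an in-memory vector; it does not read a kmc database, and encode_kmer, prefix_range and the 2-bit packing (A=0, C=1, G=2, T=3, first symbol in the highest bits, k at most 31) are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Pack a DNA string (A, C, G, T only, at most 31 symbols) into 2 bits per
// symbol, first symbol most significant, so that numeric order matches
// lexicographic order. Any symbol other than A, C, G is treated as T.
inline std::uint64_t encode_kmer(const std::string& s) {
    std::uint64_t v = 0;
    for (char c : s) {
        const std::uint64_t code = (c == 'A') ? 0 : (c == 'C') ? 1 : (c == 'G') ? 2 : 3;
        v = (v << 2) | code;
    }
    return v;
}

// Return the half-open index range [first, last) of the k-mers in `sorted`
// that start with `prefix`. `sorted` must hold packed k-mers in ascending order.
inline std::pair<std::size_t, std::size_t>
prefix_range(const std::vector<std::uint64_t>& sorted, unsigned k, const std::string& prefix) {
    assert(k <= 31 && prefix.size() <= k);
    const unsigned shift = 2 * (k - static_cast<unsigned>(prefix.size()));
    const std::uint64_t lo = encode_kmer(prefix) << shift;      // smallest k-mer with the prefix
    const std::uint64_t hi = lo + (std::uint64_t{1} << shift);  // first value past the prefix
    const auto first = std::lower_bound(sorted.begin(), sorted.end(), lo);
    const auto last = std::lower_bound(first, sorted.end(), hi);
    return {static_cast<std::size_t>(first - sorted.begin()),
            static_cast<std::size_t>(last - sorted.begin())};
}
```

For a database that is still split into bins, the same search would have to be repeated in every bin and the results combined.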

The binary format of the database is described in this document: https://github.com/refresh-bio/KMC/blob/master/API.pdf In the case when all k-mers are sorted and not distributed to bins the database format is described here: https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-14-160/MediaObjects/12859_2012_5911_MOESM1_ESM.pdf

If you decide to implement it, you can use the kmc_dump source code as a reference, because it supports both database formats. kmc_dump is built on kmc_api, whose source code is available here: https://github.com/refresh-bio/KMC/tree/master/kmc_api

You can also use the kmc_tools source code as a reference. It is more complex and will take more time to understand, but the kmc_tools dump operation is a little faster than kmc_dump.

If you have any questions, do not hesitate to ask.

We will consider adding support for a dump operation with a specified prefix, but unfortunately, even if we decide to implement it, it will certainly not be in the near future.

chunlinxiao commented 5 years ago

thank you marekkokot very much for the detailed reply.

I just installed your kmc3.1.1 successfully from source. But when I tried to compile kmc_dump_sample.cpp in the kmc_dump_sample directory using g++ (5.4.0), I got the following error:

g++ kmc_dump_sample.cpp

In file included from kmc_dump_sample.cpp:16:0:
../kmc_api/kmc_file.h:90:8: error: expected nested-name-specifier before ‘super_kmers_t’
  using super_kmers_t = std::vector<std::tuple<uint32, uint32, uint32>>;//start_pos, len, bin_n
        ^
../kmc_api/kmc_file.h:91:58: error: ‘super_kmers_t’ has not been declared
  void GetSuperKmers(const std::string& transformed_read, super_kmers_t& super_kmers);

did I miss anything?

thanks

marekkokot commented 5 years ago

Hi, you probably need to specify -std=c++11 as well, but you may then still get undefined references at link time. Try replacing the contents of the makefile in the main directory with:

all: kmc

KMC_BIN_DIR = bin
KMC_MAIN_DIR = kmer_counter
KMC_API_DIR = kmc_api
KMC_DUMP_DIR = kmc_dump
KMC_DUMP_SAMPLE_DIR = kmc_dump_sample
KMC_TOOLS_DIR = kmc_tools

CC  = g++
CFLAGS  = -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 
CLINK   = -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 

KMC_TOOLS_CFLAGS    = -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14
KMC_TOOLS_CLINK = -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14

DISABLE_ASMLIB = false

KMC_OBJS = \
$(KMC_MAIN_DIR)/kmer_counter.o \
$(KMC_MAIN_DIR)/mmer.o \
$(KMC_MAIN_DIR)/mem_disk_file.o \
$(KMC_MAIN_DIR)/rev_byte.o \
$(KMC_MAIN_DIR)/bkb_writer.o \
$(KMC_MAIN_DIR)/cpu_info.o \
$(KMC_MAIN_DIR)/bkb_reader.o \
$(KMC_MAIN_DIR)/fastq_reader.o \
$(KMC_MAIN_DIR)/timer.o \
$(KMC_MAIN_DIR)/develop.o \
$(KMC_MAIN_DIR)/kb_completer.o \
$(KMC_MAIN_DIR)/kb_storer.o \
$(KMC_MAIN_DIR)/kmer.o \
$(KMC_MAIN_DIR)/prob_qual.o
RADULS_OBJS = \
$(KMC_MAIN_DIR)/raduls_sse2.o \
$(KMC_MAIN_DIR)/raduls_sse41.o \
$(KMC_MAIN_DIR)/raduls_avx2.o \
$(KMC_MAIN_DIR)/raduls_avx.o 

KMC_LIBS = \
$(KMC_MAIN_DIR)/libs/libz.a \
$(KMC_MAIN_DIR)/libs/libbz2.a

KMC_DUMP_OBJS = \
$(KMC_DUMP_DIR)/nc_utils.o \
$(KMC_DUMP_DIR)/kmc_dump.o 

KMC_DUMP_SAMPLE_OBJS = \
$(KMC_DUMP_SAMPLE_DIR)/kmc_dump_sample.o

KMC_API_OBJS = \
$(KMC_API_DIR)/mmer.o \
$(KMC_API_DIR)/kmc_file.o \
$(KMC_API_DIR)/kmer_api.o

KMC_TOOLS_OBJS = \
$(KMC_TOOLS_DIR)/kmc_header.o \
$(KMC_TOOLS_DIR)/kmc_tools.o \
$(KMC_TOOLS_DIR)/nc_utils.o \
$(KMC_TOOLS_DIR)/parameters_parser.o \
$(KMC_TOOLS_DIR)/parser.o \
$(KMC_TOOLS_DIR)/tokenizer.o \
$(KMC_TOOLS_DIR)/fastq_filter.o \
$(KMC_TOOLS_DIR)/fastq_reader.o \
$(KMC_TOOLS_DIR)/fastq_writer.o \
$(KMC_TOOLS_DIR)/percent_progress.o

KMC_TOOLS_LIBS = \
$(KMC_TOOLS_DIR)/libs/libz.a \
$(KMC_TOOLS_DIR)/libs/libbz2.a 

ifeq ($(DISABLE_ASMLIB),true)
    CFLAGS += -DDISABLE_ASMLIB
    KMC_TOOLS_CFLAGS += -DDISABLE_ASMLIB
else
    KMC_LIBS += \
    $(KMC_MAIN_DIR)/libs/libaelf64.a 
    KMC_TOOLS_LIBS += \
    $(KMC_TOOLS_DIR)/libs/libaelf64.a 
endif   

$(KMC_OBJS) $(KMC_DUMP_OBJS) $(KMC_API_OBJS) $(KMC_DUMP_SAMPLE_OBJS): %.o: %.cpp
	$(CC) $(CFLAGS) -c $< -o $@

$(KMC_TOOLS_OBJS): %.o: %.cpp
	$(CC) $(KMC_TOOLS_CFLAGS) -c $< -o $@

$(KMC_MAIN_DIR)/raduls_sse2.o: $(KMC_MAIN_DIR)/raduls_sse2.cpp
	$(CC) $(CFLAGS) -msse2 -c $< -o $@
$(KMC_MAIN_DIR)/raduls_sse41.o: $(KMC_MAIN_DIR)/raduls_sse41.cpp
	$(CC) $(CFLAGS) -msse4.1 -c $< -o $@
$(KMC_MAIN_DIR)/raduls_avx.o: $(KMC_MAIN_DIR)/raduls_avx.cpp
	$(CC) $(CFLAGS) -mavx -fabi-version=0 -c $< -o $@
$(KMC_MAIN_DIR)/raduls_avx2.o: $(KMC_MAIN_DIR)/raduls_avx2.cpp
	$(CC) $(CFLAGS) -mavx2 -mfma -fabi-version=0 -c $< -o $@
$(KMC_MAIN_DIR)/instrset_detect.o: $(KMC_MAIN_DIR)/libs/vectorclass/instrset_detect.cpp
	$(CC) $(CFLAGS) -c $< -o $@

kmc: $(KMC_OBJS) $(RADULS_OBJS) $(KMC_MAIN_DIR)/instrset_detect.o 
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_LIBS)

kmc_dump: $(KMC_DUMP_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^

kmc_dump_sample: $(KMC_DUMP_SAMPLE_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^

kmc_tools: $(KMC_TOOLS_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(KMC_TOOLS_CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_TOOLS_LIBS)

clean:
	-rm $(KMC_MAIN_DIR)/*.o
	-rm $(KMC_API_DIR)/*.o
	-rm $(KMC_DUMP_DIR)/*.o
	-rm $(KMC_TOOLS_DIR)/*.o
	-rm $(KMC_DUMP_SAMPLE_DIR)/*.o
	-rm -rf bin

all: kmc kmc_dump kmc_tools kmc_dump_sample

and then run:

make kmc_dump_sample

Let me know if it helps.
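For context on why -std=c++11 matters here: the first error reported above ("expected nested-name-specifier before ‘super_kmers_t’") is what g++ prints when it meets a `using` alias declaration while compiling in a pre-C++11 mode, and GCC 5 defaults to -std=gnu++98. The snippet below is a small standalone sketch of the same kind of declaration, not code from kmc_api: it substitutes std::uint32_t for KMC's own uint32 typedef, and add_super_kmer is a hypothetical helper added only so the alias is used.

```cpp
#include <cstdint>
#include <tuple>
#include <vector>

// An alias declaration like the one in kmc_file.h. The `using X = ...;` form,
// std::tuple, and the closing `>>` without a space all require C++11 or later,
// so this line is rejected under -std=gnu++98.
using super_kmers_t =
    std::vector<std::tuple<std::uint32_t, std::uint32_t, std::uint32_t>>; // start_pos, len, bin_n

// Hypothetical helper (not part of kmc_api): append one (start_pos, len, bin_n) entry.
inline void add_super_kmer(super_kmers_t& out, std::uint32_t start_pos,
                           std::uint32_t len, std::uint32_t bin_n) {
    out.emplace_back(start_pos, len, bin_n);
}
```

Compiling a file like this with `g++ -std=c++11 -c` (or any later standard) should succeed, while GCC 5 without the flag would be expected to fail on the alias line in the same way as above.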

chunlinxiao commented 5 years ago

thank you very much - your new makefile works !

Before this, I did try $ g++ kmc_dump_sample.cpp -std=c++11

but it would give me the error with "error: ld returned 1 exit status".

I also tried the following with additional options (from your makefile) - similar errors below - do you have any suggestion for command line compilation ?

$ g++ kmc_dump_sample.cpp -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 -o test

kmc_dump_sample.cpp:(.text.startup+0x69): undefined reference to `CKMCFile::CKMCFile()'
kmc_dump_sample.cpp:(.text.startup+0x18c): undefined reference to `CKMCFile::~CKMCFile()'
kmc_dump_sample.cpp:(.text.startup+0x265): undefined reference to `CKMCFile::OpenForListing(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
kmc_dump_sample.cpp:(.text.startup+0x2b5): undefined reference to `CKMCFile::Info(unsigned int&, unsigned int&, unsigned int&, unsigned int&, unsigned int&, unsigned int&, unsigned long long&, unsigned long long&)'
kmc_dump_sample.cpp:(.text.startup+0x39e): undefined reference to `CKMCFile::ReadNextKmer(CKmerAPI&, float&)'
kmc_dump_sample.cpp:(.text.startup+0x436): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x479): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x49c): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x4c0): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x539): undefined reference to `CKmerAPI::char_codes'
/tmp/ccwLOWWX.o:kmc_dump_sample.cpp:(.text.startup+0x55c): more undefined references to `CKmerAPI::char_codes' follow
/tmp/ccwLOWWX.o: In function `main':
kmc_dump_sample.cpp:(.text.startup+0x5df): undefined reference to `CKMCFile::ReadNextKmer(CKmerAPI&, unsigned int&)'
kmc_dump_sample.cpp:(.text.startup+0x66e): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x6f8): undefined reference to `CKMCFile::Close()'
kmc_dump_sample.cpp:(.text.startup+0x740): undefined reference to `CKMCFile::SetMaxCount(unsigned int)'
kmc_dump_sample.cpp:(.text.startup+0x774): undefined reference to `CKMCFile::SetMinCount(unsigned int)'
kmc_dump_sample.cpp:(.text.startup+0x7c5): undefined reference to `CKMCFile::~CKMCFile()'
collect2: error: ld returned 1 exit status

marekkokot commented 5 years ago

This is because your command line does not compile all the necessary .cpp files: CKMCFile and CKmerAPI are defined in the kmc_api sources, so the linker cannot find them. If you really need to compile this way (which I do not recommend), use:

g++ -O3 -std=c++14 kmc_dump_sample.cpp ../kmc_api/*.cpp

Let me know if it works.

chunlinxiao commented 5 years ago

yes it works fine - thank you very much Marek !