chklopp opened this issue 2 years ago (status: open)
Hello Christophe, The main purpose of Reindeer is not only to locate, but to obtain the abundance of k-mers in a collection of samples. If your interest is only to find k-mer presence/absence, I can advise you with other methods. In particular, do you need extremely precise, k-mer-wise queries, or are you looking for gene-length sequences?
Yes, I work with exact k-mers because I'm interested in locating k-mers that occur only a small number of times in the genome. There are tens of millions of them, so I need an efficient way to locate them.
So you have an assembled genome and you would like the k-mers' positions in it?
Exactly. For each kmer I would like all the chr/positions at which it is located in the genome.
So in that case I would recommend https://github.com/COMBINE-lab/pufferfish
First, you need to index your genome with Pufferfish. Second, I quote Rob Patro on this:
pufferfish kquery -I <index> -q <fasta> --threads <nthreads> > result_file
where fasta is a file with one k-mer per fasta record, and the output is all written to stdout.
the file is a simple format, a SAM header, followed by a set of records.
the header line is the record id followed by the number of times the k-mer occurs (0 if it doesn't occur)
followed by one line per occurrence, of the format reference_id, position, orientation (but separated by '\t')
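For anyone post-processing result_file, here is a minimal Python sketch of a parser for the layout described above. It is not part of Pufferfish and has not been checked against real kquery output. It assumes that SAM-style header lines start with '@', that the record id and the count on each header line are separated by whitespace, and that the orientation is a single token. The function name parse_kquery_output is made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (reference_id, position, orientation), as described in the quoted text
Hit = Tuple[str, int, str]


def parse_kquery_output(lines: Iterable[str]) -> Dict[str, List[Hit]]:
    """Group occurrence lines under the k-mer record they belong to.

    Assumptions (not confirmed by the Pufferfish documentation):
    - SAM-style header lines start with '@' and can be skipped;
    - each record header is '<record id> <count>', whitespace-separated;
    - it is followed by exactly <count> tab-separated occurrence lines.
    """
    hits: Dict[str, List[Hit]] = {}
    it = iter(lines)
    for raw in it:
        line = raw.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # blank line or assumed SAM header line
        fields = line.split()
        record_id, count = fields[0], int(fields[1])
        occurrences: List[Hit] = []
        for _ in range(count):
            ref_id, pos, orientation = next(it).rstrip("\n").split("\t")
            occurrences.append((ref_id, int(pos), orientation))
        hits[record_id] = occurrences  # empty list when the k-mer is absent
    return hits
```

Typical use would be `with open("result_file") as fh: hits = parse_kquery_output(fh)`, which streams the file line by line instead of loading it whole.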
Thank you, I will try this.
I succeeded in installing Pufferfish and indexing the genome, but when I launched kquery I got:
There is no command "kquery" The valid commands to pufferfish are : {align, examine, index, kquery, lookup, stat, validate}
I opened an issue.
I would like to use a Reindeer index to locate millions of k-mers of interest in a genome. Converting the genome into a FASTA file of its k-mers would be slow and use a lot of disk space. Instead, I was wondering whether it is possible to query the index from a Python script that splits the genome into k-mers in memory, queries the index, and outputs the positions of the k-mers found in the index.
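To make the idea concrete, here is a minimal Python sketch of the in-memory part: stream the genome FASTA, slide a window of size k over each sequence, and record the chromosome and 0-based position of every k-mer of interest. It does not call Reindeer. A plain Python set stands in for the index query, because I am not assuming that any Python API exists for Reindeer. The names read_fasta and locate_kmers are hypothetical. The reverse-complement check assumes a k-mer on either strand counts as a match.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string (N is left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence name, upper-case sequence), one record at a time."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                header = line[1:].split()
                name, chunks = (header[0] if header else ""), []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def locate_kmers(
    records: Iterable[Tuple[str, str]], targets: Set[str], k: int
) -> Dict[str, List[Tuple[str, int, str]]]:
    """Map each target k-mer to its (chromosome, 0-based position, strand) hits.

    The set lookup below is a stand-in for an index query, not Reindeer itself.
    """
    positions: Dict[str, List[Tuple[str, int, str]]] = {}
    for chrom, seq in records:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in targets:
                positions.setdefault(kmer, []).append((chrom, i, "+"))
            rc = reverse_complement(kmer)
            if rc != kmer and rc in targets:
                positions.setdefault(rc, []).append((chrom, i, "-"))
    return positions
```

A call would look like `locate_kmers(read_fasta("genome.fa"), targets, k=31)`, where the file name and k are placeholders. Only one chromosome is held in memory at a time, but a Python set of tens of millions of k-mers needs several gigabytes of RAM and the pure-Python loop is slow on a large genome, so this shows the logic rather than a production tool.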