query index through api

kamimrcht / REINDEER

REINDEER REad Index for abuNDancE quERy

GNU Affero General Public License v3.0


query index through api #19

Open chklopp opened 2 years ago

chklopp commented 2 years ago

I would like to try using a Reindeer index to locate millions of k-mers of interest in a genome. Converting the genome into a FASTA file of its k-mers would be slow and use a lot of disk space, so I was wondering whether it is possible to query the index from a Python script that would split the genome into k-mers in memory, query the index, and output the positions found in the index.
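A minimal sketch of the in-memory splitting step described here, assuming the genome sequences are already loaded as Python strings. The function name `iter_kmers` and the 0-based position convention are illustrative assumptions, and the index query itself is not shown because the thread does not describe a Python API for Reindeer.

```python
from typing import Iterator, Tuple


def iter_kmers(name: str, seq: str, k: int) -> Iterator[Tuple[str, int, str]]:
    """Yield (sequence name, 0-based position, k-mer) for each window of length k.

    Hypothetical helper: nothing is written to disk, k-mers are produced lazily.
    """
    seq = seq.upper()
    # range() is empty when the sequence is shorter than k, so nothing is yielded.
    for pos in range(len(seq) - k + 1):
        yield name, pos, seq[pos:pos + k]


if __name__ == "__main__":
    # Toy sequence, not real data.
    for record in iter_kmers("chr1", "ACGTAC", 4):
        print(record)
```

Because the generator is lazy, only one k-mer is held in memory at a time, which is the point of avoiding the intermediate FASTA file.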

kamimrcht commented 2 years ago

Hello Christophe, The main purpose of Reindeer is not only to locate, but to obtain the abundance of k-mers in a collection of samples. If your interest is only to find k-mer presence/absence, I can advise you with other methods. In particular, do you need extremely precise, k-mer-wise queries, or are you looking for gene-length sequences?

chklopp commented 2 years ago

Yes, I work with exact k-mers because I'm interested in locating k-mers that occur only a small number of times in the genome. There are tens of millions of them, so I need an efficient way to locate them.

kamimrcht commented 2 years ago

So you have an assembled genome and you would like the k-mers' positions in it?

chklopp commented 2 years ago

Exactly. For each kmer I would like all the chr/positions at which it is located in the genome.
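To make the request concrete, here is a naive pure-Python sketch of "all chr/positions for each k-mer": it keeps the k-mers of interest in a set and scans each sequence once. This is only an illustration under stated assumptions (forward strand only, 0-based positions, all k-mers of equal length, hypothetical function name `locate_kmers`); it is not how Reindeer or any indexing tool works, and it is unlikely to be fast enough for tens of millions of k-mers on a large genome.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def locate_kmers(
    genome: Dict[str, str], kmers: Iterable[str]
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each k-mer of interest to its (sequence name, 0-based position) hits.

    Only the forward strand is scanned; reverse complements are ignored.
    """
    wanted = {kmer.upper() for kmer in kmers}
    if not wanted:
        return {}
    k = len(next(iter(wanted)))
    if any(len(kmer) != k for kmer in wanted):
        raise ValueError("all k-mers must have the same length")

    hits: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for chrom, seq in genome.items():
        seq = seq.upper()
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if kmer in wanted:  # set lookup, constant time on average
                hits[kmer].append((chrom, pos))
    return dict(hits)


if __name__ == "__main__":
    # Toy genome, not real data.
    toy_genome = {"chr1": "ACGTACGT", "chr2": "TTACGT"}
    print(locate_kmers(toy_genome, ["ACGT", "GGGG"]))
```

K-mers that never occur are simply absent from the returned dictionary.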

kamimrcht commented 2 years ago

So in that case I would recommend https://github.com/COMBINE-lab/pufferfish

First, you need to index your genome with Pufferfish. Second, I quote Rob Patro on this:

pufferfish kquery -I <index> -q <fasta> --threads <nthreads> > result_file

where fasta is a file with one k-mer per FASTA record, and the output is all written to stdout. The file is a simple format: a SAM header, followed by a set of records. The header line is the record id followed by the number of times the k-mer occurs (0 if it doesn't occur), followed by one line per occurrence, of the format reference_id, position, orientation (but separated by '\t').
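The output format quoted above could be read back in Python roughly as follows. This is a sketch built only from that description, not from Pufferfish documentation, and it rests on several assumptions: SAM header lines start with `@`, the record id and the count on a record's header line are separated by whitespace, and each occurrence line has exactly three tab-separated fields. The function name `parse_kquery` and the sample lines are invented for illustration and are not real Pufferfish output.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, str]  # (reference_id, position, orientation)


def parse_kquery(lines: Iterable[str]) -> Dict[str, List[Hit]]:
    """Parse records of the form: '<id> <count>' followed by <count> hit lines."""
    results: Dict[str, List[Hit]] = {}
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        # Skip blank lines and SAM header lines (assumed to start with '@').
        if not line or line.startswith("@"):
            continue
        fields = line.split()
        record_id, count = fields[0], int(fields[1])
        hits: List[Hit] = []
        for _ in range(count):
            # Consume the occurrence lines that belong to this record.
            ref, pos, orientation = next(it).split("\t")
            hits.append((ref, int(pos), orientation))
        results[record_id] = hits
    return results


if __name__ == "__main__":
    # Made-up sample following the described layout.
    sample = [
        "@HD\tVN:1.0\n",
        "kmer1\t2\n",
        "chr1\t10\t+\n",
        "chr2\t5\t-\n",
        "kmer2\t0\n",
    ]
    print(parse_kquery(sample))
```

In practice the function could be given an open file handle for `result_file`, since a file object iterates line by line.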

chklopp commented 2 years ago

Thank you, I will try this.

chklopp commented 2 years ago

I succeeded in installing Pufferfish and indexing the genome with it, but when I launched kquery I got:

There is no command "kquery"
The valid commands to pufferfish are : {align, examine, index, kquery, lookup, stat, validate}

I opened an issue.