To enable Open Virome to search via sequence input, an input sequence must be compared against the palmDB sOTU set, which is a FASTA file. See the schematic from palmID.
Performing this alignment, together with a percent-identity slider on the returned hits (see Issue #27), will enable searching all of palmDB/SRA based on a query sequence.
In my experience, this plot has been the most informative for interpreting this information.
It could be rendered if and only if a query contains a sequence-based entry (e.g. sOTU xxxx +20% divergence).
How to implement
palmID first performs a search to extract a palmprint. This is unnecessary here, and we can search the input sequence directly against palmDB sOTU. For now we should support protein sequence as input only.
This will return all hits in palmDB. Those sequence identifiers form the basis of the Virome Query.
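A minimal sketch of this search step, assuming the palmDB sOTU FASTA has already been converted to a DIAMOND database with `diamond makedb`. The function names, file paths, and the three-column output layout are illustrative assumptions, not part of Open Virome or palmID. The command is only built here, not executed.

```python
import csv
import io


def build_diamond_command(query_fasta, db_path, out_path, max_hits=200):
    """Return the argv list for a DIAMOND protein search (not executed here)."""
    return [
        "diamond", "blastp",
        "--query", query_fasta,   # protein FASTA supplied by the user
        "--db", db_path,          # hypothetical path to a palmDB sOTU DIAMOND database
        "--out", out_path,
        # Minimal tabular output: query id, subject (sOTU) id, percent identity.
        "--outfmt", "6", "qseqid", "sseqid", "pident",
        "--max-target-seqs", str(max_hits),
    ]


def parse_hits(tsv_text):
    """Parse the three-column tabular output into (sotu_id, pident) pairs."""
    hits = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        hits.append((row[1], float(row[2])))
    return hits
```

The argv list can be passed to `subprocess.run(..., check=True)`, and the contents of the output file handed to `parse_hits` to obtain the sOTU identifiers for the Virome Query.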
Limits
Return only the top 200 matches; otherwise queries for groups such as Narnaviruses will pull in 10,000s of sOTUs at once and grind to a halt.
Allow matches to be returned within a percent-identity range from 0 to 100, as reported in the diamond pident field (the output reporting could probably be trimmed to the minimal information we need).
In this mode, each sOTU returned to the query would only retrieve exact sOTU matches.
The output of the lambda for diamond, which is a list of sOTUs and their pid, can itself be used with palm_graph to query for known sOTUs related to them above a set identity threshold.
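The limits above could be applied to the parsed hits roughly as follows. This is a sketch only: `select_hits` and its parameters are hypothetical names, and hits are assumed to be `(sotu_id, pident)` pairs.

```python
def select_hits(hits, min_pident=0.0, max_pident=100.0, limit=200):
    """Keep hits inside the percent-identity range, best first, capped at `limit`."""
    in_range = [(sotu, pid) for sotu, pid in hits if min_pident <= pid <= max_pident]
    # Sort by percent identity, highest first, so the cap keeps the top matches.
    in_range.sort(key=lambda hit: hit[1], reverse=True)
    return in_range[:limit]
```

With a slider set to a minimum of 70% identity, a call such as `select_hits(hits, min_pident=70.0)` would return at most 200 sOTUs, ordered from most to least similar.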