[Feature] RdRp / sOTU search based on sequence

To let Open Virome search by sequence, the input sequence must be aligned against the palmDB sOTU set, which is a FASTA file. See the schematic from palmID.

Performing this alignment and returning hits within a percent-identity range set by a query slider (see Issue #27) will enable searching all of palmDB/SRA from a query sequence.

In my experience, this plot has been the most informative for interpreting this information. It could be rendered if and only if a query contains a sequence-based entry (e.g. sOTU xxxx + 20% divergence).

How to implement

palmID first performs a search to extract a palmprint. This is unnecessary here: we can search the input sequence directly against the palmDB sOTU. For now we should support protein sequence as input only.
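Since only protein input is supported for now, the lambda could reject nucleotide input before running the search. A minimal sketch, assuming a hypothetical helper `looks_like_protein` (not part of palmID) and an assumed heuristic that treats input made up of 90% or more `ACGTUN` letters as nucleotide:

```shell
# Hypothetical helper, not part of palmID or Open Virome.
# Reads FASTA on stdin; exits 0 if the sequence looks like protein, 1 otherwise.
looks_like_protein() {
  awk '
    /^>/ { next }                      # skip FASTA header lines
    {
      seq = toupper($0)
      gsub(/[ \t\r*]/, "", seq)        # drop whitespace and stop symbols
      if (seq ~ /[^A-Z]/) bad = 1      # anything other than letters is invalid
      total += length(seq)
      nt += gsub(/[ACGTUN]/, "", seq)  # count nucleotide-like letters
    }
    END {
      # Assumed heuristic: reject empty input, invalid characters,
      # or sequences that are >= 90% nucleotide-like letters.
      if (total == 0 || bad || nt / total >= 0.9) exit 1
      exit 0
    }'
}
```

For example, `looks_like_protein < "$OUTDIR/$OUTNAME.trim.fa" || exit 1` would stop before DIAMOND is invoked. The 90% threshold is a guess and would need tuning against real queries.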

# RUN DIAMOND =============================================
# Uses a PROTEIN SEQUENCE as input
# Assumes OUTDIR, OUTNAME and DB are already set

echo ''
echo '-- running DIAMOND search of palmDB...'
echo ''

# DIAMOND e-value cutoff: 1e-5; -k0 reports all hits
diamond blastp \
  -q "$OUTDIR/$OUTNAME.trim.fa" \
  -d "$DB" \
  --masking 0 -e 0.00001 \
  --tmpdir /tmp \
  --ultra-sensitive -k0 \
  -f 6 qseqid  qstart qend qlen \
       sseqid  sstart send slen \
       pident evalue cigar \
       full_sseq \
  > "$OUTDIR/$OUTNAME.pro.tmp"

# Sort by alignment identity (column 9, pident), highest first
sort -k9,9nr "$OUTDIR/$OUTNAME.pro.tmp" > "$OUTDIR/$OUTNAME.pro"
rm "$OUTDIR/$OUTNAME.pro.tmp"

# Redirect into wc so only the count is printed, not the file name
echo " hits in palmDB: $(wc -l < "$OUTDIR/$OUTNAME.pro")"

This will return all hits in palmDB. Those sequence identifiers form the basis of the Virome Query.
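To pull those identifiers out of the table written by the DIAMOND command above, something like the following could work. This is only a sketch: `sotu_ids` is a hypothetical helper name, and it assumes the 12-column tab-separated layout requested with `-f 6`, where column 5 is `sseqid`.

```shell
# Hypothetical helper, not part of palmID or Open Virome.
# Prints each subject identifier (column 5, sseqid) once,
# preserving the identity-sorted order of the table on stdin.
sotu_ids() {
  cut -f5 | awk 'NF && !seen[$0]++'
}
```

Usage would be along the lines of `sotu_ids < "$OUTDIR/$OUTNAME.pro"`, giving one identifier per line with the best match first.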

Limits

Return only the top 200 matches; otherwise groups like Narnaviruses will query for tens of thousands of sOTU at once and the search will grind to a halt.
Allow matches to be returned within a percent-identity range from 0 to 100, as reported in the DIAMOND pident field (the output reporting could probably be reduced to the minimal information we need).
In this mode, each sOTU returned by the search is used in the query as an exact sOTU match only.
The output of the DIAMOND lambda, a list of sOTU and their percent identity (pid), can itself be passed to palm_graph to query for known sOTU related to them within a set identity threshold.
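The first two limits could be applied to the DIAMOND table in one pass. A rough sketch, assuming a hypothetical helper `filter_hits` and the column layout from the command above (column 5 = `sseqid`, column 9 = `pident`); the argument order and defaults are illustrative only:

```shell
# Hypothetical helper, not part of palmID or Open Virome.
# Reads the 12-column DIAMOND table on stdin and prints "sseqid<TAB>pident"
# for hits inside the identity range, best first, capped at max_hits rows.
filter_hits() {
  local min_pid="${1:-0}"     # lower bound on pident, inclusive
  local max_pid="${2:-100}"   # upper bound on pident, inclusive
  local max_hits="${3:-200}"  # cap on the number of sOTU returned

  awk -F'\t' -v lo="$min_pid" -v hi="$max_pid" \
      '$9 >= lo && $9 <= hi { print $5 "\t" $9 }' \
    | sort -t "$(printf '\t')" -k2,2nr \
    | head -n "$max_hits"
}
```

For instance, `filter_hits 80 100 < "$OUTDIR/$OUTNAME.pro"` would keep at most 200 sOTU with at least 80% identity. The two-column output is one possible reading of "minimal information"; the actual fields the lambda should return are still to be decided.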

serratus-bio / open-virome

[Feature] RdRp / sOTU search based on sequence #62
