serratus-bio / open-virome

monorepo for data explorer UI and APIs
http://openvirome.com/
GNU Affero General Public License v3.0

[Feature] RdRp / sOTU search based on sequence #62

Open ababaian opened 1 month ago

ababaian commented 1 month ago

To let Open Virome search by sequence, the input sequence must be compared against the palmDB sOTU set, which is a FASTA file. See the schematic from palmID:

image

Performing this alignment, together with a percentage slider on the query (see Issue #27), will enable searching all of palmDB/SRA based on a query sequence.

In my experience, this plot has been the most informative for interpreting these results: image. It could be rendered if and only if a query contains a sequence-based entry (e.g. sOTU xxxx + 20% divergence).

How to implement

palmID first performs a search to extract a palmprint. That step is unnecessary here: we can search the input sequence directly against palmDB sOTU. For now, we should support only protein sequences as input.
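Since only protein input is to be supported, a guard in front of the search could reject nucleotide queries early. The following is a minimal sketch, not part of palmID or Open Virome: the function name `is_protein_fasta` is hypothetical, and the heuristic (a sequence made up only of A, C, G, T, U and N is treated as nucleotide) is an assumption that would misclassify a protein built from only those letters.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the FASTA file looks like protein, 1 otherwise.
# Assumed heuristic: a sequence containing only A, C, G, T, U, N is nucleotide.
is_protein_fasta() {
  # Drop header lines, strip whitespace plus gap/stop symbols, upper-case residues
  seq=$(grep -v '^>' "$1" | tr -d ' \t\r\n*-' | tr '[:lower:]' '[:upper:]')
  # An empty sequence is not a valid query
  [ -n "$seq" ] || return 1
  # Any character outside the nucleotide alphabet is taken to mean protein
  printf '%s' "$seq" | grep -q '[^ACGTUN]'
}

# Usage sketch: stop before DIAMOND if the input does not look like protein
# is_protein_fasta "$OUTDIR/$OUTNAME.trim.fa" || { echo 'protein input required' >&2; exit 1; }
```

A stricter check could also validate every character against the amino-acid alphabet, at the cost of rejecting ambiguity codes.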

# RUN DIAMOND =============================================
# Uses a PROTEIN SEQUENCE as input

echo ''
echo '-- running DIAMOND search of palmDB...'
echo ''

# DIAMOND e-value cutoff: 1e-5 (-e 0.00001)
diamond blastp \
  -q "$OUTDIR/$OUTNAME.trim.fa" \
  -d "$DB" \
  --masking 0 -e 0.00001 \
  --tmpdir /tmp \
  --ultra-sensitive -k0 \
  -f 6 qseqid  qstart qend qlen \
       sseqid  sstart send slen \
       pident evalue cigar \
       full_sseq \
  > "$OUTDIR/$OUTNAME.pro.tmp"

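# Sort by alignment identity (pident, column 9), highest first
sort -k9,9nr "$OUTDIR/$OUTNAME.pro.tmp" > "$OUTDIR/$OUTNAME.pro"
rm "$OUTDIR/$OUTNAME.pro.tmp"

# Read the file via stdin so wc prints only the count, not the filename
echo " hits in palmDB: $(wc -l < "$OUTDIR/$OUTNAME.pro")"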

This will return all hits in palmDB. Those sequence identifiers form the basis of the Virome Query.
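To turn the hit table into a list of identifiers for the query, the sorted output could be filtered by percent identity and reduced to unique subject IDs. This is a sketch under assumptions: the function name `hits_above_identity` is hypothetical, the column positions (5 for `sseqid`, 9 for `pident`) follow the `-f 6` field list in the DIAMOND call above, and reading "20% divergence" as a minimum of 80% identity is one possible interpretation rather than a confirmed definition.

```shell
#!/bin/sh
# Hypothetical helper: print each subject ID (column 5, sseqid) once,
# keeping only hits whose percent identity (column 9, pident) is >= the cutoff.
hits_above_identity() {
  # $1 = DIAMOND tabular output, $2 = minimum percent identity (e.g. 80)
  awk -F '\t' -v min_id="$2" \
    '$9 + 0 >= min_id + 0 && !seen[$5]++ { print $5 }' "$1"
}

# Usage sketch: sOTU identifiers within an assumed 20% divergence of the query
# hits_above_identity "$OUTDIR/$OUTNAME.pro" 80 > "$OUTDIR/$OUTNAME.ids"
```

Because the input is already sorted by identity, the identifiers come out in descending order of their best hit.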

Limits