ksumngs / yavsap

Yet Another Viral Subspecies Analysis Pipeline
https://ksumngs.github.io/yavsap
MIT License

[Feature]: Add an indication of which genotype the sample (most likely) belongs to #14

Closed MillironX closed 2 years ago

MillironX commented 2 years ago

This request comes from Rachel Palinski, Bill Wilson, and Dana Mitzel.

Summary

Place a table on the front page of The Visualizer showing which genotype/strain each sample's consensus sequence and haplotypes BLAST closest to.

Added Features

Additional processes

This feature may need to be implemented in Python/R/Julia within a new process block. It should not require any new tools, however.

Additional visualizer section

[Attachment: seq graph.zip] [Image: seq graph screenshot]

The attached file contains an HTML page with a prototype of the genome table design. The table contains columns displaying:

  1. The sample name
  2. The haplotype name
  3. The haplotype abundance within that sample
  4. The genotype/strain name
  5. A link to the GenBank record for that genotype
  6. The annotated sequence of the haplotype
    • This sequence is color-coded by base, with variant positions highlighted in each haplotype sequence. The sequence cell also scrolls sideways.

This graph should go front-and-center on the home page of The Visualizer.
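
As a rough illustration of the columns listed above, the following Python sketch renders one table row, wrapping each base in a span with a per-base CSS class and adding an extra class at variant positions. The `HaplotypeRow` field names, the CSS class names, and the use of the NCBI nuccore URL for the GenBank link are assumptions made for illustration; none of them are taken from the attached prototype.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class HaplotypeRow:
    """One row of the genome table (field names are hypothetical)."""
    sample: str
    haplotype: str
    abundance: float      # fraction of the sample, 0.0 to 1.0
    genotype: str
    accession: str        # GenBank accession of the closest genotype
    sequence: str         # aligned haplotype sequence
    variant_positions: frozenset = frozenset()  # 0-based indices


def render_sequence(sequence, variant_positions=frozenset()):
    """Wrap every base in a <span> so CSS can color it by base."""
    spans = []
    for i, base in enumerate(sequence.upper()):
        # Gaps and ambiguous bases share one fallback class.
        css = f"base-{base.lower()}" if base in "ACGT" else "base-other"
        if i in variant_positions:
            css += " variant"
        spans.append(f'<span class="{css}">{escape(base)}</span>')
    # overflow-x lets a long sequence scroll sideways inside its cell.
    return ('<div class="seq" style="overflow-x:auto;white-space:nowrap">'
            + "".join(spans) + "</div>")


def render_row(row):
    """Render the six columns of one row as an HTML <tr>."""
    # Assumed link pattern: NCBI nuccore page for the accession.
    url = f"https://www.ncbi.nlm.nih.gov/nuccore/{escape(row.accession)}"
    cells = [
        escape(row.sample),
        escape(row.haplotype),
        f"{row.abundance:.1%}",
        escape(row.genotype),
        f'<a href="{url}">{escape(row.accession)}</a>',
        render_sequence(row.sequence, row.variant_positions),
    ]
    return "<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>"
```

Keeping the coloring in CSS classes rather than inline colors would let The Visualizer restyle the table without regenerating it.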

More Info

Context

Dr. Palinski wanted easy-to-read genome calls. It took me a while to figure out the best place to put them. Bill and Dana like pretty graphs. They couldn't really tell me what the graphs should look like, so I guessed and came up with this. It is very information-dense and should please everyone.

Possible implementation

To pull this off, we will need to:

  1. Convert all haplotype YAMLs into haplotype fastas, while maintaining frequency data
    • haplotyping:HAPLINK_FASTA currently does the YAML-to-fasta conversion, while SIMULATED_READS:HAPLOTYPE_DEPTH calculates depth from single-haplotype YAML files. This will need to be rethought.
  2. Concatenate the following for all samples:
    • Haplotype fastas + frequencies
    • Consensus sequences
  3. Perform alignment of each sequence to the reference genome of params.genome
    • Each of these aligned sequences needs to be exactly the same length, so a multi-alignment using MAFFT might be the best option
    • In the case of multi-alignment, conversion into a metadata-rich format like Nexus might be useful for maintaining frequency data
  4. Take every one of those sequences, associate it back to its sample, and print it to an HTML table
    • We could make this on-the-fly in Node.js, but I think it would be far better to create the table during the pipeline run, then <iframe> or include it in The Visualizer statically.
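
The data handling around steps 1 through 3 could be sketched in Python roughly as follows. This is only a sketch under several assumptions: each haplotype has already been reduced to a name, a frequency, and a sequence (the HapLink YAML layout is not reproduced here), the frequency travels in the FASTA header as a hypothetical `freq=` tag, the aligner passes headers through unchanged, and the multi-alignment itself is produced separately (for example by MAFFT). All function names are illustrative.

```python
def to_fasta(records):
    """Serialize (name, frequency, sequence) tuples, keeping the
    frequency in the header as a hypothetical `freq=` tag."""
    return "".join(
        f">{name} freq={freq}\n{seq}\n" for name, freq, seq in records
    )


def parse_fasta(text):
    """Parse FASTA text into (name, frequency, sequence) tuples.
    Frequency is None when the header has no `freq=` tag."""
    records = []
    name, freq, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, freq, "".join(chunks)))
            fields = line[1:].split() or [""]
            name, freq, chunks = fields[0], None, []
            for field in fields[1:]:
                if field.startswith("freq="):
                    freq = float(field[len("freq="):])
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, freq, "".join(chunks)))
    return records


def variant_positions(reference, aligned):
    """Return 0-based alignment columns where a sequence differs from
    the reference. Both must come from the same multi-alignment, so
    they are required to be exactly the same length."""
    if len(reference) != len(aligned):
        raise ValueError("aligned sequences must be the same length")
    # Compare case-insensitively, since aligners may change letter case.
    pairs = zip(reference.upper(), aligned.upper())
    return [i for i, (r, a) in enumerate(pairs) if r != a]
```

Carrying the frequency in the header is the simplest option that survives a plain FASTA round trip; a richer format such as Nexus would be the alternative if more metadata needs to ride along.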