sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

2024 STAMPS tutorials - "running k-mer analyses with sourmash" venn & upset diagrams, ordination, and presence/absence diagram #3256

Open ctb opened 1 month ago

ctb commented 1 month ago

https://hackmd.io/9ORFRJGaTOiOdEAY-Aih2A?view


open lab / Sat + Monday , Jul 20 & 22, 2024 / STAMPS 2024

Suggestion for open lab:

Getting started

Start up RStudio on your instance, and click on Terminal.

Setup (do these once)

Create a working directory

mkdir ~/smash-data/

Install sourmash and various plugins with conda (also see a conda tutorial; you'll need to install miniforge if you're not running conda/mamba on your Jetstream computer.)

Then run:

conda env create -f /opt/shared/sourmash-data/env-smash.yml \
    -n smash

This installs:

Change directory and activate environment (each time you log in)

Change to sourmash working directory and activate sourmash software environment:

cd ~/smash-data/
conda activate smash

Might as well check for updates, sourmash is fast moving ;).

pip install -U sourmash_plugin_betterplot

Convert some metagenomes into k-mer signatures

Use sourmash sketch to turn metagenomes into k-mers:

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz \
    -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz \
    -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz \
    -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181

This will take about 5 minutes and create three SRR*.sig.zip files.

(You can grab any metagenome you want from ENA, e.g. SRR7947181; use curl -JLO <read URL to download the FASTQ file(s) and sketch them as above!)

Sketch some genomes into k-mer signatures

Use sourmash sketch to turn genomes into k-mers:

for i in /opt/shared/sourmash-data/10sketches/*.fa
do
    sourmash sketch dna $i -o genome-$(basename $i .fa).sig.zip \
        --name-from-first
done

sourmash sig cat genome*.sig.zip -o 10sketches.sig.zip

Run various analyses on the k-mer sketches!

Genome comparisons via venn and upset

Run:

sourmash scripts venn genome-{2,47,63}.sig.zip \
    -o genomes-venn.png --ident

to produce a file genomes-venn.png:

image

Run:

sourmash scripts upset genome-{2,47,63}.sig.zip \
    --show-singletons -o genomes-upset.png

to produce a file genomes-upset.png:

image

Genome distance matrix, dendrograms, and ordination

First compare the genomes:

sourmash compare 10sketches.sig.zip -o 10sketches.cmp \
    --labels-to 10sketches.labels_to.csv

Now build a matrix+dendrogram view:

sourmash scripts plot2 10sketches.cmp 10sketches.labels_to.csv \
    -o 10sketches.mat.png

to produce 10sketches.mat.png:

image

You can produce a metric MDS plot too, colored by species:

sourmash scripts mds 10sketches.cmp 10sketches.labels_to.csv \
    -o 10sketches.mds.png \
    -C /opt/shared/sourmash-data/10sketches/10sketches-categories.csv

to produce 10sketches.mds.png:

image

Flat (no abund) metagenome comparisons via venn and upset

Run a venn comparison of metagenomes => metag-venn.png.

sourmash scripts venn SRR*.sig.zip \
    -o metag-venn.png

image

Run an upset comparison of metagenomes => metag-upset.png

sourmash scripts upset SRR*.sig.zip \
    --show-singletons -o metag-upset.png

image

Genome presence/absence

Calculate which GTDB genomes are in a metagenome:

sourmash scripts fastgather SRR7947178.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947178.x.gtdb-rs214.csv

(will take ~3 minutes)

Make the detection/abundance plot:

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
    -o SRR7947178.detection.png

=> SRR7947178.detection.png:

image

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
    -o SRR7947178.ani.png \
    --ani

=> SRR7947178.ani.png

image

The three columns being plotted from SRR7947178.x.gtdb-rs214.csv are:

Generating taxonomic classifications for metagenomes with sourmash

The basic workflow is to first run sourmash gather as above, and then run sourmash tax metagenome.

Run gather for each metagenome:

sourmash scripts fastgather SRR7947179.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947179.x.gtdb-rs214.csv

sourmash scripts fastgather SRR7947181.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947181.x.gtdb-rs214.csv

Then you can run taxonomic analyses like so:

sourmash tax metagenome -g *.x.gtdb-rs214.csv  \
    -t /opt/shared/sourmash-data/gtdb-rs214.lineages.sqldb \
    -r order -F human

See the sourmash tax documentations for various output options!

Getting % of metagenome covered from gather

If you've run a fastgather above, you will still need to run sourmash gather to get nice human-readable output; you can do this quickly by using the fastgather output as a picklist to limit the results to the previous calculated output:

sourmash gather --picklist SRR7947179.x.gtdb-rs214.csv::prefetch \
    SRR7947179.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip

and you should then see at the end:

found 849 matches total;
the recovered matches hit 83.1% of the abundance-weighted query.
the recovered matches hit 68.7% of the query k-mers (unweighted).

Running abundhist to generate metagenome abundance histograms

After activating the smash conda environment, you can install the abundhist plugin like so:

pip install sourmash_plugin_abundhist

and run it on any metagenome or mixture that has been sketched with -p abund like so:

sourmash scripts abundhist SRR7947178.sig.zip -M 100 \
    --figure SRR7947178.abundhist.png

Other topics we can discuss!

Appendix 1: UNIX Command Line!

Totally new to the command-line, or want to strengthen your foundation? Do this Unix Crash Course.

If you want some experience in other aspects, consider going through these:

Appendix 2: Conda tutorial

Please see my conda tutorial; you'll need to install miniforge if you're not running things on your Jetstream computer.

Appendix 3: examining assembly overlap with k-mers

See my in-class notes.

Contact info

Contact Titus at ctbrown@ucdavis.edu.

ctb commented 1 month ago

TODO: make setup instructions / files available somewhere!

ctb commented 1 month ago

installation stuff/sourmash

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

conda env create -n smash-foo -f /opt/shared/sourmash-data/env-smash.yml

Appendix: sourmash env file

cat > env-smash.yml <<EOF
name: smash
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - sourmash>=4.8.10,<5
    - seaborn
    - scikit-learn
    - seaborn
    - sourmash_plugin_branchwater==0.9.5
    - pip
    - pip:
      - sourmash_plugin_betterplot
      - sourmash_plugin_abundhist
EOF

Appendix 2

PatientHNC_06_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_2.fastq.gz

PatientHNC_03_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_2.fastq.gz

PatientHNC_10_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_2.fastq.gz

Sketching

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181
ctb commented 1 month ago

thoughts for next year -