The dataset we will be working are the practice dataset from the H3ABioNet 16S rDNA diversity analysis SOP. The source data can be accessed here but for our purposes it is already on the cluster and stored here:/scratch/DB/bio/training/16SrRNA/dog_stool_samples
The table below contains the metadata associated with the dog stool samples. There are three dogs which are treated with increased percentage of a compound in their diet: 5 different treatments (0-4, representing an increased percentage of a compound in their diet).
Sample | Dog | Treatment | Read counts r1 | Read counts r2 |
---|---|---|---|---|
Dog1 | B | 2 | 118343 | 118343 |
Dog2 | G | 3 | 108679 | 108679 |
Dog3 | K | 3 | 101482 | 101482 |
Dog8 | B | 4 | 108731 | 108731 |
Dog9 | G | 0 | 109500 | 109500 |
Dog10 | K | 4 | 79342 | 79342 |
Dog15 | B | 1 | 131483 | 131483 |
Dog16 | G | 4 | 114424 | 114424 |
Dog17 | K | 0 | 99610 | 99610 |
Dog22 | B | 3 | 145029 | 145029 |
Dog23 | G | 1 | 193158 | 193158 |
Dog24 | K | 2 | 162487 | 162487 |
Dog29 | B | 0 | 122776 | 122776 |
Dog30 | G | 2 | 137315 | 137315 |
Dog31 | K | 1 | 150613 | 150613 |
/scratch/DB/bio/training/16SrRNA/16SrRNA-hex-tutorial/results
qsub -I -q UCTlong -l walltime=08:00:00
Once you are on a compute node you will see that the prompt changes from @srvslshpc001
to @srvslshpc60X
e.g.
gerrit@srvslshpc001:~> qsub -I -q UCTlong -l walltime=08:00:00
qsub: waiting for job 1598565.srvslshpc001 to start
qsub: job 1598565.srvslshpc001 ready
gerrit@srvslshpc601:~> hostname
srvslshpc601
source /scratch/DB/bio/training/16SrRNA/16SrRNA-hex-tutorial/config/activate_soft.sh
Now set some variables. For the process_dir
set replace hpc30
with the name that has been given to you in the class
raw_reads_dir=/scratch/DB/bio/training/16SrRNA/dog_stool_samples
process_dir=/researchdata/fhgfs/hpc30
uparse_dir=$process_dir/uparse
taxonomy_dir=$process_dir/tax
alignment_dir=$process_dir/align
greengenes_db=/scratch/DB/bio/qiime/greengenes/gg_13_8_otus
gold_db=/scratch/DB/bio/qiime/uchime/gold.fa
sid_fastq_pair_list=/scratch/DB/bio/training/16SrRNA/16SrRNA-hex-tutorial/sid.fastq_pair.list
fastqc_dir=$process_dir/fastqc
mkdir $fastqc_dir
fastqc --extract -f fastq -o $fastqc_dir -t 1 $raw_reads_dir/*
fastqc_combine.pl -v --out $fastqc_dir --skip --files "$fastqc_dir/*_fastqc"
Now lets view the reports.
renamed_dir=$uparse_dir"/renamed"
mkdir -p $renamed_dir
while read sid_fastq_pair; do sid=`echo $sid_fastq_pair | awk -F ' ' '{print $1}'`; fastq_r1=`echo $sid_fastq_pair | awk -F ' ' '{print $2}'`; fastq_r2=`echo $sid_fastq_pair | awk -F ' ' '{print $3}'`; fastq_r1_renamed=$renamed_dir"/"$(basename $fastq_r1); fastq_r2_renamed=$renamed_dir"/"$(basename $fastq_r2); rename_fastq_headers.sh $sid $fastq_r1 $fastq_r2 $fastq_r1_renamed $fastq_r2_renamed;done < $sid_fastq_pair_list
This will take about 10 minutes to run. Lets have a look at the headers once done.
fastq_maxdiffs=3
merged_dir=$uparse_dir"/merged"
mkdir $merged_dir
while read sid_fastq_pair; do sid=`echo $sid_fastq_pair | awk -F ' ' '{print $1}'`; fastq_r1=`echo $sid_fastq_pair | awk -F ' ' '{print $2}'`; fastq_r2=`echo $sid_fastq_pair | awk -F ' ' '{print $3}'`; fastq_r1_renamed=$renamed_dir"/"$(basename $fastq_r1); fastq_r2_renamed=$renamed_dir"/"$(basename $fastq_r2); usearch9 -fastq_mergepairs $fastq_r1_renamed -reverse $fastq_r2_renamed -fastq_maxdiffs $fastq_maxdiffs -fastqout $merged_dir"/"$sid".merged.fastq";done < $sid_fastq_pair_list
This will take about 1 minute to run. Lets have a look at the fastq files of the merge reads.
fastq_maxee=0.1
filtered_dir=$uparse_dir"/filtered"
mkdir $filtered_dir
while read sid_fastq_pair; do sid=`echo $sid_fastq_pair | awk -F ' ' '{print $1}'`; usearch9 -threads 1 -fastq_filter $merged_dir"/"$sid".merged.fastq" -fastq_maxee $fastq_maxee -fastqout $filtered_dir"/"$sid".merged.filtered.fastq" ;done < $sid_fastq_pair_list
This will take about 1 minute to run. Lets do a read count on the filtered fastqs.
filtered_fastqc_dir=$uparse_dir"/filtered.fastqc"
mkdir $filtered_fastqc_dir
fastqc --extract -f fastq -o $uparse_dir"/filtered.fastqc" -t 1 $filtered_dir/*.fastq
This will take about 2 minutes to run.
fastqc_combine.pl -v --out $filtered_fastqc_dir --skip --files "$filtered_fastqc_dir/*_fastqc"
Lets have a look at the FastQC summaries and see if we notice any changes from the FastQC reports on the raw reads.
filtered_fasta_dir=$uparse_dir"/filtered.fasta"
mkdir $filtered_fasta_dir
for i in `ls -1 $filtered_dir/*.fastq`; do filename=$(basename "$i"); base="${filename%.*}"; seqtk seq -A $i > $filtered_fasta_dir/$base.fa; done
Lets have a look at the fasta format.
cat $filtered_fasta_dir/*.fa > $uparse_dir/filtered_all.fa
cat $uparse_dir/filtered_all.fa | grep -v "^>" | grep -v [^ACGTacgt] | sort -d | uniq -c | while read abundance sequence ; do hash=$(printf "${sequence}" | sha1sum); hash=${hash:0:40};printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}"; done > $uparse_dir/filtered_all.uniques.fa 2> $uparse_dir/filtered_all.uniques.fa.e
This will take about 15 minutes. Lets have a look at the headers.
Sort by size
min_size=2
usearch9 -sortbysize $uparse_dir/filtered_all.uniques.fa -fastaout $uparse_dir/filtered_all.uniques.sorted.fa -minsize $min_size
Do OTU picking
otu_radius_pct=3
usearch9 -cluster_otus $uparse_dir/filtered_all.uniques.sorted.fa -otu_radius_pct $otu_radius_pct -otus $uparse_dir/otus_raw.fa
This will take about 30 seconds. Once done lets count how many OTUs were generated.
usearch9 -threads 1 -uchime2_ref $uparse_dir/otus_raw.fa -db $gold_db -mode high_confidence -strand plus -notmatched $uparse_dir/otus_chimOUT.fa
This will take about 10 seconds. Once done lets check how many OTUs were detected as being chimeric.
Rename OTU headers
fasta_number.py $uparse_dir/otus_chimOUT.fa OTU_ > $uparse_dir/otus_repsetOUT.fa
Split fasta files to reduce memory on usearch_global
run
mkdir $uparse_dir/split_files
cd $uparse_dir/split_files
fasta-splitter.pl --n-parts 100 $uparse_dir/filtered_all.fa
Do de-dereplication
for i in $(ls $uparse_dir/split_files/*.fa); do usearch9 -usearch_global $i -db $uparse_dir/otus_repsetOUT.fa -id 0.97 -strand plus -uc $i.map.uc; done
Combine mappings
cat $uparse_dir/split_files/*.map.uc > $uparse_dir/otus_mappedOUT.uc
This will take about 1 minute to complete. Lets have a look at the final mapping file.
uc2otutab.py $uparse_dir/otus_mappedOUT.uc > $uparse_dir/otus_table.tab.txt
This will take about 20 seconds to complete. Have a look that the OTU table generated.
mkdir $taxonomy_dir
assign_taxonomy.py -i $uparse_dir/otus_repsetOUT.fa -o $taxonomy_dir -r $greengenes_db/rep_set/97_otus.fasta -t $greengenes_db/taxonomy/97_otu_taxonomy.txt -m uclust
This will take about a minutee to complete. Let look at the GreenGenes files and also the final output.
For downstream analysis we need a .biom file. Lets create that from the OTU table.
biom convert -i $uparse_dir/otus_table.tab.txt --table-type="OTU table" -o $process_dir/otus_table.biom
Now lets add the taxonomy annotation to the .biom file.
biom add-metadata -i $process_dir/otus_table.biom -o $process_dir/otus_table.tax.biom --observation-metadata-fp $taxonomy_dir/otus_repsetOUT_tax_assignments.txt --observation-header OTUID,taxonomy,confidence --sc-separated taxonomy --float-fields confidence
Lets have a look if the annotation has been added.
Allign sequences against a template database
mkdir $alignment_dir
align_seqs.py -m pynast -i $uparse_dir/otus_repsetOUT.fa -o $alignment_dir -t $greengenes_db/rep_set_aligned/97_otus.fasta
This will take about 5 minutes to complete. Have a look at the output.
Now filter the alignment to remove gaps.
filter_alignment.py -i $alignment_dir/otus_repsetOUT_aligned.fasta -o $alignment_dir/filtered
Create a phylogenetic tree
make_phylogeny.py -i $alignment_dir/filtered/otus_repsetOUT_aligned_pfiltered.fasta -o $process_dir/otus_repsetOUT_aligned_pfiltered.tre
biom summarize-table -i $process_dir/otus_table.tax.biom -o $process_dir/otus_table.tax.biom.summary.quantative
biom summarize-table --qualitative -i $process_dir/otus_table.tax.biom -o $process_dir/otus_table.tax.biom.summary.qualitative