nextgenusfs / amptk

AMPtk: Amplicon ToolKit for NGS data (formally UFITS)
http://amptk.readthedocs.io/
BSD 2-Clause "Simplified" License
38 stars 14 forks source link

Matching MockA, MockB1 and MockB2 to FASTQ filenames #90

Closed peterjc closed 2 years ago

peterjc commented 3 years ago

Cross reference #74 which was also about Figure 4 https://doi.org/10.7717/peerj.4925/fig-4 in the paper:

The terms MockA, MockB1 and MockB2 do not appear in the FASTQ filenames or SRA/ENA metadata which instead have names like these (focusing on the Illumina samples - the Ion Torrent names are similar):

It is clear from the SRA/ENA metadata that:

I am inferring from dilutions in the Figure 4 caption that:

Is that correct so far?

What is the meaning of the prefix m3, m4A, m4b, m4C, m5 or m6 in the filenames?

Related to that, which of Illumina samples m4A-mock3-*, m4C-mock3-* and m6-mock3-* are shown in the Figure 4 heatmap, and Figure 6?

Thank you

nextgenusfs commented 2 years ago

Hi @peterjc, sorry for my delayed response.

Because of how you submit samples to SRA they have to be minimally processed, ie each sample is a demuxed FASTQ, thus it is confusing (at least in my opinion). You can get the single file FASTQ.GZ files from the OSF repository instead of trying to re-create from SRA data.

What is the meaning of the prefix m3, m4A, m4b, m4C, m5 or m6 in the filenames?

Because we ended up sequencing the same environmental samples multiple times, it was just a prefix we were using to have unique idenfiers on each sample (ie when we submit to sequencing vendor for MiSeq used these names).

So Figure 4 is derived from the m3 (ion torrent) and the m4A (miseq) datasets. These were chronologically earlier on in this project -- as you can imagine we started with a simple query of trying to validate the read abundances from our new (at the time) Ion Torrent PGM sequencer in comparison to MiSeq, we then of course discovered what we wrote up in the paper along the way. These data were generated by pulling out these sample and mapping the reads to the "biological mock" references sequences.

This is from the OSF Jupyter notebook on how those data were generated:

#while the literature says that it is not a good idea to subsampling your reads so that you have equal number 
#of input reads in each sample will allow you to use abundance metrics, since people still do this, lets see what
#happens with the data if we subsample.  Note this is also incorrectly called rarefying...
#as you would expect, this does not change the results... subsampling doesn't allow you trust abundance
#combine the biomock samples
!cat ion3.biomock.fq.gz m4a.biomock.fq.gz > biomock.fq.gz
!amptk sample -i biomock.fq.gz -n 100000 -o biomock.sampled.fq.gz
#convert to fasta
!vsearch --fastq_filter biomock.sampled.fq.gz --fastaout biomock.sampled.fa
#map to biomock OTUs
!vsearch --usearch_global biomock.sampled.fa --otutabout biomock.sampled.otu_table.txt --id 0.97 \
    --db $HOME/amptk/DB/amptk_mock2.fa
#reorder OTU table for drawing heatmap
!amptk filter -i biomock.sampled.otu_table.txt -f $HOME/amptk/DB/amptk_mock2.fa -p 0 --col_order m3-stds m3-mock3-16000 m3-mock3-32000a m3-mock3-32000b m4A-stds m4A-mock3-16000 m4A-mock3-32000a m4A-mock3-32000b
#draw heatmap
!amptk heatmap -i biomock.sampled.final.txt --color YlOrBr -m heatmap --annotate -o biomock.sampled.pdf

Figure 6 was derived from datasets m5 (ion torrent) and m6 (miseq) which contained our synthetic spike in control (SynMock). To make the figure look they way it did, I did a little extra re-ordering, etc:

#Run UPARSE clustering on Miseq and Ion datasets containing SynMock
!amptk cluster -i ion5_1mm.demux.fq.gz -o ion5_1mm
!amptk cluster -i ion5.demux.fq.gz -o ion5
!amptk cluster -i m6_1mm.demux.fq.gz -o m6_1mm
!amptk cluster -i m6.demux.fq.gz -o m6

#using amptk filter command we can map the OTUs to the mock community
!amptk filter -i ion5_1mm.otu_table.txt -f ion5_1mm.cluster.otus.fa -p 0 -b m5-SynMock -m synmock -o ion5_1mm --keep_mock
!amptk filter -i ion5.otu_table.txt -f ion5.cluster.otus.fa -p 0 -b m5-SynMock -m synmock -o ion5 --keep_mock
!amptk filter -i m6_1mm.otu_table.txt -f m6_1mm.cluster.otus.fa -p 0 -b m6-SynMock -m synmock -o m6_1mm --keep_mock
!amptk filter -i m6.otu_table.txt -f m6.cluster.otus.fa -p 0 -b m6-SynMock -m synmock -o m6 --keep_mock

#now want to take a look at synmock distribution in entire miseq6 run, see where bleed is happening
#use the 0 mismatch dataset as that is what I think one should use.

import pandas as pd
from natsort import natsorted
#load the miseq table, no filtering version
df = pd.read_csv('m6.final.txt', sep='\t', index_col=0)
#re-order the columns 
keepers = ['m6-301-1', 'm6-301-2', 'm6-500-1', 'm6-500-2', 'm6-712-1', 'm6-712-2', 
           'm6-736-1', 'm6-736-2', 'm6-744-1', 'm6-744-2', 'm6-755-1', 'm6-755-2', 
           'm6-757-1', 'm6-757-2', 'm6-766-1', 'm6-766-2', 
           'm6-mock3-16000', 'm6-mock3-32000a', 'm6-mock3-32000b', 'm6-stds', 'm6-SynMock']
df2 = df.reindex(columns=keepers, index=natsorted(df.index))
mocks = []
for x in df2.index:
    if x.startswith('mock'):
        mocks.append(x)
otus = [item for item in df2.index if item not in mocks]

df3 = df2.reindex(columns=keepers, index=mocks)
df3.to_csv('figure.m6.onlymocks.txt', sep='\t')
df4 = df2.reindex(columns=keepers, index=otus)
#get number of OTUs from df4 as well as total reads
reads = df4.sum(axis=0)
reads.name ="Other OTUs"
otu_count = df4[df4 > 0].count(axis=0, numeric_only=True)
otu_count.name = "Number of OTUs"
#print reads
#print otu_count
#now append the read numbers to df3 dataframe
df3 = df3.append(reads)
#print df3
total_reads = df3.sum(axis=0)
#print total_reads
df3.to_csv('figure.m6_synmock.txt', sep='\t')

!amptk heatmap -i figure.m6_synmock.txt -d tsv -o figure3.pdf --format pdf -m heatmap --annotate --color YlOrBr --vmax 120000

# same method with the ion torrent m5 dataset
df = pd.read_csv('ion5.final.txt', sep='\t', index_col=0)
#Figure 2 should be about index bleed, so generate a table containing the samples that were run in all runs
keepers = ['m5-301-1', 'm5-301-2', 'm5-500-1', 'm5-500-2', 'm5-712-1', 'm5-712-2', 'm5-736-1', 'm5-736-2', 
           'm5-744-1', 'm5-744-2', 'm5-755-1', 'm5-755-2', 'm5-757-1', 'm5-757-2', 'm5-766-1', 'm5-766-2', 
           'm5-mock3.0-16000', 'm5-mock3.5-16000', 'm5-SynMock1.736-1.736-2', 'm5-SynMock2.757-1.757-2',
           'm5-SynMock']

df2 = df.reindex(columns=keepers, index=natsorted(df.index))
mocks = []
for x in df2.index:
    if x.startswith('mock'):
        mocks.append(x)
otus = [item for item in df2.index if item not in mocks]

df3 = df2.reindex(columns=keepers, index=mocks)
df3.to_csv('figure6.ion.onlymocks.txt', sep='\t')
df4 = df2.reindex(columns=keepers, index=otus)
#get number of OTUs from df4 as well as total reads
reads = df4.sum(axis=0)
reads.name ="Other OTUs"
otu_count = df4[df4 > 0].count(axis=0, numeric_only=True)
otu_count.name = "Number of OTUs"
#print reads
#print otu_count
#now append the read numbers to df3 dataframe
df3 = df3.append(reads)
#print df3
total_reads = df3.sum(axis=0)
#print total_reads
df3.to_csv('figure6.ion_synmock.txt', sep='\t')

!amptk heatmap -i figure6.ion_synmock.txt -d tsv -o figure6_ion.pdf --format pdf -m heatmap --annotate --color YlOrBr --vmax 17000
peterjc commented 2 years ago

Thank you! My primary interest here is the synthetic spike in control (SynMock) as used with the Illumina data shown in Figure 6, so I think I only need to focus on the m6 Illumina files.