Obtaining abundance read counts from DADA2 analysis

hwfes commented 4 years ago

Dear Penn State Microbiome Center DAWG:

Just finished a DADA2 analysis of fungal reads using the app on Galaxy (probably should have done it via R but thought I could help Galaxy trouble shoot its DADA2 app). The outcome was very frustrating in that I only obtained ASV sequences with assigned taxonomy. The thing we really need is this information with sequence counts (i.e., abundance) by sample and ASV.

Please suggest some ways to proceed in obtaining sequence counts (i.e., abundance) by sample and ASV starting with sequence data files from DADA2 output?

It is the first time I have used the DADA2 analysis approach, which means I do not know everything about what data it is producing. I am hoping to not need to go back to the beginning of DADA2 and repeat everything I have done in Galaxy with R.

Data sets in text format (not data frames) I have include: 1) tabular assign taxonomy, 2) remove bimeras sequence table, 3) sequence table, 4) merge pairs, etc. Note that sequences are not generated in fasta format by DADA2 (what were they thinking!!).

I notice that the remove bimeras sequence table has a bunch of individual numbers below each sequence (i.e., ASV) whose tally comes to 158, or the number of samples analyzed. Are these the abundance values for a particular sample and ASV? If so, please suggest a way to exact that information.

Note: I am a Penn State and could set up a Zoom meeting if you would like. I cannot currently attend the DAWG Zoom office hours due to teaching during that time period.

Thank you, Howard

maracashay commented 4 years ago

Hello Howard,

Thanks for reaching out to us.

I don't think you'll need to re-run the processing steps because I think you already have what you need (abundance table). The file that you got after removing bimeras is your abundance table. If you were to open up the file in R you would see the ASV sequences are in the columns and the rows should be your samples and then the numbers represent the abundances.

"The outcome was very frustrating in that I only obtained ASV sequences with assigned taxonomy."

I would check the number of ASVs in the sequence table output from the remove bimeras code with the number of ASVs in the 'assigned taxonomy' table. They should be the same. I've ran DADA2 now on multiple 16S and ITS datasets and haven't lost any ASVs during the assign taxonomy step.

"Data sets in text format (not data frames) I have include: 1) tabular assign taxonomy, 2) remove bimeras sequence table, 3) sequence table, 4) merge pairs, etc. Note that sequences are not generated in fasta format by DADA2 (what were they thinking!!)."

I think there are some reasons why the output isn't fasta format, mostly in that the taxa table and sequencing table can easily be placed into a phyloseq object and then analyses can easily be performed in R.

But, there is a relatively easy way to get the sequences into a fasta file. First through, I would recommend making a phyloseq object, assigning ASV#'s to replace the sequences and then outputting the fasta file. The resulting code would look like this:

note that it is not necessary to include the 'sample_data' here. phy <- merge_phyloseq(tax_table(taxa), otu_table(seqtab.nochim, taxa_are_rows = FALSE), sample_data(map)) dna <- Biostrings::DNAStringSet(taxa_names(phy)) names(dna) <- taxa_names(phy) phy.new <- merge_phyloseq(phy, dna) taxa_names(phy.new) <- paste0("ASV", seq(ntaxa(phy))) phy.new

now the phyloseq object has been made, we can easily export a fasta file for the sequences like so:

uniquesToFasta(seqtab.nochim, fout='rep-seqs.fna', ids=colnames(data.frame(phy@otu_table)))

"I notice that the remove bimeras sequence table has a bunch of individual numbers below each sequence (i.e., ASV) whose tally comes to 158, or the number of samples analyzed. Are these the abundance values for a particular sample and ASV? If so, please suggest a way to exact that information."

I can't tell you specifically what those values are without seeing the code that you ran. If you think this is important, please post the code.

We would also be happy to zoom with you if you think that would be helpful.

hwfes commented 4 years ago

Hi Mara:

Thanks for the info. Confirming that the "bimeras" output had the abundance data really helped. The abundance data by sample and ASV is really important for us in order to examine relationship between tree R gene haplotypes and abundance of fungal ASVs.

I don't know why I thought converting to fasta was going to be difficult. I appreciate the code, but actually I just imported the data into a JMP data table, got rid of the necessary columns, added an ASV number ID, which I needed for other things, along with ">" using find-replace, and saved the JMP data table as text. I also need the JMP data table for the analyses described above.

Have a good day, Howard

From: Mara Cloutier notifications@github.com Sent: Monday, April 27, 2020 6:02 PM To: maracashay/DAWG-Helpline DAWG-Helpline@noreply.github.com Cc: Fescemyer, Howard William hif1@psu.edu; Author author@noreply.github.com Subject: Re: [maracashay/DAWG-Helpline] Obtaining abundance read counts from DADA2 analysis (#2)

Hello Howard,

Thanks for reaching out to us.

I don't think you'll need to re-run the processing steps because I think you already have what you need (abundance table). The file that you got after removing bimeras is your abundance table. If you were to open up the file in R you would see the ASV sequences are in the columns and the rows should be your samples and then the numbers represent the abundances.

"The outcome was very frustrating in that I only obtained ASV sequences with assigned taxonomy."

I would check the number of ASVs in the sequence table output from the remove bimeras code with the number of ASVs in the 'assigned taxonomy' table. They should be the same. I've ran DADA2 now on multiple 16S and ITS datasets and haven't lost any ASVs during the assign taxonomy step.

"Data sets in text format (not data frames) I have include: 1) tabular assign taxonomy, 2) remove bimeras sequence table, 3) sequence table, 4) merge pairs, etc. Note that sequences are not generated in fasta format by DADA2 (what were they thinking!!)."

I think there are some reasons why the output isn't fasta format, mostly in that the taxa table and sequencing table can easily be placed into a phyloseq object and then analyses can easily be performed in R.

But, there is a relatively easy way to get the sequences into a fasta file. First through, I would recommend making a phyloseq object, assigning ASV#'s to replace the sequences and then outputting the fasta file. The resulting code would look like this:

note that it is not necessary to include the 'sample_data' here.

phy <- merge_phyloseq(tax_table(taxa), otu_table(seqtab.nochim, taxa_are_rows = FALSE), sample_data(map)) dna <- Biostrings::DNAStringSet(taxa_names(phy)) names(dna) <- taxa_names(phy) phy.new <- merge_phyloseq(phy, dna) taxa_names(phy.new) <- paste0("ASV", seq(ntaxa(phy))) phy.new

now the phyloseq object has been made, we can easily export a fasta file for the sequences like so:

uniquesToFasta(seqtab.nochim, fout='rep-seqs.fna', ids=colnames(data.frame(phy@otu_table)))

"I notice that the remove bimeras sequence table has a bunch of individual numbers below each sequence (i.e., ASV) whose tally comes to 158, or the number of samples analyzed. Are these the abundance values for a particular sample and ASV? If so, please suggest a way to exact that information."

I can't tell you specifically what those values are without seeing the code that you ran. If you think this is important, please post the code.

We would also be happy to zoom with you if you think that would be helpful.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmaracashay%2FDAWG-Helpline%2Fissues%2F2%23issuecomment-620259318&data=02%7C01%7Chif1%40psu.edu%7Ca7673449e2704ef7795308d7eaf6afb3%7C7cf48d453ddb4389a9c1c115526eb52e%7C0%7C0%7C637236217542633803&sdata=BBqxxmBFnNjiepaV2NkaWKR6IXPeOdfw%2Blwze8xvh9s%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAPLPBQPYYF2LADU7NQCOBZ3ROX6HRANCNFSM4MR6C7TA&data=02%7C01%7Chif1%40psu.edu%7Ca7673449e2704ef7795308d7eaf6afb3%7C7cf48d453ddb4389a9c1c115526eb52e%7C0%7C0%7C637236217542643798&sdata=1GKh94NLBFKut3VXbKdufeo%2FS99nw2rUz1lBs1RoqZU%3D&reserved=0.

maracashay / DAWG-Helpline

Obtaining abundance read counts from DADA2 analysis #2