joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:

http://joey711.github.io/phyloseq/

Reading in data and merging #316

Closed annidjurhuus closed 10 years ago

annidjurhuus commented 10 years ago

When using the import_biom and sample_data functions, is it possible to merge the two results with the merge_phyloseq function?

I have some problems getting my metadata and phylogenetic data into the same dataframe, which means it is not possible for me to label the figures appropriately (e.g. by month/year/geographical location). When I read in the sample data and the biom file and merge them with merge_phyloseq, not all of the metadata columns are read in, so I don't get all the data necessary for analysis.

Am I supposed to join them before reading them into R or is there an essential function I have missed?

My code is:

biom <-import_biom("biom_file", sam_data)

sampledata <- sample_data(data.frame(Location = sample("metadata", size = nsamples(biom), replace = TRUE), size = nsamples(biom), replace = TRUE,row.names = sample_names(biom), stringsAsFactors = FALSE))

df <- merge_phyloseq(sampledata,biom)

This is my output from the dataframe:

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 43143 taxa and 181 samples ]
sample_data() Sample Data:       [ 181 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 43143 taxa by 7 taxonomic ranks ]

The problem happens when I read in the sample data, not with the merging, however, I am not entirely sure what I am doing wrong.

Thank you very much for your help!

Best wishes, Anni

joey711 commented 10 years ago

Anni,

So far you have not described a problem. You call the phyloseq object in your example df, but its print method clearly states it is a "phyloseq-class experiment-level object" with 181 samples, and three components, including your sample_data. It appears that merging worked and that you have a valid phyloseq object with which you can try many other things in phyloseq.

The fact that you have to do this merging at all is most often the result of a deficiency in QIIME, in which available sample covariates in the mapping file are not included in the biom-file output, even though the biom-format is specifically designed to be able to include that information. With tools like merge_phyloseq it is only an extra step, usually two lines of code, to remedy this. If you lost samples (or OTUs) during the merge it would be because they did not match exactly the sample or OTU names in the corresponding OTU table component. The phyloseq constructor checks and only keeps the intersection of sample and OTU index names in all the components, so that if you have a valid phyloseq-object like your df above (again, not a data frame), then each of its components describes exactly the same OTUs and/or samples.
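Since only samples whose names match exactly survive the merge, it can help to compare the two sets of names before merging. The following is a minimal base-R sketch with made-up names; in practice the two vectors would come from something like sample_names() on the imported biom object and rownames() on the metadata table, and the mismatches shown (a case change and a trailing space) are only illustrative assumptions.

```r
# Hypothetical sample names, invented for illustration.
# In a real session these would be taken from the biom object and
# from the row names of the metadata table.
biom_names <- c("S1", "S2", "S3", "S4")
meta_names <- c("S1", "S2", "s3", "S4 ")   # case change and trailing space

# Names present in both: these are the samples a merge would keep.
kept <- intersect(biom_names, meta_names)

# Names present on only one side: these would be dropped.
only_in_biom <- setdiff(biom_names, meta_names)
only_in_meta <- setdiff(meta_names, biom_names)

print(kept)
print(only_in_biom)
print(only_in_meta)

# Trimming whitespace repairs one of the two mismatches.
meta_fixed <- trimws(meta_names)
still_missing <- setdiff(biom_names, meta_fixed)
print(still_missing)
```

If only_in_biom and only_in_meta are both empty, the names agree and no samples should be lost to name mismatches.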

As a side note, I would be suspicious of a dataset with 181 samples and more than 40,000 OTUs. That is most likely too many. Have you tried UPARSE?

I will close this issue for now, unless you can describe a specific problem you are having.

Thanks for the feedback and your interest in phyloseq!

joey

annidjurhuus commented 10 years ago

Dear Joey,

I apologise. My issue was that I read in the whole dataset (biom and mapping file) but didn't get all the columns of my mapping file in. My mapping file has 69 columns with different variables, but I only get 3 into R. Do you know how I could get the whole file into R, including all the columns? I will have a second look at the OTUs. Thank you for recommending UPARSE.

Best wishes, Anni

joey711 commented 10 years ago

Anni,

I would need to see the code/file to answer your question. Typically people use either read.table or read.csv to read tables of data into R, but these are simply the common cases, and even these two functions have many options. There are lots and lots of ways to read data into R, and it depends on your file format. I have no way to know from your comment why 3 out of 60+ columns are being omitted during the file parsing. Check your file for strange characters, comment characters, quotes (or lack of quotes), etc. This should be a pretty straightforward issue to solve. Google search "R import table" and you should find a plethora of examples.
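As a rough illustration of the kind of check meant here, the sketch below writes a small, made-up tab-delimited mapping file to a temporary path, counts the fields on each line, and then reads it with quote and comment handling switched off. The file contents, column names, and the "#SampleID" header are assumptions for the example, not the actual file from this thread.

```r
# Hypothetical mapping file written to a temp path for illustration.
map_path <- tempfile(fileext = ".txt")
writeLines(c(
  "#SampleID\tMonth\tYear\tLocation",
  "S1\tJan\t2012\tSite A",
  "S2\tFeb\t2012\tSite B",
  "S3\tMar\t2013\tSite's C"
), map_path)

# How many tab-separated fields does each line have?
# quote = "" and comment.char = "" stop quotes and '#' from being
# interpreted, so every line is counted as written.
n_fields <- count.fields(map_path, sep = "\t", quote = "", comment.char = "")
print(n_fields)

# Read every column, keeping the header line even though it starts
# with '#', and use the first column as row names.
map_df <- read.delim(map_path, sep = "\t", quote = "", comment.char = "",
                     check.names = FALSE, stringsAsFactors = FALSE,
                     row.names = 1)
print(dim(map_df))
print(colnames(map_df))
```

If count.fields reports different numbers for different lines, or dim() shows fewer columns than the file has, the parsing arguments (separator, quoting, comment character) are the first thing to adjust.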

Hope that helps!

joey

annidjurhuus commented 10 years ago

The code I use to read in the sample data is the code specified in the first post:

# Reading in the biom/OTU table
biom <- import_biom("biom_file", sam_data)
# Reading in the sample data
sampledata <- sample_data(data.frame(Location = sample("metadata", size = nsamples(biom), replace = TRUE), size = nsamples(biom), replace = TRUE, row.names = sample_names(biom), stringsAsFactors = FALSE))
df <- merge_phyloseq(sampledata, biom)

When using R to read in CSV files it doesn't work to merge the biom and mapping file because the mapping file is not a phyloseq object, I presume. But when reading it in with the sample_data function it works fine to merge them; it just doesn't work to read in the whole data frame with the sample_data function. I guess my question is now, how would I make a csv file into a phyloseq object so that I can merge it with the biom file?

Cheers, Anni

joey711 commented 10 years ago

"When using R to read in CSV files it doesn't work to merge the biom and mapping file because the mapping file is not a phyloseq object, I presume"

You presume incorrectly. Any data object in R can be coerced into a format for phyloseq. In this case, use the sample_data function on the "data.frame"-class object that you create when you read the csv file using read.csv. Also note that you already did this without knowing it in your code above (in the sample_data(data.frame( ... )) call).

physeq = import_biom("biom_file")
# Define a data frame from your csv table
sampleDF = read.csv("path/sampleDataFile.csv", row.names=1)
# Check that the rownames match the sample names
all(rownames(sampleDF) %in% sample_names(physeq))
# Convert to "sample_data" class
sampledata = sample_data(sampleDF)
# Now merge.
physeq2 = merge_phyloseq(sampledata, physeq)

Hope that helps! Obviously you will need to modify the file path/name for your data, and you may have to tweak the arguments to read.csv or use read.table, depending on your file.
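To confirm that every metadata column arrived after the merge, something along these lines may be useful. It is only a sketch: the small count matrix and the three-column data frame are invented stand-ins for the imported biom data and the CSV, and it assumes the phyloseq package is installed.

```r
library(phyloseq)

# Toy stand-in for the OTU table that import_biom() would provide;
# the counts are made up.
counts <- matrix(c(5, 0, 3, 2, 7, 1), nrow = 2,
                 dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3")))
otu <- otu_table(counts, taxa_are_rows = TRUE)

# Toy stand-in for the data frame that read.csv() would return.
sampleDF <- data.frame(Month = c("Jan", "Feb", "Mar"),
                       Year = c(2012, 2012, 2013),
                       Location = c("Site A", "Site B", "Site C"),
                       row.names = c("S1", "S2", "S3"),
                       stringsAsFactors = FALSE)

# Stop early if any metadata row name is missing from the OTU table.
stopifnot(all(rownames(sampleDF) %in% sample_names(otu)))

# Convert to "sample_data" class and merge.
merged <- merge_phyloseq(otu, sample_data(sampleDF))

# Every column of sampleDF should be listed as a sample variable.
print(sample_variables(merged))
print(nsamples(merged))
```

Comparing sample_variables(merged) against colnames(sampleDF) shows whether columns were lost at the file-reading step or later.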

p.s. You should review the documentation for the sample function. It is not something you want to be using when you are assigning ID values to a table...
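For reference, here is a small base-R sketch of what that sample() call does. The value n and the ids vector are placeholders for nsamples(biom) and sample_names(biom). sample() draws from the vector it is given, so a one-element character vector just gets repeated, and the extra size and replace arguments to data.frame() become columns of their own. That would be consistent with the "3 sample variables" in the printed output above, although that is an inference from the posted code rather than something checked against the real data.

```r
n <- 4                           # placeholder for nsamples(biom)
ids <- paste0("S", seq_len(n))   # placeholder for sample_names(biom)

# sample() draws from its first argument; with a single string there
# is only one value to draw, so the file is never read.
loc <- sample("metadata", size = n, replace = TRUE)
print(loc)
# [1] "metadata" "metadata" "metadata" "metadata"

# 'size' and 'replace' are not arguments of data.frame(), so they
# are turned into columns alongside Location.
df <- data.frame(Location = loc, size = n, replace = TRUE,
                 row.names = ids, stringsAsFactors = FALSE)
print(colnames(df))
# [1] "Location" "size"     "replace"
```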

annidjurhuus commented 10 years ago

Hi Paul, I guessed so. I am not used to working with phyloseq objects or the package at all. I have a deadline that I am working towards and my previous R tricks wouldn't work for me. Thank you for the great help. I have at least learned the most important thing I did wrong. Cheers, Anni

Aanuoluwaduro commented 1 year ago

This just helped me! Thanks!