joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Plotting barplots with category names starting with 0 #488

Closed swuyts closed 9 years ago

swuyts commented 9 years ago

Hello,

I've been playing around with Phyloseq today and am very pleased with the package for now! I'm using the Restroom Biogeography page as a way to explore Phyloseq, using my own data.

Unfortunately I've run into some trouble trying to plot "Figure 1 Part A (remake), attempt 2", which uses the plot_bar command. It took me a long time to realize that it had something to do with the names that I gave to my categories after merging the data (instead of grouping them by "SURFACE", I am grouping them by "Day"). I've named them ("01","02","03",...,"13","17","21","56").

When they are named like this I get the following plot: [screenshot from 2015-06-14 17 47 13]

As you can see, all the samples starting with a zero are missing from the plot, even though they are still in the data frame. I figured out that the leading zero in the name had something to do with this, so I added the letter D in front of every category name, resulting in: [screenshot from 2015-06-14 17 50 17]

Which is the expected result.

Have you heard about this issue before?

Cheers!

michberr commented 9 years ago

Hi,

The plot_bar() function calls psmelt() which then calls melt() from the reshape2 package. The melt function simply takes your data from wide format (where rows are OTUs and columns are samples) to long format (where OTU names, samples, and OTU abundance are all columns). This is the standard format for utilizing the graphing abilities of ggplot2. It is during the melt step that all of your samples with a leading 0 are converted to a version without a leading 0. Since these samples no longer match any samples in your sample data they are not present in the plot.
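To make the diagnosis above concrete, here is a minimal sketch in base R. It assumes that the melt step applies a conversion similar to base R's `type.convert()` to the sample names; that is an assumption about reshape2's internals, not something confirmed in this thread. The IDs are a subset of the ones from the original report.

```r
# Sample IDs as named in the original report (subset).
ids <- c("01", "02", "13", "56")

# Assumption: the melt step converts names roughly like this.
converted <- type.convert(ids, as.is = TRUE)
print(class(converted))

# Turned back into text, the leading zeros are gone, so the
# zero-prefixed IDs no longer match the original sample names.
matches <- as.character(converted) %in% ids
print(setNames(matches, ids))
```

Under that assumption, only the IDs without a leading zero still match, which is consistent with the samples that survive in the first plot.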

That is the diagnosis for the unexpected behavior, but unfortunately it doesn't offer a solution. However, it seems like you already found a good workaround by changing your sample names, so I would stick with that. Beware that many popular R functions may behave unexpectedly if your column names start with 0 or contain special characters. Hope that helps.

Best, Michelle

swuyts commented 9 years ago

Hey Michelle,

Thanks for this! I've been using reshape2 for ggplot2 and did not know that this was a common 'problem'. This is very useful information for labeling samples in the future.

Thank you.

Cheers, Sander

joey711 commented 9 years ago

Thanks, @michberr, great answer.

It is a blessing and a curse that ggplot2 will attempt to infer which form of axis you want (e.g. continuous or categorical) from the data type. It can be a problem for various date encodings as well, for example.

To be honest, I'm a little surprised that it converted your character to a continuous scale, but this would be expected behavior if the sample IDs had been R integers from the outset.

Either way, you can avoid this by including ID values that begin with a letter, as @michberr pointed out.
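A rough sketch of the workaround in base R, for anyone landing here later. The `"D"` prefix is the one used in the original report; the object name `ps` in the comment is hypothetical, and the `type.convert()` check only stands in for whatever conversion the plotting pipeline actually performs.

```r
ids <- c("01", "02", "13", "56")

# Prefix every ID with a letter so it can no longer be parsed as a number.
safe_ids <- paste0("D", ids)
print(safe_ids)

# The prefixed IDs stay character under automatic type conversion.
round_trip <- type.convert(safe_ids, as.is = TRUE)
print(identical(round_trip, safe_ids))

# With a phyloseq object (hypothetical name `ps`), the same idea would be:
# sample_names(ps) <- paste0("D", sample_names(ps))
```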

Issue closed! great job :)

joey

burkesquires commented 7 years ago

Hi @joey711, all. We (the NIAID Nephele microbiome analysis portal team) have encountered the same issue when users have integers as identifiers. Is there any change to the advice above to add a character to the beginning of the identifiers?

spholmes commented 7 years ago

We have this problem not only with ggplot2 but also with ade4 and vegan. Please do not give samples names that start with numbers if at all possible. Some programs automatically add an X in front, but most don't, so this advice still stands and is unfortunately difficult to fix.

Susan


-- Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
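As one illustration of the "automatically add an X" behavior mentioned above, base R's `make.names()` and the default `check.names = TRUE` in `data.frame()` do exactly that. This is only a sketch of the base R case; it says nothing about how ade4 or vegan treat such names.

```r
ids <- c("01", "02", "13")

# make.names() prepends "X" to names that are not syntactically valid.
print(make.names(ids))

# data.frame() applies the same rule to column names by default...
m <- matrix(0, nrow = 1, ncol = 3, dimnames = list(NULL, ids))
df_checked <- data.frame(m)
print(names(df_checked))

# ...unless check.names = FALSE is passed, which keeps the names as given.
df_raw <- data.frame(m, check.names = FALSE)
print(names(df_raw))
```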

burkesquires commented 7 years ago

Thank you Susan. Just making sure I was not missing something obvious...or nearly obvious! :-)

burkesquires commented 7 years ago

Just noting that the issue affects floating-point identifiers as well. The only solution seems to be adding some text to the beginning or end of each identifier.

spholmes commented 7 years ago

Yes, that is consistent with the requirements of ade4 and vegan in particular for the naming of rows. Susan
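For checking a set of identifiers up front, something like the following base R sketch could flag the ones that look numeric. The helper name `looks_numeric` is made up for this example, and `type.convert()` is again only a stand-in for the conversions that downstream packages may apply.

```r
# Hypothetical helper: TRUE for IDs that automatic conversion would turn
# into a non-character type (integers, floating-point values, and so on).
looks_numeric <- function(ids) {
  vapply(
    ids,
    function(x) !is.character(type.convert(x, as.is = TRUE)),
    logical(1),
    USE.NAMES = FALSE
  )
}

ids <- c("01", "1.5", "2.0", "D3")
flagged <- looks_numeric(ids)
print(setNames(flagged, ids))

# Prefix only the flagged IDs, leaving the safe ones untouched.
fixed <- ifelse(flagged, paste0("S", ids), ids)
print(fixed)
```

Prefixing every identifier uniformly, as suggested earlier in the thread, is probably simpler in practice; the selective version is shown only to make the check visible.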
