joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

merge_samples coerces factors to integers: Feature request #608

Open davidvilanova opened 8 years ago

davidvilanova commented 8 years ago

Hi, the merge_samples function coerces factors to integers in the sample data. This issue was already discussed in #243.

If your sample_data has many different variables, coercing them all back to factors gets complicated. Would it be possible for merge_samples to return sample_data without coercing factors to integers?

Thanks,

susannuske commented 8 years ago

Hello, yes, I have a similar problem. I understand that converting to integer is a feature of 'merge'. The problem is that several other variables are coerced to integers as well, and these variables are needed in further analysis. It is especially troublesome when the variables that are needed are converted to the same integer!

Eg: my sample_data is

type; season; plot; replicate
soil; dry; S1; soildryS1
soil; dry; S1; soildryS1
soil; dry; S2; soildryS2
soil; dry; S2; soildryS2
soil; wet; S1; soilwetS1

when merged by replicate, and renamed replicate by 'sample_names' then, sample_data is:

type; season; plot; replicate
1; 1; 1; soildryS1
1; 1; 2; soildryS2
1; 2; 1; soilwetS1
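For anyone wondering where the integers come from, here is a minimal base-R sketch that reproduces the table above. It assumes the merge converts every column with as.numeric() and then averages within each group. That is my reading of the behaviour and not a copy of the phyloseq code, and the data frame `samp` is a toy stand-in for real sample_data.

```r
# Toy sample data modelled on the example above (illustrative stand-in only).
samp <- data.frame(
  type      = factor(c("soil", "soil", "soil", "soil", "soil")),
  season    = factor(c("dry", "dry", "dry", "dry", "wet")),
  plot      = factor(c("S1", "S1", "S2", "S2", "S1")),
  replicate = factor(c("soildryS1", "soildryS1", "soildryS2",
                       "soildryS2", "soilwetS1"))
)

# as.numeric() on a factor returns its internal integer codes, not its labels.
codes <- data.frame(lapply(samp[, c("type", "season", "plot")], as.numeric))

# Averaging the codes within each replicate group gives the integer-looking table.
merged <- aggregate(codes, by = list(replicate = samp$replicate), FUN = mean)
print(merged)
```

The labels ("dry", "wet", "S1", "S2") are lost at the as.numeric() step, which is why they have to be restored by hand afterwards.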

The solution suggested in #243 to merge another sample data set into the phyloseq object requires tedious renaming of variables and therefore shouldn't be recommended as a solution.

Is there a way to make the merge script also return the factors in the merged sample_data?

Cheers,

joey711 commented 7 years ago

Note: Linking the related threads on this, and adding some explanation/demo to the FAQ is probably a good requirement for closing this issue.

KMKemp commented 7 years ago

I am so very sorry... but I am still having trouble with this issue. I have looked at the referenced issue #243 and still can't fix the factor levels in my sample_data after merging.

for example I now have this:

factor(sample_data(LKTS2015_Rm)$Distance)
[1] 1 2 3 1 2 3 1 2 3 4
Levels: 1 2 3 4

I am trying to fix it with the revalue function in library(plyr) like this:

revalue(factor(sample_data(LKTS2015_Rm)$Distance), c("1"="ten.cm", "2" = "two.cm", "3" = "front", "4" = "healthy"))
[1] ten.cm two.cm front ten.cm two.cm front ten.cm two.cm front healthy
Levels: ten.cm two.cm front healthy

But when I look at this factor again, the levels have not been renamed. I think I am just missing a step but cannot figure it out.

Many thanks for your help with this!

hrogal commented 7 years ago

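KMKemp, I think this is the answer you want. Assign the relabelled factor back to the sample data, giving the integer codes as levels and the names as labels:

sample_data(LKTS2015_Rm)$Distance <- factor(sample_data(LKTS2015_Rm)$Distance, levels = 1:4, labels = c("ten.cm", "two.cm", "front", "healthy"))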

I know I got this from another post by joey711, but I don't remember where.
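To show why the levels seemed not to change, here is a small base-R sketch on a toy vector copied from the output above (the name `distance` is just for illustration). Calling factor() or revalue() without assigning the result only prints a relabelled copy, so the assignment is the step that was missing.

```r
# Integer codes as they appear after merging (toy vector, not a phyloseq object).
distance <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4)

# Map each code to its label and assign the result back;
# without the assignment the original vector is left untouched.
distance <- factor(distance,
                   levels = 1:4,
                   labels = c("ten.cm", "two.cm", "front", "healthy"))

levels(distance)
```

In a phyloseq object the same assignment would go to sample_data(LKTS2015_Rm)$Distance instead of a plain vector.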

carolineoj commented 2 years ago

Hi everyone,

Despite the listed solutions, I was still struggling with this issue because I have a large metadata file with lots of variables I wanted to retain after merging samples. I came up with this imperfect workaround that skips the renaming of variables, which was not feasible with my dataset.

Important warning: With this approach, variables that differed among the merged samples will no longer appear as NA or a combined value in the new metadata file. Each will instead be filled with the value from one of the merged samples. For example, in the original GlobalPatterns dataset there are three soil samples, whose values in the metadata column Primer are "ILBC_01", "ILBC_02", and "ILBC_03". After the modification below, the Primer value for soil will be "ILBC_01", which does not reflect that three samples were merged. If you aren't paying attention and try to use this variable later on, it will cause problems.

If you do downstream analysis, be SURE that you use only variables that were consistent across the samples you merged or you will get incorrect results.

I'm sure there is a better way to do this, or this code could be fixed up to prevent this bad behavior. But I am not very good at coding, so that is a challenge for another day (or person). As is, you could subset the dereplicated dataframe to retain the columns you know are appropriate, before replacing the metadata file in your merged phyloseq object. Also, the sorting step may be redundant, but I wasn't sure, so I kept it.

I'll demo with the GlobalPatterns dataset:

PLEASE READ NOTE ABOVE FOR WARNING BEFORE USING!

# Load phyloseq and the example data
library(phyloseq)
data(GlobalPatterns)
# Drop taxa with no counts
GP = prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)

# Merge according to the factor "SampleType"
derep_GP = merge_samples(GP, "SampleType")

# Create a dereplicated metadata table to retain the sample variables
df <- sample_data(GP) # Extract the sample data
rmreps <- subset(df, !duplicated(SampleType)) # Keep the first row of each SampleType (if associated variables differ within a type, this picks one)
sorted <- rmreps[order(rmreps$SampleType),] # Sort by "SampleType" to match the merged phyloseq object
rownames(sorted) <- sorted$SampleType # Rename rows to match the sample_names of the merged phyloseq object

# Replace the metadata in the merged phyloseq object with the populated table
sample_data(derep_GP) <- sorted

Hope this helps someone else with a big metadata file. If anyone has any improvements or there's a mistake that I missed (yikes), please post.
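One possible way to guard against the problem described in the warning is to blank out any variable that is not constant within a merged group, rather than silently keeping the first value. The sketch below does this in base R on a toy data frame. The helper `collapse_metadata` is hypothetical (it is not part of phyloseq), the toy values are made up for illustration, and I have not tested it on a real phyloseq object.

```r
# Hypothetical helper, not part of phyloseq: collapse a metadata data.frame to
# one row per level of `group`, keeping a value only when it is constant
# within the group and setting it to NA otherwise.
collapse_metadata <- function(df, group) {
  g <- factor(df[[group]])            # re-factoring drops unused levels
  keep <- which(!duplicated(g))       # first row of each group
  keep <- keep[order(g[keep])]        # sort rows into level order
  out <- df[keep, , drop = FALSE]
  rownames(out) <- as.character(g[keep])
  for (col in names(df)) {
    # Number of distinct values per group, returned in level order.
    n_distinct <- tapply(df[[col]], g, function(v) length(unique(v)))
    out[[col]][as.vector(n_distinct) > 1] <- NA
  }
  out
}

# Toy metadata (illustrative values only).
meta <- data.frame(
  SampleType = factor(c("Soil", "Soil", "Ocean", "Ocean")),
  Primer     = c("ILBC_01", "ILBC_02", "ILBC_07", "ILBC_07"),
  Habitat    = factor(c("land", "land", "water", "water")),
  stringsAsFactors = FALSE
)

collapsed <- collapse_metadata(meta, "SampleType")
print(collapsed)
```

With real data, the input would presumably be data.frame(sample_data(GP)), and the result could be wrapped in sample_data() before assigning it to the merged object. Factor columns stay factors, and ambiguous variables show up as NA so they cannot be used by accident.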