joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

merge_samples hiccup #243

Closed dmap02 closed 11 years ago

dmap02 commented 11 years ago

I've been trying to merge replicate samples using the merge_samples function in phyloseq. I'm starting with a phyloseq object called "filtered".

If I run this:

map <- sample_data(filtered)
dereplicate <- map$DeReplicate
derep <- merge_samples(filtered, dereplicate, fun=sum)

Two things happen that perhaps shouldn't.

  1. The sample names change. The original object was "filtered" and the output object was "derep". The change in sample names is shown below. I think it's renaming the samples by the dereplicate variable.
sample_names(filtered)
   [1] "Run1.P1.1.03B.D2.R0"   "Run1.P1.1.14B.D8.R0"   "Run1.P1.1.30L.D4.R0"   "Run1.P1.1.25B.D5.R0"  
   [5] "Run1.P1.1.30B.D2.R0"   "Run1.P1.1.19L.D6.R0"   "Run1.P1.1.30B.D3.R0"   
head(sample_names(derep))
[1] "P1-1.03B.D1" "P1-1.03B.D2" "P1-1.03B.D3" "P1-1.03B.D4" "P1-1.03B.D5" "P1-1.03B.D6"
  2. The mapping file changes. Categorical variables get converted to continuous variables, and then summed as well.
head(sample_data(filtered))
Sample Data:        [6 samples by 19 sample variables]:
                             X.SampleID BarcodeSequence  LinkerPrimerSequence Plate  Run       Name Habitat Well
Run1.P1.1.03B.D2.R0 Run1.P1.1.03B.D2.R0    AACTCGTCGATG GCACTCCTACGGGAGGCAGCA     1 Run1 338F_BC002     03B   A2
Run1.P1.1.14B.D8.R0 Run1.P1.1.14B.D8.R0    ACTATTGTCACG GCACTCCTACGGGAGGCAGCA     1 Run1 338F_BC048     14B  D12
head(sample_data(derep))
Sample Data:        [6 samples by 19 sample variables]:
            X.SampleID BarcodeSequence LinkerPrimerSequence Plate Run Name Habitat Well PrimerAKey Tooth Jaw
P1-1.03B.D1          1               1                    1     1   1    1       2    1          1    19   2
P1-1.03B.D2          2               2                    1     1   1    2       2    5          1    19   2

I tried to run merge_samples on the OTU table independently of the mapping file and tree in order to circumvent the summing of the mapping file, but since the sample_names change I can't merge it with the tree...

Any suggestions?

joey711 commented 11 years ago

Hey @dmap02 !

(1) This is intended behavior. Rather than arbitrarily pick one of the sample names to represent a group of merged samples, I instead make the new samples named according to the data they were merged-by. Anyway, the sample names have to change because you're changing the number of samples and what they represent. ... Actually, I just realized in your example you didn't merge by a variable in your sample_data. Bad! I forgot I even allowed this option. Better practice to have the merging data be part of the included sample_data, rather than some vector.

(2) They're not so much continuous variables as they are integers. They're actually coerced to integers from factors, which are internally represented as integers anyway, with special decorations. Exciting, I know. Anyway, it takes just one line to "repair" the damage done by merging, and there's not a simple fix for me to change the behavior of merging, since it is a side-effect of the merge function in base R, which is robust enough to not fail when you attempted to merge factors. I'll think of a way to make this easier to deal with. Meanwhile, the following will "repair" the factor that you merged-by, for example:

sample_data(derep)$DeReplicate <- factor(sample_names(derep))

Since the integers in sample_data(derep) are derived from the original factors, you can also use them on the right-hand side of the replacement/creation statements for the sample data (sample_data(derep)$something <-).
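A minimal base R sketch of the point above, assuming a made-up factor rather than any data from this thread: a factor is a set of integer codes plus a vector of level labels, so codes that survive a merge can be turned back into labels by indexing the original levels. The names `treatment` and `merged_codes` are hypothetical.

```r
# Hypothetical sample variable stored as a factor (not from the thread's data).
treatment <- factor(c("control", "control", "vinasse", "vinasse"))

# A factor is a vector of integer codes plus a vector of level labels.
codes <- as.integer(treatment)   # one integer code per sample
labels <- levels(treatment)      # the label that each code stands for

# Assumption: after merging, one integer code is left per merged sample.
merged_codes <- c(1L, 2L)

# Index the original labels with the surviving codes to rebuild the factor.
restored <- factor(labels[merged_codes], levels = labels)
print(restored)
```

This only recovers meaningful labels when the code left after merging is still a valid level index, for example when every sample in a group shared the same value.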

Hope that helps. Let me know if there is further confusion, or if you have suggestions to make this better. Also, try some examples from merge on data.frames in base R.

dmap02 commented 11 years ago

Thanks @joey711! I changed my approach to use the variable dereplicate from within the mapping file, as you suggested. As for improving the merge_samples function, it might be nice to see the merged sample names renamed to include how many samples were merged together or something, so if the dereplication field is "Subject1_Site1_Date" It might be nice for the new sample name to be "Subject1_Site1_Date_Merge_N" where N is the number of samples that were merged. This would be a wonderful verification step for me, but not necessary. In any case, I'm very happy to have this function. Thanks again for your help!
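The verification step described above can be approximated outside of merge_samples. The following is a rough base R sketch, not a phyloseq feature: it counts how many original samples fall into each group and builds labels of the suggested "_Merge_N" form. The vector `dereplicate` here is invented for illustration.

```r
# Hypothetical grouping vector, one entry per original sample.
dereplicate <- c("Subject1_Site1_Date", "Subject1_Site1_Date",
                 "Subject1_Site1_Date", "Subject2_Site1_Date")

# Count how many original samples fall into each group.
n_merged <- table(dereplicate)

# Build labels of the form "<group>_Merge_<N>", keyed by group name.
merge_labels <- paste0(names(n_merged), "_Merge_", as.integer(n_merged))
names(merge_labels) <- names(n_merged)
print(merge_labels)
```

The counts could be kept as an extra sample variable or simply inspected next to the merged sample names; whether to rename the samples themselves is a separate choice.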

lpitombo commented 10 years ago

Hi @joey711,

I'm just starting out with both R and phyloseq, so I have spent a lot of time trying to make the best tree plot.

I would like to make the treatment and day effects clear by using color as a function of day and shape as a function of treatment. However, after I use "merge_samples" on a column named Treatment_day, the Treatment column turns into numbers, so I can't use it as shape.

This is possible before merge_samples, but then I lose graphic resolution, since the plot includes all replicates.

With your help, I could create another column from "sample_names", but in my case the "sample_names" represent treatment and day, not only treatment.

Is there some method to replace the names in this new column? I know how to do it using common data.frame functions (as sample_data is a data.frame), but it doesn't work on a phyloseq object.

Thanks, Leonardo

lpitombo commented 10 years ago

I'm here again....

After many days, I was able to fix it by creating another "sample_data" and using "merge_phyloseq"...

Thanks a lot! Leonardo

joey711 commented 10 years ago

Leonardo,

The merge_samples function ends up mangling some variables after the merge, which is the default behavior of the merge function in base R. The code to "repair" the relevant sample_data variables is demonstrated above in a case where the sample_names have become the same as the desired variable.

sample_data(derep)$DeReplicate <- factor(sample_names(derep))

In your case you are interested in other variables, too, and it sounds like some of these have also been transformed.

If you provide code and data it might be possible for me to demonstrate a solution. Should be a one line assignment similar to the one shown above, only with a different right-hand side.

I'm glad it sounds like you found a workaround for your problem. Can you post the code you used so that others may benefit?

Thanks for the feedback, and Happy Holidays

joey

lpitombo commented 10 years ago

Hi Joey,

My original sample_data file has the columns:

SampleID; Treatment_day; day; Treatment; Vinasse; Straw; Inorganic_N; organicC.

I used this code to merge the replicates from the same day and treatment:

MergedPiracicabaSamples <- merge_samples(MergedPiracicaba, "Treatment_day", fun=mean)

So, all columns gave me continuous data. This was no problem for the other variables, but I would like to use Treatment as shape in my tree plots.

If I use the code above to fix my file, it gives me a column with Treatment_day codes, but I would like to use Treatment.

So, I created a sample_data2 file, with the same X.SampleID plus my treatment column. I read it:

sample_data_to_merge <- import_qiime_sample_data(file.choose())

After that, I merged the previous phyloseq object with this new one:

treatmentCorrected<-merge_phyloseq(MergedPiracicabaSamplesFilter, sample_data_to_merge)

And now I can plot trees using day as color and shape as treatment!

Cheers, Leonardo
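An alternative to a second mapping file, sketched here in base R under stated assumptions: build a lookup table from the merge key (Treatment_day) to Treatment using the sample data from before the merge, then match it against the merged sample names. The data frame and the `merged_names` vector below are invented; only the column names come from the comment above.

```r
# Hypothetical sample table before merging; column names follow the thread.
original <- data.frame(
  Treatment_day = c("A_1", "A_1", "B_1", "B_1", "A_7", "A_7"),
  Treatment     = c("A",   "A",   "B",   "B",   "A",   "A"),
  stringsAsFactors = FALSE
)

# One row per merge key; this is only valid if each key has a single Treatment.
lookup <- unique(original[, c("Treatment_day", "Treatment")])
stopifnot(!anyDuplicated(lookup$Treatment_day))

# Assumed names of the merged samples (the merge keys), in any order.
merged_names <- c("A_1", "A_7", "B_1")

# Look up the treatment label for each merged sample.
treatment_merged <- factor(
  lookup$Treatment[match(merged_names, lookup$Treatment_day)]
)
print(treatment_merged)
```

With a phyloseq object, `merged_names` would presumably come from sample_names() of the merged object, and the result could be assigned with the same replacement form shown earlier in the thread (sample_data(x)$Treatment <- ...). That step is not tested here.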

barbara1 commented 9 years ago

I also have a similar problem from merging. Factors with names for levels are converted to integers. It seems simple to convert the integers to factor levels with:

factor(sample_data(Dt_creps)$Horizon)
 [1] 1 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 1 2 1 1 2 1 2 1 2 1 2 2 1 2 1 2 1
Levels: 1 2

But I have not found a simple way to convert the numerical factors back to the original names. Any suggestions? Thanks

sample_data(DT_count)[1:10]
Sample Data:        [10 samples by 9 sample variables]:
     SampleID Replicated SiteID   Block Season  Stand Horizon x_coord_UTM
L101     L101   Sample_1 Site_1 Block_1  Fall1 Spruce       L      619910
L103     L111   Sample_1 Site_1 Block_1 Winter Spruce       L      619910
L105     L121   Sample_1 Site_1 Block_1 Spring Spruce       L      619910
L111     L131   Sample_1 Site_1 Block_1 Summer Spruce       L      619910
L113     L141   Sample_1 Site_1 Block_1  Fall2 Spruce       L      619910
L115     S102   Sample_2 Site_1 Block_1  Fall1 Spruce       S      619910
L121     S112   Sample_2 Site_1 Block_1 Winter Spruce       S      619910

Dt_creps = merge_samples(DT_count, "Replicated")
sample_data(Dt_creps)[1:10]
Sample Data:        [10 samples by 9 sample variables]:
          SampleID Replicated SiteID Block Season Stand Horizon x_coord_UTM
Sample_1         7          1      1     1      3     3       1      619910
Sample_10      113          2     14     2      3     1       2      618282
Sample_11       24          3     15     2      3     2       1      617512
Sample_12      114          4     15     2      3     2       2      617512
Sample_13       37          5     16     3      3     3       1      620202
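One possible way to approach the question above, sketched in base R with invented data that only loosely mimics these tables. It assumes the merge averaged the integer codes within each group. Under that assumption, a variable that is constant within every group (like Horizon here) keeps a valid code that can index the original levels, while a variable that varies within groups (like Season) ends up with an averaged code that matches no real level. The helper `constant_within` is hypothetical.

```r
# Hypothetical per-sample data loosely modelled on the tables above.
samples <- data.frame(
  Replicated = c("Sample_1", "Sample_1", "Sample_2", "Sample_2"),
  Season     = factor(c("Fall1", "Winter", "Fall1", "Winter")),
  Horizon    = factor(c("L", "L", "S", "S"))
)

# Hypothetical helper: TRUE for each group where the variable has one value.
constant_within <- function(x, group) {
  tapply(as.character(x), group, function(v) length(unique(v)) == 1)
}

# Horizon is constant per group, so its averaged code maps back to a label.
horizon_ok <- all(constant_within(samples$Horizon, samples$Replicated))
mean_codes <- tapply(as.integer(samples$Horizon), samples$Replicated, mean)
horizon_restored <- levels(samples$Horizon)[as.integer(mean_codes)]

# Season varies within each group, so its averaged code is not a real level.
season_ok <- all(constant_within(samples$Season, samples$Replicated))
season_codes <- tapply(as.integer(samples$Season), samples$Replicated, mean)

print(horizon_restored)
print(season_codes)
```

Applied to the objects above, the analogous indexing would be something like levels(sample_data(DT_count)$Horizon)[sample_data(Dt_creps)$Horizon], which is untested here and only sensible for variables that pass the constancy check.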

orenkolodny commented 7 years ago

Perhaps worth writing something about this issue in the command documentation. It took me half an hour of trying to work this out until I found this thread... Thanks!

joey711 commented 7 years ago

That's a fair request. I won't re-open this issue, but I will note the doc update as a requirement for closing #608

orenkolodny commented 7 years ago

Thanks!
