spholmes / F1000_workflow


Component sample names do not match. Try sample_names(). #27

Closed by abalter 6 years ago

abalter commented 6 years ago

This section of the workflow is throwing an error:

library(phyloseq)
ps <- phyloseq(otu_table(seqtabNoC, taxa_are_rows=FALSE), 
               sample_data(samdf), 
               tax_table(taxTab),phy_tree(fitGTR$tree))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
ps

The error is:

Error in validObject(.Object) : invalid class “phyloseq” object: Component sample names do not match. Try sample_names()

I'm pretty sure I've gotten through the sample workflow without this error. Can someone tell me exactly which row or column names are supposed to be matching but are not?

These are the rownames of my variables:

> rownames(otu_table(seqtabNoC, taxa_are_rows=FALSE))
 [1] "F3D0"   "F3D1"   "F3D141" "F3D142" "F3D143" "F3D144" "F3D145" "F3D146" "F3D147" "F3D148" "F3D149" "F3D150"
[13] "F3D2"   "F3D3"   "F3D5"   "F3D6"   "F3D7"   "F3D8"   "F3D9"  
> rownames(sample_data(samdf))
 [1] "NA"    "NA.1"  "NA.2"  "NA.3"  "NA.4"  "NA.5"  "NA.6"  "NA.7"  "NA.8"  "NA.9"  "NA.10" "NA.11" "NA.12"
[14] "NA.13" "NA.14" "NA.15" "NA.16" "NA.17" "NA.18"
> rownames(phy_tree(fitGTR$tree))
NULL

benjjneb commented 6 years ago

Can someone tell me exactly which row or column names are supposed to be matching but are not?

The samdf data.frame should have rownames matching those of the otu_table, e.g. "F3D0", "F3D1", ...
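
One way to see which names disagree before calling phyloseq() is to compare the two sets of row names directly. The following is only a sketch in base R; the small `seqtab` matrix and `samdf` data frame are hypothetical stand-ins for the real objects in this thread, and `missing_in_samdf`, `extra_in_samdf` and `names_match` are illustrative names.

```r
# Toy stand-ins (hypothetical data) for the sequence table and sample data.
seqtab <- matrix(1:6, nrow = 3,
                 dimnames = list(c("F3D0", "F3D1", "F3D2"), c("ASV1", "ASV2")))
samdf <- data.frame(age = c(0, 1, 2),
                    row.names = c("NA", "NA.1", "NA.2"))

# Sample names present in one object but absent from the other.
missing_in_samdf <- setdiff(rownames(seqtab), rownames(samdf))
extra_in_samdf <- setdiff(rownames(samdf), rownames(seqtab))

# TRUE only when both objects describe exactly the same set of samples.
names_match <- length(missing_in_samdf) == 0 && length(extra_in_samdf) == 0

print(missing_in_samdf)
print(extra_in_samdf)
print(names_match)
```

If either setdiff() result is non-empty, those are the names to fix before building the phyloseq object.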

abalter commented 6 years ago

Hey @benjjneb -- Still having a problem. Somehow, the samdf object is getting, to put it technically, messed up.

> samdf <- read.csv("http://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/MIMARKS_Data_combined.csv",header=TRUE)
> head(samdf)
> #summary(samdf)
> samdf$SampleID <- paste0(gsub("00", "", samdf$host_subject_id), "D", samdf$age-21)
> #summary(samdf)
> samdf <- samdf[!duplicated(samdf$SampleID),] # Remove duplicate entries for reverse reads
> head(samdf)
> #summary(samdf)
> rownames(seqtabAll) <- gsub("124", "125", rownames(seqtabAll)) # Fix discrepancy
> all(rownames(seqtabAll) %in% samdf$SampleID) # TRUE
[1] TRUE
> rownames(samdf) <- samdf$SampleIDlib
> keep.cols <- c("collection_date", "biome", "target_gene", "target_subfragment",
+ "host_common_name", "host_subject_id", "age", "sex", "body_product", "tot_mass",
+ "diet", "family_relationship", "genotype", "SampleID") 
> print(keep.cols)
 [1] "collection_date"     "biome"               "target_gene"         "target_subfragment" 
 [5] "host_common_name"    "host_subject_id"     "age"                 "sex"                
 [9] "body_product"        "tot_mass"            "diet"                "family_relationship"
[13] "genotype"            "SampleID"           
> samdf <- samdf[rownames(seqtabAll), keep.cols]
> head(samdf)
> #summary(samdf)
> rownames(samdf)
 [1] "NA"    "NA.1"  "NA.2"  "NA.3"  "NA.4"  "NA.5"  "NA.6"  "NA.7"  "NA.8"  "NA.9"  "NA.10"
[12] "NA.11" "NA.12" "NA.13" "NA.14" "NA.15" "NA.16" "NA.17" "NA.18"
> all(rownames(seqtabAll) %in% samdf$SampleID) # TRUE
[1] FALSE

benjjneb commented 6 years ago

rownames(samdf) <- samdf$SampleIDlib

Might just be a typo? I.e., remove the extraneous "lib" from the end of that line.
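
For anyone who hits the same symptom, here is a rough sketch of why the typo produces "NA", "NA.1", ... row names rather than an error. It uses a hypothetical two-row data frame in base R rather than the real metadata, and `typo_value`, `bad` and `caught` are illustrative names. The `$` operator returns NULL for a column that does not exist, assigning NULL to rownames() resets them to automatic numbering, and the later lookup by sample name then finds nothing.

```r
# Toy data frame (hypothetical) using the same column name as in the thread.
samdf <- data.frame(SampleID = c("F3D0", "F3D1"), age = c(0, 1))

# `$` quietly returns NULL for a column that does not exist.
typo_value <- samdf$SampleIDlib
print(is.null(typo_value))

# Assigning NULL resets the row names to automatic numbering.
rownames(samdf) <- typo_value
print(rownames(samdf))

# Indexing by names that are no longer row names yields all-NA rows.
bad <- samdf[c("F3D0", "F3D1"), ]
print(rownames(bad))

# Matrix-style column indexing fails loudly on a misspelled name instead.
caught <- tryCatch(samdf[, "SampleIDlib"],
                   error = function(e) conditionMessage(e))
print(caught)
```

Selecting the column with `samdf[, "SampleID"]`, or checking `"SampleID" %in% colnames(samdf)` first, would have surfaced the misspelling at the line where it happened.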

abalter commented 6 years ago

Just figured that out! Doh!