joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
586 stars 186 forks source link

Importing atypical data #1592

Open lauramason326 opened 2 years ago

lauramason326 commented 2 years ago

Hello, I am trying to import a file into phyloseq that does not look like the files in the tutorials. The samples are repeated mulitple times, as are the sequences. I would like to treat the sequences as OTU_IDs, and somehow manipulate the file to get a "regular" OTU table with samples across the top, OTU_IDs in the first column, and counts within. Can you help me?

        sample                                                     sequence num_hits          X1x                  X2x
1 CSA4R.raw_R1 GTGCCGCTGCCGACGGAGAAGAACGTGTACTGCGTCATCCGCTCGCCGCACAAGTACAAG      999 d__Bacteria; p__Actinobacteriota;
2 CSA1D.raw_R1 GTGCCGCTGCCGACGGAGAAGAACGTGTACTGCGTCATCCGCTCGCCGCACAAGTACAAG      686 d__Bacteria; p__Actinobacteriota;
3 CSC4D.raw_R1 GTGCCGCTGCCGACGGAGAAGAACGTGTACTGCGTCATCCGCTCGCCGCACAAGTACAAG      792 d__Bacteria; p__Actinobacteriota;
4 CSC1D.raw_R1 GTGCCGCTGCCGACGGAGAAGAACGTGTACTGCGTCATCCGCTCGCCGCACAAGTACAAG      626 d__Bacteria; p__Actinobacteriota;
5 COA4R.raw_R1 GTGCCGCTGCCGACGGAGAAGAACGTGTACTGCGTCATCCGCTCGCCGCACAAGTACAAG      526 d__Bacteria; p__Actinobacteriota;
6 CSA2D.raw_R1 GTGCCGCTGCCGACGGAGAAGAACGTGTACTGCGTCATCCGCTCGCCGCACAAGTACAAG      525 d__Bacteria; p__Actinobacteriota;
                X3x                 X4x                    X5x  X6x  X7x
1 c__Actinomycetia; o__Mycobacteriales; f__Geodermatophilaceae <NA> <NA>
2 c__Actinomycetia; o__Mycobacteriales; f__Geodermatophilaceae <NA> <NA>
3 c__Actinomycetia; o__Mycobacteriales; f__Geodermatophilaceae <NA> <NA>
4 c__Actinomycetia; o__Mycobacteriales; f__Geodermatophilaceae <NA> <NA>
5 c__Actinomycetia; o__Mycobacteriales; f__Geodermatophilaceae <NA> <NA>
6 c__Actinomycetia; o__Mycobacteriales; f__Geodermatophilaceae <NA> <NA>

Thanks! Laura

ycl6 commented 2 years ago

Hi @lauramason326

Below I am using the demo dataset GlobalPatterns to first construct a data.frame similar to your data structure. You can adapt the rest of the code to your real data.

library(phyloseq)
library(tidyverse)
library(stringr) # str_pad

data(GlobalPatterns)
df <- GlobalPatterns %>% psmelt()
# Think of OTU as the sequence info in your data
df <- df[,c("Sample","OTU","Abundance","Kingdom","Phylum","Class","Order","Family","Genus","Species")]
> head(df, 10)
         Sample    OTU Abundance  Kingdom         Phylum               Class           Order          Family          Genus                   Species
406582   AQC4cm 549656   1177685 Bacteria  Cyanobacteria         Chloroplast   Stramenopiles            <NA>           <NA>                      <NA>
241435 LMEpi24M 279599    914209 Bacteria  Cyanobacteria    Nostocophycideae      Nostocales     Nostocaceae Dolichospermum                      <NA>
406580   AQC7cm 549656    711043 Bacteria  Cyanobacteria         Chloroplast   Stramenopiles            <NA>           <NA>                      <NA>
406574   AQC1cm 549656    554198 Bacteria  Cyanobacteria         Chloroplast   Stramenopiles            <NA>           <NA>                      <NA>
329873  M31Tong 360229    540850 Bacteria Proteobacteria  Betaproteobacteria    Neisseriales   Neisseriaceae      Neisseria                      <NA>
300794  M11Fcsw 331820    452219 Bacteria  Bacteroidetes         Bacteroidia   Bacteroidales  Bacteroidaceae    Bacteroides                      <NA>
494797  M31Tong  94166    396201 Bacteria Proteobacteria Gammaproteobacteria  Pasteurellales Pasteurellaceae    Haemophilus Haemophilusparainfluenzae
300772  M31Fcsw 331820    354695 Bacteria  Bacteroidetes         Bacteroidia   Bacteroidales  Bacteroidaceae    Bacteroides                      <NA>
298689 SLEpi20M 329744    323914 Bacteria Actinobacteria      Actinobacteria Actinomycetales          ACK-M1           <NA>                      <NA>
114279     TS29 189047    251215 Bacteria     Firmicutes          Clostridia   Clostridiales Ruminococcaceae           <NA>                      <NA>

Below I convert Sample and OTU to factor, rename OTU ID to ASV_xxxxx and sort the data.frame.

df$Sample <- as.factor(df$Sample)
df$OTU <- as.factor(df$OTU)
levels(df$OTU) <- paste0("ASV_", str_pad(1:nlevels(df$OTU), nchar(nlevels(df$OTU)), pad = "0"))
df <- df[order(df$Sample, df$OTU),]
> head(df, 10)
    Sample       OTU Abundance  Kingdom         Phylum               Class             Order            Family             Genus Species
11  AQC1cm ASV_00001         0 Bacteria Actinobacteria      Actinobacteria   Actinomycetales Microbacteriaceae          Cryocola    <NA>
36  AQC1cm ASV_00002         0 Bacteria Proteobacteria Alphaproteobacteria  Sphingomonadales Sphingomonadaceae              <NA>    <NA>
65  AQC1cm ASV_00003         0 Bacteria Proteobacteria Deltaproteobacteria Desulfobacterales  Desulfobulbaceae              <NA>    <NA>
88  AQC1cm ASV_00004         1 Bacteria Proteobacteria Gammaproteobacteria     Thiotrichales              <NA>              <NA>    <NA>
128 AQC1cm ASV_00005         0 Bacteria Proteobacteria Gammaproteobacteria     Thiotrichales    Thiotrichaceae         Thiothrix    <NA>
144 AQC1cm ASV_00006         0 Bacteria     Firmicutes          Clostridia     Clostridiales   Lachnospiraceae      Ruminococcus    <NA>
165 AQC1cm ASV_00007         1 Bacteria Proteobacteria  Betaproteobacteria     Rhodocyclales    Rhodocyclaceae Methyloversatilis    <NA>
198 AQC1cm ASV_00008         1 Bacteria  Bacteroidetes       Flavobacteria              <NA>              <NA>              <NA>    <NA>
230 AQC1cm ASV_00009         1 Bacteria Proteobacteria  Betaproteobacteria   Burkholderiales    Comamonadaceae              <NA>    <NA>
246 AQC1cm ASV_00010        11 Bacteria Proteobacteria  Betaproteobacteria   Burkholderiales  Burkholderiaceae       Cupriavidus    <NA>

They we build the required data for phyloseq. In sam, you can add other sample variables you have to the data.frame at this point.

# sample_data
sam <- data.frame(SampleID = levels(df$Sample), row.names = levels(df$Sample))

# tax_table
tax <- unique(df[,c("OTU","Kingdom","Phylum","Class","Order","Family","Genus","Species")])
rownames(tax) <- tax$OTU
tax <- as.matrix(tax[,-1]) # remove OTU column

# otu_table
otu <- df %>% select(OTU, Sample, Abundance) %>% spread(Sample, Abundance)
rownames(otu) <- otu$OTU
otu <- otu[,-1] # remove OTU column

# Check rownames and colnames matched
all(rownames(otu) == rownames(tax_table(tax))) # TRUE
all(colnames(otu) == rownames(sample_data(sam))) # TRUE

ps <- phyloseq(otu_table(otu, taxa_are_rows = TRUE), tax_table(tax), sample_data(sam))
> head(sam)
       SampleID
AQC1cm   AQC1cm
AQC4cm   AQC4cm
AQC7cm   AQC7cm
CC1         CC1
CL3         CL3
Even1     Even1

> head(tax)
          Kingdom    Phylum           Class                 Order               Family              Genus          Species
ASV_00001 "Bacteria" "Actinobacteria" "Actinobacteria"      "Actinomycetales"   "Microbacteriaceae" "Cryocola"     NA     
ASV_00002 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Sphingomonadales"  "Sphingomonadaceae" NA             NA     
ASV_00003 "Bacteria" "Proteobacteria" "Deltaproteobacteria" "Desulfobacterales" "Desulfobulbaceae"  NA             NA     
ASV_00004 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Thiotrichales"     NA                  NA             NA     
ASV_00005 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Thiotrichales"     "Thiotrichaceae"    "Thiothrix"    NA     
ASV_00006 "Bacteria" "Firmicutes"     "Clostridia"          "Clostridiales"     "Lachnospiraceae"   "Ruminococcus" NA     

> head(otu)
          AQC1cm AQC4cm AQC7cm CC1 CL3 Even1 Even2 Even3 F21Plmr LMEpi24M M11Fcsw M11Plmr M11Tong M31Fcsw M31Plmr M31Tong NP2 NP3
ASV_00001      0      0      0   0   0     0     0     0       0        0       0       0       0       0       0       0   0   0
ASV_00002      0      0      0   0   0     0     0     0       0        0       0       0       0       0       0       0   0   0
ASV_00003      0      3      1   0   0     0     0     1       0        0       0       0       0       0       0       0   0   4
ASV_00004      1      7      1 136   5     0     1     0       0        0       0       0       0       0       0       0   0   0
ASV_00005      0      0      0   0   0     0     1     0       0        0       0       2       0       0       0       0   0   0
ASV_00006      0      0      0   0   0     0     2     1       0        0       0       0       0      58       0       0   0   0
          NP5 SLEpi20M SV1 TRRsed1 TRRsed2 TRRsed3 TS28 TS29
ASV_00001   0        0   7       0       0       0    0    0
ASV_00002   0        0   7       0       0       0    0    0
ASV_00003   0        0   0       6      71      44    0    0
ASV_00004   0        0   0       2       2       0    0    0
ASV_00005   0        0   0       0       0       0    0    0
ASV_00006   0        0   0       0       0       0    1    0

> ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
sample_data() Sample Data:       [ 26 samples by 1 sample variables ]
tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]