Best practice to parse data from QIIME

jrherr commented 7 years ago

Hi @zachary-foster!

Thanks so much for your package!

I'm trying to parse the taxonomic information output from QIIME, and I am wondering the best way to do this. With command line tools I have a file that looks like this with thousands of lines:

k__Archaea; p__Crenarchaeota; c__Thermoprotei; o__YNPFFA; f__SK322; g__; s__
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
k__Bacteria; p__Actinobacteria; c__Thermoleophilia; o__Gaiellales; f__Gaiellaceae; g__; s__
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
k__Bacteria; p__Chlamydiae; c__Chlamydiia; o__Chlamydiales; f__; g__; s__
k__Bacteria; p__Verrucomicrobia; c__Pedosphaerae; o__Pedosphaerales; f__Ellin515; g__; s__
k__Archaea; p__Crenarchaeota; c__MBGA; o__; f__; g__; s__
k__Archaea; p__Crenarchaeota; c__MBGA; o__NRP-J; f__; g__; s__

Duplicate lines with identical taxonomy are OTUs clustering at 97% with similar taxonomic resolution.

I've been banging my head with ways to parse this with metacoder. Can you provide a best practices for this type of file? Thanks!

zachary-foster commented 7 years ago

Hello @jrherr! No problem!

The following code will parse that data:

# Read data into R
input <- "k__Archaea; p__Crenarchaeota; c__Thermoprotei; o__YNPFFA; f__SK322; g__; s__
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
k__Bacteria; p__Actinobacteria; c__Thermoleophilia; o__Gaiellales; f__Gaiellaceae; g__; s__
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
k__Bacteria; p__Chlamydiae; c__Chlamydiia; o__Chlamydiales; f__; g__; s__
k__Bacteria; p__Verrucomicrobia; c__Pedosphaerae; o__Pedosphaerales; f__Ellin515; g__; s__
k__Archaea; p__Crenarchaeota; c__MBGA; o__; f__; g__; s__
k__Archaea; p__Crenarchaeota; c__MBGA; o__NRP-J; f__; g__; s__"
input <- strsplit(input, split = "\n")[[1]] # You would probably use `scan` or `readlines` to do this from a file

# Extract taxonomy
library(metacoder)
data <- extract_taxonomy(input,
                         key = "class",
                         class_regex = "^(.*)__(.*)$",
                         class_key = c(rank = "taxon_info", "name"),
                         class_sep = "; ")

# Test plot
heat_tree(data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)

> data
`taxmap` object with data for 47 taxa and 9 observations:

------------------------------------------------------------------------ taxa ------------------------------------------------------------------------
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 ... 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47

--------------------------------------------------------------------- taxon_data ---------------------------------------------------------------------
# A tibble: 47 × 4
  taxon_ids supertaxon_ids  rank          name
      <chr>          <chr> <chr>         <chr>
1         1           <NA>     k       Archaea
2         2           <NA>     k      Bacteria
3         3              1     p Crenarchaeota
4         4              3     c          MBGA
5         5              3     c  Thermoprotei
6         6              4     o              
7         7              4     o         NRP-J
# ... with 40 more rows

---------------------------------------------------------------------- obs_data ----------------------------------------------------------------------
# A tibble: 9 × 1
  obs_taxon_ids
          <chr>
1            17
2            27
3            27
4            32
5            37
6            42
7            47
# ... with 2 more rows

--------------------------------------------------------------------- taxon_funcs ---------------------------------------------------------------------
n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies

Since only a classification string is present, you don't need the regex option, since it defaults to matching the whole input.

Do you think I should add this example to the documentation somewhere? The extract taxonomy tutorial is already pretty long:

https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html

Maybe I should split out some of the examples there to a new "Examples" or "FAQ" section on the website documentation?

zachary-foster commented 7 years ago

@jrherr, did this work for you? I am working on adding parsers for each standard format I come across to make things easier. I noticed that QIIME outputs the BIOM format and I just made a parser for that. What command did you get this input from? Thanks!

peterjc commented 1 year ago

I was just looking to see if metacoder had BIOM https://biom-format.org/ import functionality, it would fit nicely under https://grunwaldlab.github.io/metacoder_documentation/workshop--03--parsing.html in place of a plain text "Abundance Matrix" and possibly also the "OTU Data Table" (if the BIOM file includes taxonomic assignments).

zachary-foster commented 1 year ago

It would be nice to have that as an option in the workshop material. I think having the plain text as well is useful since a lot of methods result in data in that format (e.g. dada2). I actually have not used QIIME, so I have not use the BIOM format much. I probably should look into the format more. Thanks for the suggestion!

grunwaldlab / metacoder

Best practice to parse data from QIIME #135