Open jrherr opened 7 years ago
Hello @jrherr! No problem!
The following code will parse that data:
# Read data into R
input <- "k__Archaea; p__Crenarchaeota; c__Thermoprotei; o__YNPFFA; f__SK322; g__; s__
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
k__Bacteria; p__Actinobacteria; c__Thermoleophilia; o__Gaiellales; f__Gaiellaceae; g__; s__
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
k__Bacteria; p__Chlamydiae; c__Chlamydiia; o__Chlamydiales; f__; g__; s__
k__Bacteria; p__Verrucomicrobia; c__Pedosphaerae; o__Pedosphaerales; f__Ellin515; g__; s__
k__Archaea; p__Crenarchaeota; c__MBGA; o__; f__; g__; s__
k__Archaea; p__Crenarchaeota; c__MBGA; o__NRP-J; f__; g__; s__"
input <- strsplit(input, split = "\n")[[1]] # You would probably use `scan` or `readLines` to do this from a file
# Extract taxonomy
library(metacoder)
data <- extract_taxonomy(input,
                         key = "class",
                         class_regex = "^(.*)__(.*)$",
                         class_key = c(rank = "taxon_info", "name"),
                         class_sep = "; ")
# Test plot
heat_tree(data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)
> data
`taxmap` object with data for 47 taxa and 9 observations:
------------------------------------------------------------------------ taxa ------------------------------------------------------------------------
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 ... 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
--------------------------------------------------------------------- taxon_data ---------------------------------------------------------------------
# A tibble: 47 × 4
taxon_ids supertaxon_ids rank name
<chr> <chr> <chr> <chr>
1 1 <NA> k Archaea
2 2 <NA> k Bacteria
3 3 1 p Crenarchaeota
4 4 3 c MBGA
5 5 3 c Thermoprotei
6 6 4 o
7 7 4 o NRP-J
# ... with 40 more rows
---------------------------------------------------------------------- obs_data ----------------------------------------------------------------------
# A tibble: 9 × 1
obs_taxon_ids
<chr>
1 17
2 27
3 27
4 32
5 37
6 42
7 47
# ... with 2 more rows
--------------------------------------------------------------------- taxon_funcs ---------------------------------------------------------------------
n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
Because each line contains only a classification string, you don't need the `regex`
option; it defaults to matching the whole input.
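To make the roles of `class_sep` and `class_regex` in the call above more concrete, here is a minimal base-R sketch of the same idea: read the lines from a file, split each one on the separator, and apply the pattern to each piece. This is only an illustration, not how metacoder implements it internally; the temporary file, the `parse_piece` helper, and the variable names are made up for the example.

```r
# Two lines in the same format as the QIIME taxonomy strings above
lines_out <- c(
  "k__Archaea; p__Crenarchaeota; c__MBGA; o__NRP-J; f__; g__; s__",
  "k__Bacteria; p__Chlamydiae; c__Chlamydiia; o__Chlamydiales; f__; g__; s__"
)

# Write them to a temporary file and read them back, one element per line
path <- tempfile(fileext = ".txt")
writeLines(lines_out, path)
input <- readLines(path)

# Split each classification on the separator (the role of `class_sep`)
pieces <- strsplit(input, split = "; ", fixed = TRUE)

# Apply the pattern to each piece (the role of `class_regex`):
# capture group 1 is the rank code, capture group 2 is the taxon name
parse_piece <- function(x) {
  m <- regmatches(x, regexec("^(.*)__(.*)$", x))
  data.frame(
    rank = vapply(m, function(g) g[2], character(1)),
    name = vapply(m, function(g) g[3], character(1)),
    stringsAsFactors = FALSE
  )
}
parsed <- lapply(pieces, parse_piece)
parsed[[1]]
```

Ranks with no assigned name, such as `f__`, come back with an empty string in the `name` column, which is why taxa with blank names show up in the printed `taxon_data` table.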
Do you think I should add this example to the documentation somewhere? The extract taxonomy tutorial is already pretty long:
https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html
Maybe I should split out some of the examples there to a new "Examples" or "FAQ" section on the website documentation?
@jrherr, did this work for you? I am working on adding parsers for each standard format I come across to make things easier. I noticed that QIIME outputs the BIOM format and I just made a parser for that. What command did you get this input from? Thanks!
I was just looking to see if metacoder had BIOM (https://biom-format.org/) import functionality. It would fit nicely under https://grunwaldlab.github.io/metacoder_documentation/workshop--03--parsing.html in place of a plain text "Abundance Matrix" and possibly also the "OTU Data Table" (if the BIOM file includes taxonomic assignments).
It would be nice to have that as an option in the workshop material. I think having the plain text as well is useful since a lot of methods result in data in that format (e.g. dada2). I actually have not used QIIME, so I have not used the BIOM format much. I probably should look into the format more. Thanks for the suggestion!
Hi @zachary-foster!
Thanks so much for your package!
I'm trying to parse the taxonomic information output from QIIME, and I am wondering what the best way to do this is. Using command line tools, I have a file with thousands of lines that looks like this:
Duplicate lines with identical taxonomy are OTUs clustering at 97% with similar taxonomic resolution.
I've been banging my head trying to find a way to parse this with metacoder. Can you provide best practices for this type of file? Thanks!