Problem with own dataset

lfenske-93 commented 8 months ago

Hi,

I stumbled across your tool and have the feeling that it is just right for my application, but even with the help of your documentation I haven't quite figured out if and how I can use it for my data set.

I have taxonomic data from GTDBtk and my aim is to map the bias within this data, i.e. to show which taxa are particularly abundant.

My dataset looks like this, with all columns tab-separated. And I'm trying to find out how exactly I can convert this data set into a taxmap object or what I need to do first.

domain | phylum | class | order | family | genus | species
Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Secundilactobacillus | oryzae
Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Secundilactobacillus | oryzae
Bacteria | Actinobacteriota | Actinomycetia | Mycobacteriales | Mycobacteriaceae | Corynebacterium | glutamicum
(...)

This was my first attempt, but it doesn't look right, and I'm struggling to create a heat_tree from it:

> tax_info_obj <- parse_tax_data(tax_info,class_sep = "\t")
> print(tax_info_obj)
<Taxmap>
  1 taxa: b. Bacteria
  1 edges: NA->b
  1 data sets:
    tax_data:
      # A tibble: 657,752 x 8
        taxon_id gtdbtk.classification~1 gtdbtk.classificatio~2 gtdbtk.classificatio~3 gtdbtk.classificatio~4 gtdbtk.classificatio~5
        <chr>    <chr>                   <chr>                  <chr>                  <chr>                  <chr>                 
      1 b        Bacteria                Firmicutes             Bacilli                Lactobacillales        Lactobacillaceae      
      2 b        Bacteria                Firmicutes             Bacilli                Lactobacillales        Lactobacillaceae      
      3 b        Bacteria                Firmicutes             Bacilli                Lactobacillales        Streptococcaceae      
      # i 657,749 more rows
      # i abbreviated names: 1: gtdbtk.classification.domain, 2: gtdbtk.classification.phylum, 3: gtdbtk.classification.class,
      #   4: gtdbtk.classification.order, 5: gtdbtk.classification.family
      # i 2 more variables: gtdbtk.classification.genus <chr>, species_epithet <chr>
      # i Use `print(n = ...)` to see more rows
  0 functions:

Perhaps one of you can give me a brief idea of how exactly I can work with my data set. It seems to be a relatively simple example, but unfortunately I'm still stuck.

Kind regards, Linda

dhadsell commented 8 months ago

I had a similar problem getting my data into the package. I started out by trying to follow the "EXAMPLE ANALYSIS" on the following page: https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/index.html. Then I went to the WORK SHOP menu on the same page and found a collection of pages that take you step by step through the analysis. I found the "REQUIRED DATASETS" page very helpful! It has example files that illustrate exactly the format you need. I also made sure to install all the necessary software and dependencies listed on the REQUIRED SOFTWARE page. After that I just followed the remaining pages and had success. Hope this helps.

zachary-foster commented 8 months ago

This should be doable with parse_tax_data. It is extremely flexible and I recommend looking at its help page and examples. I can recommend a way to parse your data, but I am confused about the format. You say:

My dataset looks like this, with all columns tab-separated.

But it looks like it is separated by |. Can you attach a subset of the data as a file? Thanks!

lfenske-93 commented 8 months ago

Hi,

Sorry for the confusion. The dataset is tab-separated; I just tried to post a somewhat understandable example here. 😅

I attached a subset of my dataset. I tried with parse_tax_data but I probably just didn't do it quite right.

Many thanks for your help! Best regards, Linda

taxinfo_subset.csv

zachary-foster commented 8 months ago

No worries! Thanks for the example data. Here is how to parse that data format:

library(metacoder)
#> This is metacoder version 0.3.6 (stable)
library(readr)

raw_data <- read_tsv('~/Downloads/taxinfo_subset.csv')
#> Rows: 28 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (7): domain, phylum, class, order, family, genus, species
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
x <- parse_tax_data(raw_data, class_cols = 1:7, named_by_rank = TRUE)
print(x)
#> <Taxmap>
#>   44 taxa: ab. Bacteria, ac. Firmicutes ... br. coli, bs. subtilis
#>   44 edges: NA->ab, ab->ac, ab->ad ... be->bq, bf->br, bg->bs
#>   1 data sets:
#>     tax_data:
#>       # A tibble: 28 × 8
#>         taxon_id domain   phylum     class  order family genus species
#>         <chr>    <chr>    <chr>      <chr>  <chr> <chr>  <chr> <chr>  
#>       1 bh       Bacteria Firmicutes Bacil… Lact… Lacto… Pauc… hokkai…
#>       2 bi       Bacteria Firmicutes Bacil… Lact… Lacto… Secu… oryzae 
#>       3 bj       Bacteria Firmicutes Bacil… Lact… Strep… Stre… pyogen…
#>       # ℹ 25 more rows
#>   0 functions:
heat_tree(x, node_label = taxon_names, node_size = n_obs, node_color = n_obs)

Created on 2023-11-17 with reprex v2.0.2
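
Since the stated goal is to see which taxa are particularly abundant, the per-rank counts can also be tabulated in base R as a quick sanity check before plotting. This is only a minimal sketch and not part of metacoder: `tax_info` is a hypothetical stand-in built from the three example rows in the first post, and `counts_by_rank` is an illustrative name.

```r
# Hypothetical stand-in for the real table: the three example rows from the first post.
tax_info <- data.frame(
  domain  = c("Bacteria", "Bacteria", "Bacteria"),
  phylum  = c("Firmicutes", "Firmicutes", "Actinobacteriota"),
  class   = c("Bacilli", "Bacilli", "Actinomycetia"),
  order   = c("Lactobacillales", "Lactobacillales", "Mycobacteriales"),
  family  = c("Lactobacillaceae", "Lactobacillaceae", "Mycobacteriaceae"),
  genus   = c("Secundilactobacillus", "Secundilactobacillus", "Corynebacterium"),
  species = c("oryzae", "oryzae", "glutamicum"),
  stringsAsFactors = FALSE
)

# For each rank (column), count rows per taxon name and sort from most to least frequent.
counts_by_rank <- lapply(tax_info, function(col) sort(table(col), decreasing = TRUE))

# Most frequent phyla and genera in this toy table.
print(counts_by_rank$phylum)
print(counts_by_rank$genus)
```

One caveat with this shortcut: it counts by name within a single column, so a species epithet that occurs in more than one genus would be merged into one count. The taxmap object avoids that because each taxon is tied to its full lineage.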

lfenske-93 commented 8 months ago

Many thanks! Then I wasn't that far off with my attempt after all.
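
A likely explanation for the single taxon in the first attempt, assuming `class_cols` defaults to the first column when it is not given: once the file has been read into separate columns, `class_sep = "\t"` is only applied to the text inside that one column, and "Bacteria" contains no tab to split on. The base R sketch below illustrates the idea with a shortened, hypothetical row; it does not call metacoder.

```r
# One data row as it looks after the file has been read into columns
# (shortened to three ranks; values taken from the example in the first post).
row <- c(domain = "Bacteria", phylum = "Firmicutes", class = "Bacilli")

# Splitting only the first column on a tab finds nothing to split,
# so the whole "classification" is the single name "Bacteria".
first_col_split <- strsplit(row[["domain"]], "\t", fixed = TRUE)[[1]]
print(first_col_split)

# Treating every column as one rank keeps the full lineage instead,
# which is what passing all columns as classification columns achieves.
lineage <- unname(row)
print(lineage)
```

This is why the working call lists every rank column via `class_cols = 1:7` rather than relying on a separator.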

I'm looking forward to playing around with it a bit, great thing you've created. ❤️

zachary-foster commented 8 months ago

No problem! Thank you!

grunwaldlab / metacoder

Problem with own dataset #360