sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

output gather table into an R dataframe #2692

Open hehouts opened 1 year ago

hehouts commented 1 year ago

Question from stamps attendee Gabri (@mrgambero):

is there a script to turn sourmash gather results csvs into an R dataframe with a column for each of your samples?

this code definitely exists somewhere (and is probably in sourmash consumr ), but I have to go find it :)

(cc @ctb @bluegenes )

hehouts commented 1 year ago

From @bluegenes in slack:

do you want a table of rows = taxon and columns = samples?

if yes, running sourmash tax metagenome on gather results from all your samples and using output format (lineage_summary) should get you that. You'll need to select a taxonomic rank (e.g. species, genus, etc), though. I'll find some documentation for you

https://sourmash.readthedocs.io/en/latest/command-line.html#lineage-summary-output-format

lineage_summary output format The lineage summary format is most useful when comparing across metagenome queries. Each row is a lineage at the desired reporting rank. The columns are each query used for gather, with the fraction match reported for each lineage. This format is commonly used as input for many external multi-sample visualization tools. To generate lineage_summary, we add --output-format lineage_summary to the summarize command, and need to specify a rank to summarize. Here’s the command for reporting lineage_summary for two queries (HSMA33MX, PSM6XBW3) summary at species level.

sourmash tax metagenome
    --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
    --gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
    --taxonomy gtdb-rs202.taxonomy.v2.csv \
    --output-format lineage_summary --rank species

example lineage_summary:

lineage HSMA33MX PSM6XBW3
dBacteria;pBacteroidota;cBacteroidia;oBacteroidales;fBacteroidaceae;gPhocaeicola;s__Phocaeicola vulgatus 0.015637726014008795 0.015642822225843248
dBacteria;pBacteroidota;cBacteroidia;oBacteroidales;fBacteroidaceae;gPrevotella;s__Prevotella copri 0.05701254275940707 0.05703112269838684
dBacteria;pProteobacteria;cGammaproteobacteria;oEnterobacterales;fEnterobacteriaceae;gEscherichia;s__Escherichia coli 0.05815279361459521 0.05817174515235457

To produce multiple output types from the same command, add the types into the --output-format argument, e.g.--output-format summary lineage_summary krona

this isn't an R dataframe, but you could read this into one with readr::read_csv()

gabridinosauro commented 1 year ago

Hello @hehouts , thanks for the help!

So my idea was not to collapse at species level but to keep strains, because I want to see which functions are associated to those strains. So I think I am interested in the pure abundance from each match rather than collapsing at species level, it that makes sense?

ctb commented 1 year ago

Hello @hehouts , thanks for the help!

So my idea was not to collapse at species level but to keep strains, because I want to see which functions are associated to those strains. So I think I am interested in the pure abundance from each match rather than collapsing at species level, it that makes sense?

so... you can do this then directly on the gather table without taxonomy, but you won't get the benefit of taxonomic annotations in the same table.

this is an interesting use case for us to think about, @bluegenes!

bluegenes commented 1 year ago

ok - so since we don't have strains in our taxonomy lineages, what we could do is add a genome "rank", which would allow you to "summarize" at the genome level. I'll see if I can get it working!

gabridinosauro commented 1 year ago

@bluegenes and @ctb Thanks for the help, and yes I said strain but I meant different genomes (as I never really know what exactly is a strain). My idea is once I have the best matches for each genome I could go on and look if they contain any of the functions I am interested in.

gabridinosauro commented 1 year ago

Another question I have is, to do this would be better to use K =51 as it would be more precise at genome level? what are the drawbacks of k=51? just longer computation time?

ctb commented 1 year ago

Longer, more specific k-mers => you will not classify as many against database. k=21 would let you classify slightly more. No significant impact on computing time either way.

gabridinosauro commented 1 year ago

this is the code I used to import the gather results into R i used. It creates three dataframes, one for average abundance, one for standard abundance and one for intersect bp.

list_of_files = list.files("/Users/gabri/Library/CloudStorage/OneDrive-UniversityofArizona/Meta_gen_Daniel/MAGs/STAMPs_sourmash/", pattern = "match.csv")
list_of_names = sub("_match.csv", "", list_of_files)

listOfFiles <- lapply(list.files(), function(x) read.csv(x))
std_abund = lapply(listOfFiles,function(x) dplyr::select(x,c("name","std_abund"))) %>%   ##Select only the columns I need 
  purrr::reduce(full_join, by='name') %>%`colnames<-`(c("name",list_of_names))##Join files by name
std_abund[is.na(std_abund)] = 0

avg_abund = lapply(listOfFiles,function(x) dplyr::select(x,c("name","average_abund"))) %>%   ##Select only the columns I need 
  purrr::reduce(full_join, by='name') %>%`colnames<-`(c("name",list_of_names))##Join files by name
avg_abund[is.na(avg_abund)] = 0

intersect_bp = lapply(listOfFiles,function(x) dplyr::select(x,c("name","intersect_bp"))) %>%   ##Select only the columns I need (
  purrr::reduce(full_join, by='name') %>%`colnames<-`(c("name",list_of_names))##Join files by name
intersect_bp[is.na(intersect_bp)] = 0