Open hehouts opened 1 year ago
From @bluegenes in slack:
do you want a table of rows = taxon and columns = samples?
if yes, running
sourmash tax metagenome
on gather results from all your samples and using output format (lineage_summary
) should get you that. You'll need to select a taxonomic rank (e.g. species, genus, etc), though. I'll find some documentation for youhttps://sourmash.readthedocs.io/en/latest/command-line.html#lineage-summary-output-format
lineage_summary
output format
The lineage summary format is most useful when comparing across metagenome queries. Each row is a lineage at the desired reporting rank. The columns are each query used for gather, with the fraction match reported for each lineage. This format is commonly used as input for many external multi-sample visualization tools.
To generate lineage_summary
, we add --output-format lineage_summary
to the summarize command, and need to specify a rank to summarize. Here’s the command for reporting lineage_summary
for two queries (HSMA33MX, PSM6XBW3) summary at species
level.
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format lineage_summary --rank species
example lineage_summary
:
lineage | HSMA33MX | PSM6XBW3 |
---|---|---|
dBacteria;pBacteroidota;cBacteroidia;oBacteroidales;fBacteroidaceae;gPhocaeicola;s__Phocaeicola vulgatus | 0.015637726014008795 | 0.015642822225843248 |
dBacteria;pBacteroidota;cBacteroidia;oBacteroidales;fBacteroidaceae;gPrevotella;s__Prevotella copri | 0.05701254275940707 | 0.05703112269838684 |
dBacteria;pProteobacteria;cGammaproteobacteria;oEnterobacterales;fEnterobacteriaceae;gEscherichia;s__Escherichia coli | 0.05815279361459521 | 0.05817174515235457 |
To produce multiple output types from the same command, add the types into the --output-format
argument, e.g.--output-format summary lineage_summary krona
this isn't an R dataframe, but you could read this into one with readr::read_csv()
Hello @hehouts , thanks for the help!
So my idea was not to collapse at species level but to keep strains, because I want to see which functions are associated to those strains. So I think I am interested in the pure abundance from each match rather than collapsing at species level, it that makes sense?
Hello @hehouts , thanks for the help!
So my idea was not to collapse at species level but to keep strains, because I want to see which functions are associated to those strains. So I think I am interested in the pure abundance from each match rather than collapsing at species level, it that makes sense?
so... you can do this then directly on the gather table without taxonomy, but you won't get the benefit of taxonomic annotations in the same table.
this is an interesting use case for us to think about, @bluegenes!
ok - so since we don't have strains in our taxonomy lineages, what we could do is add a genome
"rank", which would allow you to "summarize" at the genome level. I'll see if I can get it working!
@bluegenes and @ctb Thanks for the help, and yes I said strain but I meant different genomes (as I never really know what exactly is a strain). My idea is once I have the best matches for each genome I could go on and look if they contain any of the functions I am interested in.
Another question I have is, to do this would be better to use K =51 as it would be more precise at genome level? what are the drawbacks of k=51? just longer computation time?
Longer, more specific k-mers => you will not classify as many against database. k=21 would let you classify slightly more. No significant impact on computing time either way.
this is the code I used to import the gather results into R i used. It creates three dataframes, one for average abundance, one for standard abundance and one for intersect bp.
list_of_files = list.files("/Users/gabri/Library/CloudStorage/OneDrive-UniversityofArizona/Meta_gen_Daniel/MAGs/STAMPs_sourmash/", pattern = "match.csv")
list_of_names = sub("_match.csv", "", list_of_files)
listOfFiles <- lapply(list.files(), function(x) read.csv(x))
std_abund = lapply(listOfFiles,function(x) dplyr::select(x,c("name","std_abund"))) %>% ##Select only the columns I need
purrr::reduce(full_join, by='name') %>%`colnames<-`(c("name",list_of_names))##Join files by name
std_abund[is.na(std_abund)] = 0
avg_abund = lapply(listOfFiles,function(x) dplyr::select(x,c("name","average_abund"))) %>% ##Select only the columns I need
purrr::reduce(full_join, by='name') %>%`colnames<-`(c("name",list_of_names))##Join files by name
avg_abund[is.na(avg_abund)] = 0
intersect_bp = lapply(listOfFiles,function(x) dplyr::select(x,c("name","intersect_bp"))) %>% ##Select only the columns I need (
purrr::reduce(full_join, by='name') %>%`colnames<-`(c("name",list_of_names))##Join files by name
intersect_bp[is.na(intersect_bp)] = 0
Question from stamps attendee Gabri (@mrgambero):
this code definitely exists somewhere (and is probably in sourmash consumr ), but I have to go find it :)
(cc @ctb @bluegenes )