sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

sourmash signature parser in R #1199

Open taylorreiter opened 4 years ago

taylorreiter commented 4 years ago

Dumping this here so it's recorded somewhere. This reads in a signature file containing multiple minhashes, as written by sourmash gather with the --save-matches flag.

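library(dplyr)
library(rjson)

# Parse the signature file; the result is a list with one element per signature
sig_json <- fromJSON(file = "sandbox/test_megahit_diginorm_nocat/sandbox/try_comp_hashes_to_assemblies/cd_up_matches.sig")
hash_to_pangenome <- data.frame()
# seq_along() also behaves correctly when the list is empty, unlike 1:length()
for (i in seq_along(sig_json)) {
  sig <- sig_json[i]
  # Flatten the nested signature fields into data frame columns
  df <- as.data.frame(sig)
  hash_to_pangenome <- rbind(hash_to_pangenome, df)
}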

I only needed a subset of the information and did things that are super specific to my project (like separating out the file name), so I'm including below what I actually ran and the resulting dataframe:

library(dplyr)
library(rjson)
library(tidyr)

sig_json <- fromJSON(file = "sandbox/test_megahit_diginorm_nocat/sandbox/try_comp_hashes_to_assemblies/cd_up_matches.sig")
hash_to_pangenome <- data.frame()
for (i in seq_along(sig_json)) {
  sig <- sig_json[i]
  df <- as.data.frame(sig) %>%
    # Keep only the source file, the sequence name, and the hashes
    select(filename, name, signatures.mins) %>%
    # Store hashes as strings
    mutate(hash = as.character(signatures.mins)) %>%
    select(-signatures.mins) %>%
    # Split the name at the first space into sequence ID and prokka annotation
    separate(name, into = c("aa_seq", "prokka"), sep = " ", extra = "merge")
  hash_to_pangenome <- rbind(hash_to_pangenome, df)
}

                                         filename        aa_seq               prokka             hash
1 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn    4013_01001 hypothetical protein 4536791505286563
2 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn 5004-01_01762 hypothetical protein 4156017257449275
3 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn 6005-01_00904 hypothetical protein 1656917721722184
4 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn  G36382_01696 hypothetical protein 3428886009462951
5 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn  G36382_01696 hypothetical protein 4124012860640125
6 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn  G36382_01696 hypothetical protein 8257410286124576
taylorreiter commented 3 years ago

More JSON parsing in R that may be helpful in the future. This one doesn't require a for loop, so it should be much faster. I haven't tested it on signatures yet, though.

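library(dplyr)
library(jsonlite)  # provides read_json()

carpentries_json <- read_json("https://feeds.carpentries.org/dc_past_workshops.json")
# Turn each record into a one-row matrix, then bind all rows in a single call
carpentries_df <- do.call(rbind, lapply(carpentries_json, rbind))
carpentries_df <- carpentries_df %>%
  as.data.frame() %>%
  mutate_all(as.character)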
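
A minimal, untested-on-real-data sketch of how the same loop-free pattern might carry over to signatures. The inline JSON is an invented toy record, not a real .sig file; its field names (filename, name, signatures, mins) are assumed from the columns used earlier in this thread, and sig_to_df is a hypothetical helper. Real hashes are large integers, so check how the parser represents them before relying on the string conversion.

```r
library(jsonlite)

# Toy stand-in for a .sig file: a JSON array of signature records.
# Field names are assumed from the columns used above; values are invented.
sig_text <- '[
  {"filename": "a.ffn", "name": "g1 hypothetical protein",
   "signatures": [{"ksize": 31, "mins": [11, 22, 33]}]},
  {"filename": "a.ffn", "name": "g2 hypothetical protein",
   "signatures": [{"ksize": 31, "mins": [44]}]}
]'

# simplifyVector = FALSE keeps the nested list structure intact
sigs <- fromJSON(sig_text, simplifyVector = FALSE)

# Hypothetical helper: one small data frame per record, one row per hash.
# Assumes every record has at least one hash.
sig_to_df <- function(sig) {
  mins <- unlist(lapply(sig$signatures, function(s) unlist(s$mins)))
  data.frame(
    filename = sig$filename,
    name = sig$name,
    hash = as.character(mins),
    stringsAsFactors = FALSE
  )
}

# Bind all per-record data frames in one call instead of growing the
# result inside a loop
hash_df <- do.call(rbind, lapply(sigs, sig_to_df))
print(hash_df)
```

From here the select/separate steps from the earlier comment could be applied to hash_df once, rather than once per signature.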