sourmash signature parser in R

Dumping this here so it's recorded somewhere. This reads in a signature file that contains multiple MinHash sketches, as written by sourmash gather with the --save-matches flag.

library(dplyr)
library(rjson)

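# Parse the signature file; each top-level element is one saved match
sig_json <- fromJSON(file = "sandbox/test_megahit_diginorm_nocat/sandbox/try_comp_hashes_to_assemblies/cd_up_matches.sig")
hash_to_pangenome <- data.frame()
# seq_along() also behaves when the file is empty (1:length() would loop over 1 and 0)
for (i in seq_along(sig_json)) {
  sig <- sig_json[i]        # one-element list holding signature i
  df <- as.data.frame(sig)  # flatten to one row per hash
  hash_to_pangenome <- rbind(hash_to_pangenome, df)
}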

I only needed a subset of the information and did things that are very specific to my project (like splitting the name field), so I'm including below what I actually ran and the resulting data frame:

library(dplyr)
library(rjson)
library(tidyr)
sig_json <- fromJSON(file = "sandbox/test_megahit_diginorm_nocat/sandbox/try_comp_hashes_to_assemblies/cd_up_matches.sig")
hash_to_pangenome <- data.frame()
for (i in seq_along(sig_json)) {
  sig <- sig_json[i]
  df <- as.data.frame(sig) %>%
    # keep only the source file, the sequence name, and the hashes
    select(filename, name, signatures.mins) %>%
    # store hashes as strings so they are not printed in scientific notation
    mutate(hash = as.character(signatures.mins)) %>%
    select(-signatures.mins) %>%
    # split name at the first space: sequence ID, then the prokka annotation
    separate(name, into = c("aa_seq", "prokka"), sep = " ", extra = "merge")
  hash_to_pangenome <- rbind(hash_to_pangenome, df)
}

                                         filename        aa_seq               prokka             hash
1 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn    4013_01001 hypothetical protein 4536791505286563
2 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn 5004-01_01762 hypothetical protein 4156017257449275
3 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn 6005-01_00904 hypothetical protein 1656917721722184
4 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn  G36382_01696 hypothetical protein 3428886009462951
5 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn  G36382_01696 hypothetical protein 4124012860640125
6 GCF_900036035.1_RGNV35913_genomic.fna.cdhit.ffn  G36382_01696 hypothetical protein 8257410286124576
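
The loops above rely on as.data.frame() to flatten each nested signature and grow the data frame with rbind() on every iteration. A possible alternative, sketched here under assumptions, builds one small data frame per signature explicitly and binds them once at the end. The JSON below is a made-up two-signature example, not real sourmash output: its field names (filename, name, signatures, mins) are inferred from the column names used above, and all values are hypothetical.

```r
library(rjson)

# Hypothetical stand-in for a .sig file; only the fields used above are assumed.
json_str <- '[
  {"filename": "a.ffn", "name": "gene_1 hypothetical protein",
   "signatures": [{"ksize": 31, "mins": [11, 22, 33]}]},
  {"filename": "a.ffn", "name": "gene_2 hypothetical protein",
   "signatures": [{"ksize": 31, "mins": [44]}]}
]'
sig_json <- fromJSON(json_str)

# Build one data frame per signature, with one row per hash.
per_sig <- lapply(sig_json, function(sig) {
  # Collect the hashes from every sketch stored under this signature
  mins <- unlist(lapply(sig$signatures, function(s) s$mins))
  data.frame(
    filename = sig$filename,
    name = sig$name,
    hash = as.character(mins),
    stringsAsFactors = FALSE
  )
})

# Bind all pieces in a single call instead of growing the result in a loop
hash_to_pangenome <- do.call(rbind, per_sig)
```

Pulling the fields out by name makes the resulting column names independent of how as.data.frame() happens to flatten the nested list, at the cost of hard-coding the fields of interest.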
