write_biom output appears invalid due to improper types in format metadata

blankenberg commented 7 years ago

"format", "format_url", "type":, "generated_by", "date", "id"s within rows/columns, "matrix_type", etc are all wrapped in lists ([]), whereas the specification states these should be strings: http://biom-format.org/documentation/format_versions/biom-1.0.html

Files created by the write_biom method cannot be parsed by the official biom-format command-line tools, resulting in the error of TypeError: <filename> does not appear to be a BIOM file!

$ R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library('biomformat')
> packageVersion('biomformat')
[1] ‘1.2.0’
> biom_file <- system.file("extdata", "rich_sparse_otu_table.biom", package = "biomformat")
> x = read_biom(biom_file)
> outfile = tempfile()
> write_biom(x, outfile)
> system(paste0("cat ", outfile) )
{"id":{},"format":["Biological Observation Matrix 1.0.0-dev"],"format_url":["http://biom-format.org"],"type":["OTU table"],"generated_by":["QIIME revision XYZ"],"date":["2011-12-19T19:00:00"],"rows":[{"id":["GG_OTU_1"],"metadata":{"taxonomy":["k__Bacteria","p__Proteobacteria","c__Gammaproteobacteria","o__Enterobacteriales","f__Enterobacteriaceae","g__Escherichia","s__"]}},{"id":["GG_OTU_2"],"metadata":{"taxonomy":["k__Bacteria","p__Cyanobacteria","c__Nostocophycideae","o__Nostocales","f__Nostocaceae","g__Dolichospermum","s__"]}},{"id":["GG_OTU_3"],"metadata":{"taxonomy":["k__Archaea","p__Euryarchaeota","c__Methanomicrobia","o__Methanosarcinales","f__Methanosarcinaceae","g__Methanosarcina","s__"]}},{"id":["GG_OTU_4"],"metadata":{"taxonomy":["k__Bacteria","p__Firmicutes","c__Clostridia","o__Halanaerobiales","f__Halanaerobiaceae","g__Halanaerobium","s__Halanaerobiumsaccharolyticum"]}},{"id":["GG_OTU_5"],"metadata":{"taxonomy":["k__Bacteria","p__Proteobacteria","c__Gammaproteobacteria","o__Enterobacteriales","f__Enterobacteriaceae","g__Escherichia","s__"]}}],"columns":[{"id":["Sample1"],"metadata":{"BarcodeSequence":["CGCTTATCGAGA"],"LinkerPrimerSequence":["CATGCTGCCTCCCGTAGGAGT"],"BODY_SITE":["gut"],"Description":["human gut"]}},{"id":["Sample2"],"metadata":{"BarcodeSequence":["CATACCAGTAGC"],"LinkerPrimerSequence":["CATGCTGCCTCCCGTAGGAGT"],"BODY_SITE":["gut"],"Description":["human gut"]}},{"id":["Sample3"],"metadata":{"BarcodeSequence":["CTCTCTACCTGT"],"LinkerPrimerSequence":["CATGCTGCCTCCCGTAGGAGT"],"BODY_SITE":["gut"],"Description":["human gut"]}},{"id":["Sample4"],"metadata":{"BarcodeSequence":["CTCTCGGCCTGT"],"LinkerPrimerSequence":["CATGCTGCCTCCCGTAGGAGT"],"BODY_SITE":["skin"],"Description":["human skin"]}},{"id":["Sample5"],"metadata":{"BarcodeSequence":["CTCTCTACCAAT"],"LinkerPrimerSequence":["CATGCTGCCTCCCGTAGGAGT"],"BODY_SITE":["skin"],"Description":["human skin"]}},{"id":["Sample6"],"metadata":{"BarcodeSequence":["CTAACTACCAAT"],"LinkerPrimerSequence":["CATGCTGCCTCCCGTAGGAGT"],"BODY_SITE":["skin"],"Description":["human skin"]}}],"matrix_type":["sparse"],"matrix_element_type":["int"],"shape":[5,6],"data":[[0,2,1],[1,0,5],[1,1,1],[1,3,2],[1,4,3],[1,5,1],[2,2,1],[2,3,4],[2,5,2],[3,0,2],[3,1,1],[3,2,1],[3,5,1],[4,1,1],[4,2,1]]}>

blankenberg commented 7 years ago

@joey711 @jnpaulson Any thoughts, comments, or a potential fix available? Is this project still alive?

jnpaulson commented 7 years ago

Definitely still alive. @joey711 any thoughts?

joey711 commented 7 years ago

Yes, we should fix this. While I'm pretty sure the R-write-read round trip would work fine, we do want the result to be considered valid by external utilities like those in the python library.

This should be pretty easy to fix by coercing the slot for each of those offending entries to be character vectors rather than character vectors encapsulated by an inner list. Something like the following should work for length-1 cases:

myBiomList$format <- as.character(myBiomList$format[[1]][1])

Of course we want to abstract this to a function that enforces this type coercion for us, so something like

biom_char_not_list = function(x){
  as.character(x[[1]])
}

and then the first example becomes

myBiomList$format <- biom_char_not_list(myBiomList$format)[1]

We can leave the length expectation inside or outside the function. Depends on whether we need to do this for any entries that can/should have length longer than 1.

Matrix and table entries tend to have type specs that we've probably already addressed, but we should double-check while we're at it.

The above solution should be implemented in a validation function that is called by write_biom, so that other functions can re-use this, and that type enforcement is consistent.

joey711 commented 7 years ago

I'd rather not find out way after the fact that minor format specs are causing output to fail read-validation in other utilities. I also don't want to re-implement all of the tests that are already written in the python library, so we should probably include in this package a testing script that uses the python read commands to test this packages' write output.

I'll link this issue on a new testing "feature" request

joey711 / biomformat-oldfork

write_biom output appears invalid due to improper types in format metadata #4