joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

importing BIOM file #443

Closed alejorojas2 closed 7 years ago

alejorojas2 commented 9 years ago

Hi Joey711,

I have been trying to import a biom file that I generated with qiime, the file is binary file, it seems that is new format for this kind of files, then when trying to work with it in R I get this error:

This is my code so far:

#Importing with BIOM function
biom_file <- "./COI_seed_otu_table.biom"
tree_file <- "./COI_fasttree_usearch.tre"

COI_data <- import_biom(biom_file)

COI_data <- import_biom(biom_file) Error in nchar(content) : invalid multibyte string 1

#importing with qiime function
COI_data <- import_qiime(biom_file)

COI_data <- import_qiime(BIOMfilename = biom_file) Error in arglist[sapply(arglist, is.component.class)] : invalid subscript type 'list'

Maybe I am missing something or should I use an older version of BIOM file?

joey711 commented 9 years ago

Binary file? That sounds like the hdf5-based "version 2" biom format. The original biom format was not binary, a JSON based format. The original biom format is what is currently supported in phyloseq.

The biom-format team told me about the update probably 1 year ago. You are the first to post to phyloseq issue tracker about it. I've been discussing with @nosson (Joe Paulson) about adding biom-v2 support in the biom package on CRAN.

The more this becomes an issue, the faster one of us will support it. I personally have not encountered a need for HDF5-based storage of microbiome count data.

If you have some way of storing your OTU table in a biom v1 format, legacy QIIME format, or even just a tab-delimited table or any format that can be read by R, then you will be able to use phyloseq.

Hope that helps. I'll leave this issue open for now until there are further (and more helpful) comments about workarounds and/or actual biom-v2 support.

Cheers

joey

pljames commented 9 years ago

I had to convert my biom format using "convert biom" which is distributed with qiime, perhaps this is the issue? Maybe try:

biom convert -i otu_table.biom -o otu_table_json.biom --table-type="OTU table" --to-json

and see if that works?

All the best,

Phill

anmwinter commented 9 years ago

In QIIME 1.9 the biom files are all in the new format and are not read in correctly into phyloseq. The only work around I have going at the moment is an install of QIIME 1.8 to make the biom files.

jnpaulson commented 9 years ago

Hi Ara,

If you'd like, can you go ahead and test this function: https://gist.github.com/jnpaulson/c9cdcb8cf95e7daae468

This will import biom vs. 2.0 files. If you find it works well I'm sure Joey will be able to roll it into the Biom package after a few unit tests are added (which shouldn't take us too long).

On Sun, Mar 15, 2015 at 9:55 PM, Ara notifications@github.com wrote:

In QIIME 1.9 the biom files are all in the new format and are not read in correctly into phyloseq. The only work around I have going at the moment is an install of QIIME 1.8 to make the biom files.

— Reply to this email directly or view it on GitHub https://github.com/joey711/phyloseq/issues/443#issuecomment-81343556.

alejorojas2 commented 9 years ago

Joey and Phil,

I did try the command that Phil suggested converting the BIOM file to json format, it worked, now I am able to import my data to phyloseq without any issues. But I looked forward to the function to read hdf5 files, this can be helpful since BIOM files are moving into that direction.

Thanks!

Alejandro

anmwinter commented 9 years ago

@nosson Thanks! I will give that a try.

jnpaulson commented 9 years ago

@alejorojas2 @bioinfonm If you guys are willing to be testers for the gist I posted above I know Joey will import it into biom package. I just need to write the documentation. Is there a desire/need to "write" hdf5 biom2 files?

anmwinter commented 9 years ago

@nosson Sounds good. Phyloseq is our main tech for processing and understanding our microbial data. So anything I can do to help out I am game for.

I'll give it a try here around lunch!

anmwinter commented 9 years ago

@nosson Does the biom file need to be a rich format? I dumped out a biom file with the default settings in QIIME 1.9

Aras-MacBook-Air-2:qiime_1_9_test arakooser$ make_otu_table.py -i seqs_otus.txt -t seqs_rep_set_tax_assignments.txt -o de_novo_sumaclust.biom

and the got this error in phyloseq using the biom 2.x function

> biom_file = "de_novo_sumaclust.biom"    ###This is your .biom file
> map_file = "LavaBedsCombinedMapping.txt"      ###This is your mapping file with all the metadata
> y = read_hdf5_biom(biom_file)
HDF5-DIAG: Error detected in HDF5 (1.8.7) thread 0:
  #000: H5F.c line 1522 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: H5F.c line 1313 in H5F_open(): unable to read superblock
    major: File accessability
    minor: Read failed
  #002: H5Fsuper.c line 334 in H5F_super_read(): unable to find file signature
    major: File accessability
    minor: Not an HDF5 file
  #003: H5Fsuper.c line 155 in H5F_locate_signature(): unable to find a valid file signature
    major: Low-level I/O
    minor: Unable to initialize object
HDF5: unable to open file
Error in h5checktypeOrOpenLoc(file, readonly = TRUE) : 
  Error in h5checktypeOrOpenLoc(). File 'de_novo_sumaclust.biom' is not a valid HDF5 file.
Called from: h5read(file_input, "/", read.attributes = TRUE)
Browse[1]> Q
jnpaulson commented 9 years ago

@bioinfonm Mhmm, just for clarification - this is not a phyloseq error. We're using the R packages biom rhdf5 to import this file. Converting it into a biom-class object (as Joey defined in the biom package) then lets us easily convert that to an object that can be parsed by phyloseq or metagenomeSeq.

Do you think you could send me a MRE (minimally reproducible example?). The ones I have from biom-format.org worked, but having a simple real example like yours would help with debugging. I don't personally have Qiime wrappers installed so I can't create my own without lots of effort. If you can get something to me my email is jpaulson@umiacs.umd.edu.

anmwinter commented 9 years ago

@nosson Will do! I'll send the whole file over.

Ara

alejorojas2 commented 9 years ago

@nosson I am traveling right know but as soon as I can test it, I will let you know. About the need for hdf5, for me it was just by chance since it is the first time that used qiime for my analysis and I think it is the default behavior now.

anmwinter commented 9 years ago

So a similar problem cropped up on the qiime forum here: https://github.com/biocore/qiime/issues/1928

Looks like something goofy with the way QIIME is handling the hdf5 format.

The solution at the moment that works is: biom convert -i de_novo_sumaclust.biom -o fixed.biom --table-type="OTU table" --to-hdf5

alejorojas2 commented 9 years ago

@nosson I did try your function and it worked for me, I had to install the library for BIOM and rhdf5 to manage to run your function. The biom 0.3.12 and rhdf5 2.10.0 were the versions that installed on my R.

The I ran the the function and run your command:

y = read_hdf5_biom("table_even7620.biom")
biom(y)

then I get this output, that actually makes sense based on my data:

biom object. 
type: OTU table 
matrix_type: dense 
714 rows and 36 columns 

So, If this is a biom object, the next step will be importing the file using import_biom(BIOMfilename = y). update: I assign y1 <- biom(y) to save it as biom object, but I still get the same message:

> import_biom(BIOMfilename = y1)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘fromJSON’ for signature ‘"biom", "missing"’

So it seems that your functions creates a list instead of a biom file, or maybe I am missing something?

Thanks!

Alejandro

jnpaulson commented 9 years ago

@alejorojas2 Awesome! Glad to hear it works for you - it seems like some of the default outputs by Qiime are not 'true' HDF5 files (@bioinfonm's was not).

Glad to hear it worked for you! This is not an issue as Phyloseq uses biom class objects natively in R. Essentially instead of calling import_biom you would run a truncated version of that function: https://github.com/joey711/phyloseq/blob/master/R/IO-methods.R#L1777-L1849

  1. As an example, you can use this function to load the file into phyloseq (import_biom2(y1)) https://gist.github.com/jnpaulson/324ac1fa3eab1bc7f845
  2. @joey711 can easily rewrite the import_biom function to
import_biom <- function(BIOMfilename, 
    treefilename=NULL, refseqfilename=NULL, refseqFunction=readDNAStringSet, refseqArgs=NULL,
    parseFunction=parse_taxonomy_default, parallel=FALSE, version=1.0, ...){

    # initialize the argument-list for phyloseq. Start empty.
    argumentlist <- list()

    # Read the data
    if(class(x)=="biom"){ # or check if string first
        x = BIOMfilename
    } else {
        x = read_biom(biom_file=BIOMfilename)
    }
  1. A longer, worse way, would be to write the old biom file format write_biom(y1,"file/path/filename.biom") then import it using import_biom("/file/path/filename.biom").
  2. Use metagenomeSeq's biom2MRexperiment then use @joey711's MRexperiment2phyloseq function :)
LBragg commented 9 years ago

Hi there,

I am following this thread with interest as I am also using the latest QIIME. The read_hdf5_biom function provided worked, but I was getting the same error as CarlyRae did (see comments here https://gist.github.com/nosson/324ac1fa3eab1bc7f845) for the import_biom2 function. The error suggests a list is not the expected input, as it is expecting the biom-class (subclass of List) as input, I modified the code to convert the list to a biom object... and it seems to work. But I just wanted to check with you guys if this was the Right Thing to Do :)

import_biom2 <- function(x, 
                     treefilename=NULL, refseqfilename=NULL, refseqFunction=readDNAStringSet, refseqArgs=NULL,
                     parseFunction=parse_taxonomy_default, parallel=FALSE, version=1.0, ...){

  # initialize the argument-list for phyloseq. Start empty.
  argumentlist <- list()

 x = biom(x) # added this line

EDIT: I just realised I haven't got the latest bioconductor installed. I'll update everything and get back to you if there is still a problem.

LBragg commented 9 years ago

I upgraded to R 3.2 (Bioconductor version 3.1 (BiocInstaller 1.18.2)), installed the phyloseq library (‘1.13.2’), and still needed to add that line (x = biom(x)) to the import_biom2 function to get it to work. But I think there are still issues, as the taxonomy parsing isn't working correctly (I've tried default and the parse_taxonomy_qiime options too). I thought this all might work if I could install the dev version of phyloseq, but I can't seem to get this to work.

> library("devtools")
> install_github("phyloseq", "joey711")
Downloading github repo joey711/phyloseq@master
Installing phyloseq
Skipping 4 packages not available: BiocGenerics, Biostrings, DESeq2, multtest
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  INSTALL '/tmp/Rtmpm7oa1a/devtools2137cf4daa7/joey711-phyloseq-dd34ad0'  \
  --library='/home/bender/R/x86_64-pc-linux-gnu-library/3.2' --install-tests 
* installing *source* package ‘phyloseq’ ...
** R
** data
** inst
** tests
** preparing package for lazy loading
Note: the specification for S3 class “AsIs” in package ‘RJSONIO’ seems equivalent to one from package ‘BiocGenerics’: not turning on duplicate class definitions for this class.
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Note: the specification for S3 class “AsIs” in package ‘RJSONIO’ seems equivalent to one from package ‘BiocGenerics’: not turning on duplicate class definitions for this class.
* DONE (phyloseq)
Reloading installed phyloseq
Warning message:
Username parameter is deprecated. Please use joey711/phyloseq 
> library(phyloseq)
> packageVersion("phyloseq")
[1] ‘1.13.2’
jnpaulson commented 9 years ago

Hi @LBragg,

Thanks for the research so far into the problem and apologies for the delay!

If you could email me a MRE (minimally reproducible example) that you're working with, that'll really help me debug and fix up the code so that we can get biom-class 2.1 objects imported into phyloseq and metagenomeSeq and functions available in @joey711's biom package.

jnpaulson commented 9 years ago

@LBragg Thanks for the MRE - here's a link to the now updated import_biom2.R script https://gist.github.com/nosson/324ac1fa3eab1bc7f845 that should fix everything (sorry for the delay, been in Paris for the last week).

jnpaulson commented 9 years ago

@joey711 If you're happy with this, let me know and I can make import_biom flexible and modify import_biom + read_biom and do a pull request when I get stateside in the next two weeks.

joey711 commented 9 years ago

Hey guys, all LGTM! Make sure to think of some unit tests for some of the special cases. I'm happy to push out any/all improvements to get this functionality out there. Feel free to ping me repeatedly if there's a pending PR I haven't noticed.

LBragg commented 9 years ago

The fixes work for me, the biom produced by qiime 1.9 can be imported. As an aside, I noticed when I passed a parameter (I was trying to pass the mapping file) to import_biom2() that wasn't expected, it gets passed to the read_tree function, causing the tree to not be read... It's a trivial thing, but thought I should mention it while you're updating the function =)

Warning message:
In import_biom2(hd5_biom, mapfilename = mapping_file, treefilename = treefilename) :
  treefilename failed import. It not included.
joey711 commented 9 years ago

@LBragg

The warning you showed seems like expected/desired behavior if the treefilename argument is a missing/incorrect file. Was that the case? or were you expecting that tree-file to work?

Btw everyone, @nosson et al, I was imagining we add a new argument to import_biom regarding the biom format version (it might already have one, just currently ignored), and/or we autodetect it. Seems straightforward enough we shouldn't need a completely different function, and might help avoid mistakes (especially if we can do the autodetect).

Thoughts?

joey

LBragg commented 9 years ago

Sounds good to me. As for that issue I was describing, the treefilename argument is the correct tree file, however the tree fails to load if I supply additional, unexpected arguments to the import_biom2() function. I guess they are passed using the ... argument list. Maybe it's more that the read_tree function silently returns NULL if any errors occur, such as the passing of incorrect arguments. No biggie.

jnpaulson commented 9 years ago

Boom! I just made a large number of changes to the BIOM package that should be ready for pulling in and will seamlessly fix the import_biom issue.

Check out: https://github.com/joey711/biom/pull/11

dadahan commented 9 years ago

Hi all,

Sorry to be picking up on an earlier part of this thread so late in the game but I am just now running into my issue.

I am trying to import my biom file, converted from hdf5 to json using Phil's earlier suggestion with the biom convert in QIIME, and receiving the following error:

jsonbiom = system.file("extdata", "otu_table_json.biom", package="phyloseq") treefile = system.file("extdata","rep_set.tre", package="phyloseq") refseq = system.file("extdata","rep_set.fna",package="phyloseq") import_biom(jsonbiom,treefile,refseq, parseFunction = parse_taxonomy_qiime) Error in fromJSON(content, handler, default.size, depth, allowComments, : invalid JSON input

I've tried using a txt file as well, and it just wont seem to work.

Any help is appreciated, Dylan

joey711 commented 9 years ago

that's not a valid file path. the system.file commands are shown in tutorials because they are OS-independent, and the data is within the package itself.

Your file path for your data should just be a standard relative or absolute Unix path string.

jsonbiomfile = "path/to/my/data/file.biom"
import_biom(jsonbiomfile)
joey711 commented 9 years ago

By the way, the early version of the replacement that will solve this HDF5 / Biom-V2 issue is now a separate repository. Plan is to post it on Bioconductor:

https://github.com/joey711/biomformat

The release version of phyloseq package cannot officially support it as an internal dependency until it is released on BioC. The devel version can do so, as soon as the first version of "biomformat" is on BioC.

I will close this issue once the solutions are implemented enough in phyloseq and its doc.

dadahan commented 9 years ago

Hi Joey,

My apologies for not catching this was a tutorial specific command. Your suggestion worked wonderfully. Thanks for the HDF5 info as well.

D

mafed commented 9 years ago

Hi all, about importing a biome file in R: when read_biom reads a biom/HDF5, abundance data are stored in R as numerics instead of integers, that seems a huge waste of memory IMHO. With American Gut Project's I went from 9.2GB to 480MB just by casting to integers.

jnpaulson commented 9 years ago

@mafed interesting - can you perhaps post this in the biomformat package: https://github.com/joey711/biomformat/issues and i'll go ahead and fix this

mafed commented 9 years ago

@jnpaulson sure, I'll do it immediately, thanks

gputzel commented 8 years ago

The documentation for the import_biom function has the following description:

New versions of QIIME produce a more-comprehensive and formally-defined JSON file format, called biom file format

This is not correct, is it? Shouldn't it say "Old versions" instead of "New versions"?

jmicrobe commented 7 years ago

Has the hdf5 biom issue been fixed? I still get the error "Error in nchar(content) : invalid multibyte string 1" when trying to import a biom generated with QIIME 1.9.1. The documentation here seems to imply that after April this problem was fixed. I've checked that my version of phyloseq is up to date.

joey711 commented 7 years ago

This issue should be fixed for some time now. The dependency in phyloseq was migrated in previous release version to point to "biomformat" on Bioconductor, which supports both JSON and HDF5 versions of the format, rather than "biom" on CRAN, which should be considered deprecated. All functionality from "biom" package on CRAN is now subsumed within "biomformat" on Bioconductor.

I don't know what QIIME did or did not do to fix the issue, but AFAIK biomformat is reading these formats just fine.

I will close for now, but feel free @jmicrobe to post more details of your problem if it turns out it is not solved by biomformat/phyloseq.

@jnpaulson feel free to re-open if you think there's something still TBD.

CarlyMuletzWolz commented 6 years ago

Ok, I solved my own issue. But putting comments here in case someone does this too!

Reproducible result:

data("GlobalPatterns")

MGS <- phyloseq_to_metagenomeSeq(GlobalPatterns)

MGS

p <- cumNormStatFast(MGS) MGS <- cumNorm(MGS, p =p)

b <- MRexperiment2biom(MGS, norm = T)

write_biom(b, biom_file = "test.biom")

Error message

biom1 <- import_biom2("test.biom")

You have to read in the biom file first for import biom to work, needs an object not a file

biom2 <-read_biom("test.biom")

biom3 <- import_biom2(biom2)

vanesotjim commented 6 years ago

Hi everyone!

I have my otu_table.biom constructed by Qiime 1.9 and my metadata file and I tried to import in Phyloseq and until now I have not couldn't; I changed my otu_table.biom to json with this command

biom convert -i otu_table.biom -o otu_table_json.biom --table-type="OTU table" --to-json

However, with this file, I have not couldn't, please if you have any advice, I'm new in bioinformatics :)

Thanks in advance

vanesotjim commented 6 years ago

Sorry I forgot to wrote script that I used, it was:

jsonbiomfile = "C/Users/metagenomic/Desktop/otu_table_json.biom" jsonbiomfile = "C/Users/metagenomic/Desktop/metadata_ITS.txt"

biom_file <- paste(jsonbiomfile, "otu_table_json.biom", sep = "") map_file <- paste(jsonbiomfile, "metadata_ITS.txt", sep = "")

read_biom(biom_file)

Is there something wrong??

CarlyMuletzWolz commented 6 years ago

This is what works for me from QIIME 1.9 .json files

BIOM FILE

biom <- import_biom("milk_final.biom", parseFunction = parse_taxonomy_greengenes) warnings()

rank_names(biom)

get error message, but looks fine, adds a rank1 at the end for whatever reason. You can turn off the parse_taxonomy_greengenes maybe...

MAPPING FILE

map <- import_qiime_sample_data("map_milk_final_alpha.txt")

TREE FILE

the tree must be rooted. You can mid-point root it if you want see: http://www.phytools.org/static.help/midpoint.root.html

My tree was rooted with an outgroup in other R code

tree = read.tree("milk_final_rooted.tre")

GM <- merge_phyloseq(biom, tree, map)

When you merge all these files, the OTUs that are in the biom file, but not in the tree file are removed, but these OTUs are still 'linked' to the phyloseq object. It is best to filter them in case they cause some issue. What happens is when you do the alignment in QIIME some of the OTUs don't align (they are often classified as bacteria) and thus don't show up in the tree. It is a good filtering strategy.

GM_data = filter_taxa(GM, function(x) sum(x) != 0, TRUE)

sum(taxa_sums(GM_data) == 0)

ntaxa(GM_data) nsamples(GM_data)

sort(sample_sums(GM_data))

difference in sequencing depth

max(sample_sums(GM_data))/min(sample_sums(GM_data))

Hope that works for you

vanesotjim commented 6 years ago

Thank you Carly, for your answer!!!

I tested this

biom.json

jsonbiomfile = "C:/Users/metagenomic/Desktop/otu_table_json.biom" read_biom(jsonbiomfile) x = read_biom(jsonbiomfile) x2 <- import_biom(x) head(otu_table(x2)) head(tax_table(x2))

Metadata import

metadata = "C:/Users/metagenomic/Desktop/metadata_ITS.txt" read.csv(metadata) t <- import_qiime_sample_data(metadata)

merge two phyloseq objects

x2_merged <- merge_phyloseq(x2, t) colnames(tax_table(x2_merged)) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species") sample_data(x2_merged)

And it worked!!!

Thnaks