Open bvecchip opened 8 years ago
Specifically, my biom file starts like this (after converting to txt)
OTU ID APC3481
386088 86.0 1073436 5.0 949789 74.0 1036749 187.0 etc.
And my phyloseq command is as follows:
otutable <- import_biom(BIOMfilename = 'otu_table_mc2_w_tax_no_pynast_failures2.biom', treefilename = 'rep_set.tre', parseFunction = parse_taxonomy_default)
Here's the check of the sample name:
> sample_names(otutable) [1] "sa1"
I would need more information to know what's going on, especially how you generated the biom file more precisely. It most likely means that your biom file is missing the sample identifiers, for some reason. If they were present phyloseq would use them. The "sa1"
convention is for only used when a data object is missing its sample identifiers.
Try the following:
biomformat::read_biom("otu_table_mc2_w_tax_no_pynast_failures2.biom")
and report back what it says. I'm guess it does not have sample names...
So after running the command, it does appear samples names are missing.
biomformat::read_biom("otu_table_mc2_w_tax_no_pynast_failures.biom") biom object. type: OTU table matrix_type: dense 89 rows and 1 columns Warning message: In strsplit(msg, "\n") : input string 1 is invalid in this locale
This is confusing to me, given that I can see the sample name in my header line after converting to txt. This particular biom file is from qiime's pick_open_reference script (using already demultiplexed PE reads), and sample IDs were included in the mapping file.
Mapping file: SampleID BarcodeSequnce LinkerPrimerSequence Description APC3481 Testing Mapping File
Generation of biom file:
split_libraries_fastq.py -i fastqjoin.join.fastq -o ./APC3481 --barcode_type 'not-barcoded' -q 19 --sample_ids APC3481 -m map.txt
pick_open_reference_otus.py -i seqs.fna -o ./out -a -O 6 -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -p params.txt
After conversion:
ID APC3481 386088 86.0 1073436 5.0 949789 74.0 1036749 187.0 etc.
If these are the first two columns of the biom file, how should the sample ID in the header be delineated?
That's a good question for @wasade as he could probably answer that off the top of his head.
There are two possibilities:
biomformat
package) is missing support for one of the formal ways in which sample names are supposed to be indicated, or The ambiguity lies in the fact that there are multiple ways to store information about samples in the biom-formats, but an API that handles these things needs to know the difference between a sample identifier (unique key) and other arbitrary data on the samples. I think this is solved exclusively by slot naming convention in biom-format, but I don't know for sure, hence I'm roping in @wasade O:-)
The format only allows for a single way to define sample IDs in BIOM 1.x and for a single way in BIOM 2.x as well, and the storage of the IDs is logically distinct from the arbitrary metadata.
@bvecchip, the most likely scenario is that this is a bug in the R API. What I recommend is running biom summarize-table
and see what comes out as sample names.
Of course you'd blame my API and not QIIME! O:-)
Other biom files import with Sample IDs just fine, btw...
That you both for the input. Running the sumarize-table command gives the following:
Num samples: 1 Num observations: 89 Total count: 7742 Table density (fraction of non-zero values): 1.000
Counts/sample summary: Min: 7742.0 Max: 7742.0 Median: 7742.000 Mean: 7742.000 Std. dev.: 0.000 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy
Counts/sample detail:
@joey711 heh :)
@bvecchip, looks like the ID isn't stored and is represented as an empty string, or something. Never seen output from biom summarize-table
like that before. So next guess is a malformed pick_open_reference_otus.py
call. Can you open an issue on the qiime-forum?
BTW, @wasade I've actually received some other emails/posts related to this, I just didn't know the source of the problem. Thanks for clarifying.
@bvecchip , when it is resolved on the QIIME side, can you post back here?
I will close this for now because it's not an issue with phyloseq or the related R stack.
@wasade do you have any suggestions for workarounds that would be visible to other users running into this same problem?
Thanks!
I do not know what the problem is as I haven't encountered it before, so I don't know what to suggest other than open an issue on the qiime-forum. Looking back at the commands, it could be due to split_libraries_fastq.py
not prefixing the sample ID on to the sequence records. A check of head
on seqs.fna
would answer that, but again, best to centralize this response on the qiime-forum.
@bvecchip, just to verify, that biom summarize-table
output is complete? I'm unable to reproduce a biom table that produces similar output. The closest I can get is one which reports an empty sample ID, but still associates sequence counts from it.
$ biom summarize-table -i foo.biom
Num samples: 1
Num observations: 2
Total count: 1
Table density (fraction of non-zero values): 0.500
Counts/sample summary:
Min: 1.0
Max: 1.0
Median: 1.000
Mean: 1.000
Std. dev.: 0.000
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Counts/sample detail:
: 1.0
@wasade that is the complete output. Very strange right?
My sample names are appended onto the seqs in the fna file:
APC3481_0 M01109:106:000000000-AN891:1:1101:16341:1512 1:N:0:135 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 CCTACGGGAGGCTGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTCAGAGGTGAAAGCCTGGAGCTCAACTCCAGAACTGCCTTTGAGACTGCATCGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGTAGTC
Will place on qiime and report back.
I imagine a possibly work around would be to convert the biom to txt, somehow modify the sample ID (unsure of what needs to be done here), and then convert back to biom. Is this possible? I'm very new to biom files.
It's doable with biom convert --to-tsv
On Jul 15, 2016 10:34 AM, "bvecchip" notifications@github.com wrote:
@wasade https://github.com/wasade that is the complete output. Very strange right?
My sample names are appended onto the seqs in the fna file:
_APC34810 M01109:106:000000000-AN891:1:1101:16341:1512 1:N:0:135 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
CCTACGGGAGGCTGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTCAGAGGTGAAAGCCTGGAGCTCAACTCCAGAACTGCCTTTGAGACTGCATCGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGTAGTC
Will place on qiime and report back.
I imagine a possibly work around would be to convert the biom to txt, somehow modify the sample ID (unsure of what needs to be done here), and then convert back to biom. Is this possible? I'm very new to biom files.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/640#issuecomment-233016622, or mute the thread https://github.com/notifications/unsubscribe-auth/AAc8siRq9e8N1BEOsyAvAgLqaOACUC-Bks5qV8SpgaJpZM4JMghs .
Yes, I've used this script, but cannot find the problem with the biom file format. The sample name appears in the header line (is this where they should be?):
OTU ID APC3481
386088 86.0 1073436 5.0 949789 74.0 1036749 187.0 etc.
Hi all, I've worked with qiime to get my output corrected (library needed updated, but no error thrown during biom generation). My output of summarize_table now shows a sample name:
Num samples: 1 Num observations: 267 Total count: 61196 Table density (fraction of non-zero values): 1.000 Counts/sample summary: Min: 61196.0 Max: 61196.0 Median: 61196.000 Mean: 61196.000 Std. dev.: 0.000 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy Counts/sample detail: APC3481: 61196.0
However, importing into phyloseq still doesn't incorporate sample name into otutable:
otutable <- import_biom(BIOMfilename = 'otu_table_mc2_w_tax_no_pynast_failures.biom', treefilename = 'rep_set.tre', parseFunction = parse_taxonomy_greengenes)
(warnings are just a couple of greengenes parsing fails / adding dummy taxa)
sample_names(otutable) [1] "sa1"
Can I send along my biom?
@joey711 I think this is outside of qiime/biom now as the sample is being reported by that machinery...
let's not be so hasty. no evidence provided that sample names in the expected place in the biom file... the fact that they're missing after import implies they are not...
The format only allows for a single place that a sample ID can be defined. This is the original table from @bvecchip prior to the correction she made this morning:
(py35-biom)10:13:26 (mcdonadt@Daniels-MacBook-Pro):Downloads$ ipython
10:13:28 (mcdonadt@Daniels-MacBook-Pro):~/Downloads> import h5py
10:13:30 (mcdonadt@Daniels-MacBook-Pro):~/Downloads> f = h5py.File('otu_table_mc2_w_tax_no_pynast_failures.biom')
10:13:35 (mcdonadt@Daniels-MacBook-Pro):~/Downloads> f['sample/ids'][:]
Out[3]: array(['APL000003475'], dtype=object)
Hi @joey711 , Let me know if you would like a copy of the biom file in question. Given the evidence so far, I do think that phyloseq import is likely my problem. Hoping to resolve this so that I can get the most out of this large dataset using your software. I appreciate all your help, -Briana
Please send the file and let me know what I should be expecting (number of samples, taxa, covariates...)
Alright, I looked at the file. I don't think this is an issue with biom files generated by QIIME generally (because I'd hear more about it, and I've had it work plenty of times), but it does seem to be an issue when your data has only one sample. Which is a strange corner case for phyloseq, since many of the advantages of phyloseq (or analysis in general) are not available when N=1
.
In any case, the workaround is pretty straightforward:
library(biomformat)
library(phyloseq)
bm1 = read_biom("otu_table_mc2_w_tax_no_pynast_failures.biom")
ps1 = import_biom(bm1, parseFunction = parse_taxonomy_greengenes)
sample_names(ps1) <- colnames(bm1)
You'll notice the last line is solving the phyloseq::import_biom
failure to pass along the (one) sample name. Unclear to me why you need it, since there's only one, but it's helpful for me to know that this is happening. I should probably have the phyloseq-side take pass the index names more explicitly using the rownames
and colnames
methods that we've written into the biomformat package.
There's a little bit a of a rabbit-hole I'll have to dive down into to figure out where this is lost, but it is typical of the base "matrix" class in R, which extends vectors, a side effect being that they default to reverting to a simple vector when one of the indices has length of one.
I'll re-open for now to remind me to do something. The code above will solve your problem for now...
Thanks so much. I'll give this a go and also try n>1. I always test out my pipelines on n=1 before expanding to the entire cohort. I can't say I've ever seen an issue like this, but will definitely remember it in the future!
maybe test on n=2
from now on. That way you are including how you're handling sample covariate data as well. This part is too often neglected in "pipelines"...
I think this happened to me when the reference database was not mentioned while running pick_otus step.
I have installed the most recent versions of both qiime and phyloseq. I'm importing my .biom and mapping files without issue, however, during merge, I'm getting a sample name mismatch error. Upon review, my .biom sample names have been changes to "sa1" etc. I'm unsure as to why this is occurring. The sample names in the .biom and map input files are something like this: APC1234 APC5678 etc. Is there a way to correctly import the .biom sample names, or a work-around? Thanks so much!