joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
582 stars 187 forks source link

Renaming biom file samples imported from Qiime #640

Open bvecchip opened 8 years ago

bvecchip commented 8 years ago

I have installed the most recent versions of both qiime and phyloseq. I'm importing my .biom and mapping files without issue, however, during merge, I'm getting a sample name mismatch error. Upon review, my .biom sample names have been changes to "sa1" etc. I'm unsure as to why this is occurring. The sample names in the .biom and map input files are something like this: APC1234 APC5678 etc. Is there a way to correctly import the .biom sample names, or a work-around? Thanks so much!

bvecchip commented 8 years ago

Specifically, my biom file starts like this (after converting to txt)

OTU ID APC3481

386088 86.0 1073436 5.0 949789 74.0 1036749 187.0 etc.

And my phyloseq command is as follows: otutable <- import_biom(BIOMfilename = 'otu_table_mc2_w_tax_no_pynast_failures2.biom', treefilename = 'rep_set.tre', parseFunction = parse_taxonomy_default)

Here's the check of the sample name: > sample_names(otutable) [1] "sa1"

joey711 commented 8 years ago

I would need more information to know what's going on, especially how you generated the biom file more precisely. It most likely means that your biom file is missing the sample identifiers, for some reason. If they were present phyloseq would use them. The "sa1" convention is for only used when a data object is missing its sample identifiers.

Try the following:

biomformat::read_biom("otu_table_mc2_w_tax_no_pynast_failures2.biom")

and report back what it says. I'm guess it does not have sample names...

bvecchip commented 8 years ago

So after running the command, it does appear samples names are missing.

biomformat::read_biom("otu_table_mc2_w_tax_no_pynast_failures.biom") biom object. type: OTU table matrix_type: dense 89 rows and 1 columns Warning message: In strsplit(msg, "\n") : input string 1 is invalid in this locale

This is confusing to me, given that I can see the sample name in my header line after converting to txt. This particular biom file is from qiime's pick_open_reference script (using already demultiplexed PE reads), and sample IDs were included in the mapping file.

Mapping file: SampleID BarcodeSequnce LinkerPrimerSequence Description APC3481 Testing Mapping File

Generation of biom file: split_libraries_fastq.py -i fastqjoin.join.fastq -o ./APC3481 --barcode_type 'not-barcoded' -q 19 --sample_ids APC3481 -m map.txt pick_open_reference_otus.py -i seqs.fna -o ./out -a -O 6 -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -p params.txt

After conversion:

OTU

ID APC3481 386088 86.0 1073436 5.0 949789 74.0 1036749 187.0 etc.

If these are the first two columns of the biom file, how should the sample ID in the header be delineated?

joey711 commented 8 years ago

That's a good question for @wasade as he could probably answer that off the top of his head.

There are two possibilities:

The ambiguity lies in the fact that there are multiple ways to store information about samples in the biom-formats, but an API that handles these things needs to know the difference between a sample identifier (unique key) and other arbitrary data on the samples. I think this is solved exclusively by slot naming convention in biom-format, but I don't know for sure, hence I'm roping in @wasade O:-)

wasade commented 8 years ago

The format only allows for a single way to define sample IDs in BIOM 1.x and for a single way in BIOM 2.x as well, and the storage of the IDs is logically distinct from the arbitrary metadata.

@bvecchip, the most likely scenario is that this is a bug in the R API. What I recommend is running biom summarize-table and see what comes out as sample names.

joey711 commented 8 years ago

Of course you'd blame my API and not QIIME! O:-)

Other biom files import with Sample IDs just fine, btw...

bvecchip commented 8 years ago

That you both for the input. Running the sumarize-table command gives the following:

Num samples: 1 Num observations: 89 Total count: 7742 Table density (fraction of non-zero values): 1.000

Counts/sample summary: Min: 7742.0 Max: 7742.0 Median: 7742.000 Mean: 7742.000 Std. dev.: 0.000 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy

Counts/sample detail:

wasade commented 8 years ago

@joey711 heh :)

@bvecchip, looks like the ID isn't stored and is represented as an empty string, or something. Never seen output from biom summarize-table like that before. So next guess is a malformed pick_open_reference_otus.py call. Can you open an issue on the qiime-forum?

joey711 commented 8 years ago

BTW, @wasade I've actually received some other emails/posts related to this, I just didn't know the source of the problem. Thanks for clarifying.

@bvecchip , when it is resolved on the QIIME side, can you post back here?

I will close this for now because it's not an issue with phyloseq or the related R stack.

@wasade do you have any suggestions for workarounds that would be visible to other users running into this same problem?

Thanks!

wasade commented 8 years ago

I do not know what the problem is as I haven't encountered it before, so I don't know what to suggest other than open an issue on the qiime-forum. Looking back at the commands, it could be due to split_libraries_fastq.py not prefixing the sample ID on to the sequence records. A check of head on seqs.fna would answer that, but again, best to centralize this response on the qiime-forum.

@bvecchip, just to verify, that biom summarize-table output is complete? I'm unable to reproduce a biom table that produces similar output. The closest I can get is one which reports an empty sample ID, but still associates sequence counts from it.

$ biom summarize-table -i foo.biom
Num samples: 1
Num observations: 2
Total count: 1
Table density (fraction of non-zero values): 0.500

Counts/sample summary:
 Min: 1.0
 Max: 1.0
 Median: 1.000
 Mean: 1.000
 Std. dev.: 0.000
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
: 1.0
bvecchip commented 8 years ago

@wasade that is the complete output. Very strange right?

My sample names are appended onto the seqs in the fna file:

APC3481_0 M01109:106:000000000-AN891:1:1101:16341:1512 1:N:0:135 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 CCTACGGGAGGCTGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTCAGAGGTGAAAGCCTGGAGCTCAACTCCAGAACTGCCTTTGAGACTGCATCGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGTAGTC

Will place on qiime and report back.

I imagine a possibly work around would be to convert the biom to txt, somehow modify the sample ID (unsure of what needs to be done here), and then convert back to biom. Is this possible? I'm very new to biom files.

wasade commented 8 years ago

It's doable with biom convert --to-tsv

On Jul 15, 2016 10:34 AM, "bvecchip" notifications@github.com wrote:

@wasade https://github.com/wasade that is the complete output. Very strange right?

My sample names are appended onto the seqs in the fna file:

_APC34810 M01109:106:000000000-AN891:1:1101:16341:1512 1:N:0:135 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0

CCTACGGGAGGCTGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTCAGAGGTGAAAGCCTGGAGCTCAACTCCAGAACTGCCTTTGAGACTGCATCGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGTAGTC

Will place on qiime and report back.

I imagine a possibly work around would be to convert the biom to txt, somehow modify the sample ID (unsure of what needs to be done here), and then convert back to biom. Is this possible? I'm very new to biom files.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/640#issuecomment-233016622, or mute the thread https://github.com/notifications/unsubscribe-auth/AAc8siRq9e8N1BEOsyAvAgLqaOACUC-Bks5qV8SpgaJpZM4JMghs .

bvecchip commented 8 years ago

Yes, I've used this script, but cannot find the problem with the biom file format. The sample name appears in the header line (is this where they should be?):

OTU ID APC3481

386088 86.0 1073436 5.0 949789 74.0 1036749 187.0 etc.

bvecchip commented 8 years ago

Hi all, I've worked with qiime to get my output corrected (library needed updated, but no error thrown during biom generation). My output of summarize_table now shows a sample name:

Num samples: 1 Num observations: 267 Total count: 61196 Table density (fraction of non-zero values): 1.000 Counts/sample summary: Min: 61196.0 Max: 61196.0 Median: 61196.000 Mean: 61196.000 Std. dev.: 0.000 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy Counts/sample detail: APC3481: 61196.0

However, importing into phyloseq still doesn't incorporate sample name into otutable:

otutable <- import_biom(BIOMfilename = 'otu_table_mc2_w_tax_no_pynast_failures.biom', treefilename = 'rep_set.tre', parseFunction = parse_taxonomy_greengenes)

(warnings are just a couple of greengenes parsing fails / adding dummy taxa)

sample_names(otutable) [1] "sa1"

Can I send along my biom?

wasade commented 8 years ago

@joey711 I think this is outside of qiime/biom now as the sample is being reported by that machinery...

joey711 commented 8 years ago

let's not be so hasty. no evidence provided that sample names in the expected place in the biom file... the fact that they're missing after import implies they are not...

wasade commented 8 years ago

The format only allows for a single place that a sample ID can be defined. This is the original table from @bvecchip prior to the correction she made this morning:

(py35-biom)10:13:26 (mcdonadt@Daniels-MacBook-Pro):Downloads$ ipython

10:13:28 (mcdonadt@Daniels-MacBook-Pro):~/Downloads> import h5py

10:13:30 (mcdonadt@Daniels-MacBook-Pro):~/Downloads> f = h5py.File('otu_table_mc2_w_tax_no_pynast_failures.biom')

10:13:35 (mcdonadt@Daniels-MacBook-Pro):~/Downloads> f['sample/ids'][:]
Out[3]: array(['APL000003475'], dtype=object)
bvecchip commented 8 years ago

Hi @joey711 , Let me know if you would like a copy of the biom file in question. Given the evidence so far, I do think that phyloseq import is likely my problem. Hoping to resolve this so that I can get the most out of this large dataset using your software. I appreciate all your help, -Briana

joey711 commented 8 years ago

Please send the file and let me know what I should be expecting (number of samples, taxa, covariates...)

joey711 commented 8 years ago

Alright, I looked at the file. I don't think this is an issue with biom files generated by QIIME generally (because I'd hear more about it, and I've had it work plenty of times), but it does seem to be an issue when your data has only one sample. Which is a strange corner case for phyloseq, since many of the advantages of phyloseq (or analysis in general) are not available when N=1.

In any case, the workaround is pretty straightforward:

library(biomformat)
library(phyloseq)
bm1 = read_biom("otu_table_mc2_w_tax_no_pynast_failures.biom")
ps1 = import_biom(bm1, parseFunction = parse_taxonomy_greengenes)
sample_names(ps1) <- colnames(bm1)

You'll notice the last line is solving the phyloseq::import_biom failure to pass along the (one) sample name. Unclear to me why you need it, since there's only one, but it's helpful for me to know that this is happening. I should probably have the phyloseq-side take pass the index names more explicitly using the rownames and colnames methods that we've written into the biomformat package.

There's a little bit a of a rabbit-hole I'll have to dive down into to figure out where this is lost, but it is typical of the base "matrix" class in R, which extends vectors, a side effect being that they default to reverting to a simple vector when one of the indices has length of one.

I'll re-open for now to remind me to do something. The code above will solve your problem for now...

bvecchip commented 8 years ago

Thanks so much. I'll give this a go and also try n>1. I always test out my pipelines on n=1 before expanding to the entire cohort. I can't say I've ever seen an issue like this, but will definitely remember it in the future!

joey711 commented 8 years ago

maybe test on n=2 from now on. That way you are including how you're handling sample covariate data as well. This part is too often neglected in "pipelines"...

amulyashastry commented 7 years ago

I think this happened to me when the reference database was not mentioned while running pick_otus step.