read compressed alignment files?

matsen / pplacer

Phylogenetic placement and downstream analysis

http://matsen.fredhutch.org/pplacer/

GNU General Public License v3.0

read compressed alignment files? #323

Closed nhoffman closed 10 years ago

nhoffman commented 10 years ago

A low-priority request/question: can pplacer and friends be made to read compressed alignment files (e.g., .gz or .bz2) natively? This could save a lot of disk space for work in progress.

matsen commented 10 years ago

Would Zip suit? I know it's not hip, even a little bit, but we are already using it.

https://github.com/matsen/pplacer/blob/dev/pplacer_src/refpkg_parse.ml

cmccoy commented 10 years ago

camlzip supports zip and gzip; we could use that without adding dependencies.

matsen commented 10 years ago

Ah, nice. I'll have a go on this some afternoon.

nhoffman commented 10 years ago

gzip would probably be preferable for single files if available

matsen commented 10 years ago

Camlzip knows how to read bytes, characters, and sets thereof. When we read in sequence files (e.g., FASTA), they get read line by line and tokenized (see ppatteries for the definition of gen_parsers). It's important for the tokenize functions that input arrives a line at a time. We could read the compressed file a chunk at a time (say, 80 chars), look for newlines in each chunk, and emit an Enum of lines as the newlines appear. Does that seem reasonable? Will it be efficient enough?
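To make the chunking idea concrete, here is a minimal sketch in plain OCaml, not pplacer code. It assumes a decompressor exposes a function of the shape `read buf pos len` that returns the number of bytes written and 0 at end of input (camlzip's `Gzip.input`, partially applied to a channel, should fit that shape). The names `lines_of_reader` and `reader_of_string` are hypothetical, and the result is a pull function rather than a Batteries Enum.

```ocaml
(* Turn a chunk-oriented reader into a pull-style source of lines.
   [read buf pos len] must return the number of bytes written, 0 at end of input. *)
let lines_of_reader ?(chunk_size = 80) read =
  let chunk = Bytes.create chunk_size in
  let pending = Buffer.create chunk_size in (* partial line carried across chunks *)
  let ready = Queue.create () in            (* complete lines not yet handed out *)
  let eof = ref false in
  let refill () =
    let n = read chunk 0 chunk_size in
    if n = 0 then begin
      eof := true;
      (* flush a final line that has no trailing newline *)
      if Buffer.length pending > 0 then begin
        Queue.add (Buffer.contents pending) ready;
        Buffer.clear pending
      end
    end else
      for i = 0 to n - 1 do
        match Bytes.get chunk i with
        | '\n' ->
          Queue.add (Buffer.contents pending) ready;
          Buffer.clear pending
        | c -> Buffer.add_char pending c
      done
  in
  let rec next () =
    if not (Queue.is_empty ready) then Some (Queue.pop ready)
    else if !eof then None
    else (refill (); next ())
  in
  next

(* Hypothetical stand-in for a decompressor: serves a string chunk by chunk. *)
let reader_of_string s =
  let pos = ref 0 in
  fun buf off len ->
    let n = min len (String.length s - !pos) in
    Bytes.blit_string s !pos buf off n;
    pos := !pos + n;
    n
```

Calling `lines_of_reader (reader_of_string text)` gives a function that returns `Some line` until the input is exhausted and `None` afterwards. Lines are only cut at newlines, so a line that spans several chunks is still delivered whole to the tokenizer.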

cmccoy commented 10 years ago

Maybe we could hook into the Batteries I/O interface? IO.create_in (for camlzip in_channel) combined with IO.lines_of would give an Enum of strings.
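A rough sketch of what that hookup might look like, under stated assumptions: it uses `BatIO.create_in` and `BatIO.lines_of` from Batteries and `Gzip.open_in`, `Gzip.input_char`, `Gzip.input` and `Gzip.close_in` from camlzip, with a recent Batteries where the buffer type is `Bytes.t` (older releases used `string`). The helper names `gzip_input` and `gzip_lines` are hypothetical and this is not necessarily how pplacer ended up doing it.

```ocaml
(* Requires the batteries and camlzip libraries. *)

(* Wrap a gzip channel as a Batteries input.
   camlzip signals end of file with End_of_file (input_char) or a 0 return
   (input); Batteries expects No_more_input, so both are translated. *)
let gzip_input (path : string) : BatIO.input =
  let ic = Gzip.open_in path in
  BatIO.create_in
    ~read:(fun () ->
      try Gzip.input_char ic with End_of_file -> raise BatIO.No_more_input)
    ~input:(fun buf pos len ->
      match Gzip.input ic buf pos len with
      | 0 when len > 0 -> raise BatIO.No_more_input
      | n -> n)
    ~close:(fun () -> Gzip.close_in ic)

(* An Enum of decompressed lines, ready for a line-based tokenizer. *)
let gzip_lines (path : string) : string BatEnum.t =
  BatIO.lines_of (gzip_input path)
```

The appeal of this route is that the existing line-oriented parsers would not need to know whether the bytes came from a plain file or a compressed one.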

matsen commented 10 years ago

Excellent!

cmccoy commented 10 years ago

This is working for me on the microbiome demo - sequence files ending in .gz get decompressed with gzip.
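For readers following along, suffix-based dispatch of this kind can be sketched with the standard library alone. The helper names below are hypothetical and are not taken from the commits; the point is only that the `.gz` suffix decides whether to decompress, and stripping it recovers the name used to pick the underlying format.

```ocaml
(* True when the file name ends in ".gz". *)
let is_gzipped name = Filename.check_suffix name ".gz"

(* Drop a trailing ".gz" so the remaining extension (e.g. ".fasta")
   can still be used to choose a parser. *)
let strip_gz name =
  if is_gzipped name then Filename.chop_suffix name ".gz" else name
```

For example, `strip_gz "seqs.fasta.gz"` evaluates to `"seqs.fasta"`, while a name without the suffix is returned unchanged.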

I added .jplace.gz compression support while I was there. Doing so required changing the JSON parser from acting on raw input channels to a wrapped Batteries IO.input (see b4e10c4e46b0e119aeea0d2968b82566f1246cb2) or IO.output (see d4701a07f981061ca6edaf7f431fb627e510418e). I'm hoping that doesn't incur any serious performance overhead. Happy to test more or drop those parts.

matsen commented 10 years ago

Whiplash. Nice work, and glad to see tests.