Closed abrown25 closed 4 years ago
Hm... when I test on my machine, it hangs. Doesn't even count one line.
Thanks very much for looking into this. If I replace the iopipe version with: dependency "iopipe" version="~>0.2.0" then it works:
Performing "debug" build using dmd for x86_64.
io 0.2.5: target for configuration "library" is up to date.
iopipe 0.1.7: target for configuration "library" is up to date.
hello ~master: building configuration "application"...
Linking...
To force a rebuild of up-to-date targets, run again with --force.
Running ./hello
1103800
which matches Python:
python -c "import gzip; print(sum(1 for x in gzip.open('../ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz', 'rt')))"
1103800
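For anyone who wants to repeat the cross-check on another file, the one-liner above can be unrolled into a small function. This is only a sketch of the same idea; the `count_lines` name is hypothetical and not part of any library mentioned in this thread.

```python
import gzip


def count_lines(path):
    """Count text lines in a gzip-compressed file.

    The file is opened in text mode ("rt") and streamed line by line,
    so the whole decompressed content is never held in memory.
    `count_lines` is an illustrative name, not an existing API.
    """
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)
```

Calling `count_lines` with the path to the `.vcf.gz` file should give the same number as the one-liner, which makes it a convenient reference value when checking the D program's output.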
I'm running this on Ubuntu 20.04 if that helps explain why I get a -9 exit code and it hangs for you. It also doesn't need to read the whole file to fail. I get the same outcome if I put a break in after reading 5 lines.
It's not reading the file properly. I'm still trying to figure out what's happening, but it never finds a line ending, so the system runs out of memory.
It's a problem in the line counter. It has nothing to do with zip, which seems to be reading the data just fine. I had to do a LOT of refactoring to get it to work for @safe.
You are going to laugh when you see the mistake I made (and it took me a while to figure out why my unittests didn't catch it). PR incoming.
I would never laugh at the man who's making it possible for me to do my job! Thank you, it works perfectly now. Would you be interested in a pull request for the examples folder? A simple example of iterating over a gzipped text file (name taken from the arguments) may be useful to people processing files line by line?
> making it possible for me to do my job!
Great to know this project is helping you! Let me know if there's anything else that you find.
I did note that gzcat | wc -l works much faster than your sample. But I think that's more to do with zlib than with iopipe (see #14).
> A simple example of iterating over a gzipped text file (name taken from the arguments) may be useful to people processing files line by line?
Hm... I wonder if just adding a parameter to the byline range example, to indicate that the input is a zipped stream, would be enough? It would also showcase how straightforward it is to construct pipes to handle many cases with one implementation.
Either way, I would be glad to merge something from you.
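To illustrate the "one implementation, one extra parameter" idea in the language already used for the cross-check earlier in this thread, here is a rough Python analogue. It is not iopipe code and does not reflect how the byline example is written; `open_text`, `iter_lines`, and the `gzipped` flag are hypothetical names chosen for this sketch.

```python
import gzip


def open_text(path, gzipped=False):
    """Return a text-mode file object for `path`.

    When `gzipped` is True the data is decompressed on the fly;
    otherwise the file is opened as plain text.
    """
    return gzip.open(path, "rt") if gzipped else open(path, "r")


def iter_lines(path, gzipped=False):
    """Yield each line without its trailing newline.

    The loop is identical for both input kinds; only the opener differs.
    """
    with open_text(path, gzipped) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

The point of the sketch is that the line-iteration code stays the same and only the first stage of the pipeline changes, which is the property the proposed example would demonstrate for iopipe.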
With this file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz the following code fails:
#!/usr/bin/env dub
/+ dub.sdl:
name "hello"
dependency "iopipe" version="~>0.2.0"
dependency "io" version="*"
+/
import std.stdio;
import std.typecons;
import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import std.io : File = File;

void main()
{
    auto counter = 0;

    foreach (line; File("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
            .refCounted.bufd.unzip(CompressionFormat.gzip).assumeText.byLineRange!false)
    {
        counter++;
    }

    writeln(counter);
}