Open jma7 opened 7 years ago
That was exactly my suspicion: I knew that part could be improved, given its extensive use of lambda functions and the like. I was actually surprised that you said other VCF-reading tools are comparable to or worse than vtools. Let us switch, while keeping the old code for non-VCF input files.
The other VCF-reading tools are indeed slow because they pack each line into an object first. The scikit-allel implementation is different and faster, so we can use it as a reference and adapt it to work with variantTools.
Not sure about multi-processing though. Perhaps there is no need for MP for reading if we use this tool.
The author mentioned:
EXPERIMENTAL support for multi-threaded parsing N.B., this is not used for the moment, because use of object dtype for strings requires GIL acquisition, and this may hurt performance in a single-threaded context. I'm not completely certain that is the case, but I am out of time to explore further.
So the current tool runs in a single thread.
The GIL does not matter if the data are provided in multiple files (e.g. one file per chromosome, as in 1000 Genomes) and we use multi-processing, with each process reading one file in a single thread. The problem is that if we have ten thousand samples (columns in the VCF file) in one VCF file and need to save them into different HDF5 files, we will have to use one thread.
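A minimal sketch of the multi-file case, assuming one VCF file per chromosome. The names `count_variants` and `read_files_in_parallel` are hypothetical, and the line-counting reader is only a stand-in for whatever single-threaded reader is actually used; this is not vtools or scikit-allel code.

```python
from concurrent.futures import ProcessPoolExecutor


def count_variants(path):
    """Stand-in single-threaded reader: count the data lines of one VCF file."""
    n = 0
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and header lines ("##meta" and "#CHROM").
            if line.strip() and not line.startswith("#"):
                n += 1
    return path, n


def read_files_in_parallel(paths, reader=count_variants, max_workers=None,
                           executor_cls=ProcessPoolExecutor):
    """Run one single-threaded reader per file, each in its own worker.

    With the default process pool every worker has its own interpreter
    and therefore its own GIL, so the readers do not contend with each other.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map preserves input order; reader returns (path, result).
        return dict(pool.map(reader, paths))
```

Calling `read_files_in_parallel(["chr1.vcf", "chr2.vcf"])` (hypothetical file names) would return a dict mapping each path to its variant count. On platforms that spawn worker processes, the call should sit under an `if __name__ == "__main__":` guard. This does not help with the single large file case described above.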
I think we should stick to single thread for now.
I tried importing VCF into HDF5 using scikit-allel, following the steps in http://alimanfoo.github.io/2017/06/14/read-vcf.html
The import step is about 4 times faster than with our implementation. The timing test suggests that the main bottleneck is the LineProcessor: the current version of scikit-allel parses lines much faster than our LineProcessor. The parser in scikit-allel is written in Cython: https://github.com/cggh/scikit-allel/blob/master/allel/opt/io_vcf_read.pyx
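For reference, the conversion step from that post can be sketched roughly as follows. This assumes scikit-allel and h5py are installed; `convert_and_inspect` is a hypothetical helper name, and the dataset paths `variants/POS` and `calldata/GT` are the ones scikit-allel writes with its default field selection.

```python
def convert_and_inspect(vcf_path, h5_path):
    """Convert a VCF file to HDF5 with scikit-allel and read two datasets back."""
    # Imported lazily so this module still loads when the libraries are absent.
    import allel
    import h5py

    # Single-threaded conversion; overwrite=True replaces existing datasets.
    allel.vcf_to_hdf5(vcf_path, h5_path, overwrite=True)

    with h5py.File(h5_path, mode="r") as callset:
        positions = [int(p) for p in callset["variants/POS"][:]]
        # Genotype array shape: (variants, samples, ploidy).
        gt_shape = tuple(callset["calldata/GT"].shape)
    return positions, gt_shape
```

Wrapping the `allel.vcf_to_hdf5` call with `time.perf_counter()` would give a timing comparable to the one quoted above, although the exact speedup will depend on the file and the fields requested.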