try vcf importing into hdf5 using scikit-allel

vatlab / VarStore

High Efficiency genotype data storage library

http://vatlab.github.io/VarStore/


try vcf importing into hdf5 using scikit-allel #18

Open jma7 opened 7 years ago

jma7 commented 7 years ago

I tried importing vcf into hdf5 using scikit-allel following the steps in http://alimanfoo.github.io/2017/06/14/read-vcf.html

The import step is about 4 times faster than our implementation. The timing test suggests that the main bottleneck is the LineProcessor: the current version of scikit-allel parses lines much faster than our LineProcessor does. The parser in scikit-allel is written in Cython: https://github.com/cggh/scikit-allel/blob/master/allel/opt/io_vcf_read.pyx
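For reference, the conversion step from that post can be sketched roughly as below. `allel.vcf_to_hdf5` is the scikit-allel function the post uses. The wrapper name `vcf_to_h5` and the file names are made up for illustration, and the import is deferred so the path check runs even where scikit-allel is not installed.

```python
import os


def vcf_to_h5(vcf_path, h5_path, fields="*"):
    """Convert one VCF file to HDF5 with scikit-allel (hypothetical wrapper)."""
    if not os.path.exists(vcf_path):
        raise FileNotFoundError(vcf_path)
    # Deferred import: needs a scikit-allel version that provides vcf_to_hdf5.
    import allel
    # fields="*" asks for all fields; overwrite=True replaces existing datasets.
    allel.vcf_to_hdf5(vcf_path, h5_path, fields=fields, overwrite=True)
    return h5_path
```

A call would look like `vcf_to_h5("example.vcf", "example.h5")`, after which the result can be opened with h5py and datasets such as `calldata/GT` read from it, as the post describes.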

BoPeng commented 7 years ago

That was exactly my suspicion; I knew that part could be improved because of its extensive use of lambda functions, etc. I was actually surprised when you said other VCF-reading tools are comparable to or worse than vtools. Let us switch while keeping the old code for non-VCF input files.
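To make the lambda overhead concrete, here is a minimal sketch. It assumes a hypothetical lambda-per-field design and is not the actual LineProcessor code; the function names and the sample line are invented for illustration. Both parsers return the same tuple, and the timing loop lets you compare them on your own machine.

```python
# Hypothetical lambda-per-field design (NOT the actual LineProcessor code).
FIELD_EXTRACTORS = [
    lambda cols: cols[0],       # CHROM
    lambda cols: int(cols[1]),  # POS
    lambda cols: cols[3],       # REF
    lambda cols: cols[4],       # ALT
]


def parse_with_lambdas(line):
    """Split once, then make one extra function call per extracted field."""
    cols = line.rstrip("\n").split("\t")
    return tuple(extract(cols) for extract in FIELD_EXTRACTORS)


def parse_direct(line):
    """Split once and index the columns directly."""
    cols = line.rstrip("\n").split("\t")
    return (cols[0], int(cols[1]), cols[3], cols[4])


if __name__ == "__main__":
    import timeit

    sample = "1\t10177\trs367896724\tA\tAC\t100\tPASS\t.\n"
    for parser in (parse_with_lambdas, parse_direct):
        seconds = timeit.timeit(lambda: parser(sample), number=100_000)
        print(parser.__name__, seconds)
```

A pure-Python comparison like this only isolates the per-field call overhead; it says nothing about how much a Cython parser gains on top of that.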

jma7 commented 7 years ago

The other VCF-reading tools are indeed slow because they pack each line into an object first... The scikit-allel implementation is different and faster, so we can use it as a reference and make it work with variantTools.
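The difference between the two layouts can be sketched as follows. This is only an illustration of object-per-line versus column-wise storage in plain Python; the `Record` class and both function names are hypothetical, and scikit-allel itself does its parsing in Cython and returns NumPy arrays.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One Python object per VCF line (the slower, object-per-line layout)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_to_records(lines):
    records = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        records.append(Record(cols[0], int(cols[1]), cols[3], cols[4]))
    return records


def parse_to_columns(lines):
    """Append each field to its own column; no per-line object is built."""
    columns = {"CHROM": [], "POS": [], "REF": [], "ALT": []}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        columns["CHROM"].append(cols[0])
        columns["POS"].append(int(cols[1]))
        columns["REF"].append(cols[3])
        columns["ALT"].append(cols[4])
    return columns
```

The column-wise form maps naturally onto HDF5 datasets, since each column can be written out as one array.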

BoPeng commented 7 years ago

Not sure about multi-processing though. Perhaps there is no need for MP for reading if we use this tool.

jma7 commented 7 years ago

The author mentioned:

EXPERIMENTAL support for multi-threaded parsing N.B., this is not used for the moment, because use of object dtype for strings requires GIL acquisition, and this may hurt performance in a single-threaded context. I'm not completely certain that is the case, but I am out of time to explore further.

The current tool is single-threaded.

BoPeng commented 7 years ago

The GIL does not matter if the data are provided in multiple files (e.g. one file per chromosome, as in 1000 Genomes) and we read them with multiple processes, each doing single-threaded reading. The problem is that if we have ten thousand samples (columns in the VCF file) in one VCF file and need to save them into different HDF5 files, we will have to use one thread.

I think we should stick to single thread for now.
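The multi-file case mentioned above (one process per chromosome file, each reading single-threaded) could look roughly like this sketch. The names `count_records` and `process_files` are hypothetical, and the worker only counts records as a stand-in; a real worker would convert its file to HDF5 instead.

```python
from concurrent.futures import ProcessPoolExecutor


def count_records(vcf_path):
    """Stand-in worker: count non-header lines in one VCF file, single-threaded."""
    with open(vcf_path) as handle:
        return sum(1 for line in handle if not line.startswith("#"))


def process_files(paths, worker=count_records,
                  executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run one single-threaded worker per file, spread across processes.

    Each process has its own interpreter, so the GIL in one worker
    does not block the others.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(worker, paths)))
```

On platforms that start worker processes with spawn, call `process_files` from under an `if __name__ == "__main__":` guard. This scheme does not help with the single large file case, where one reader still has to feed all the per-sample outputs.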