vatlab / VarStore

High Efficiency genotype data storage library
http://vatlab.github.io/VarStore/

A matrix-like storage format #8

Open BoPeng opened 7 years ago

BoPeng commented 7 years ago

https://hail.is/hail/overview.html#overview-vds

This could well be our storage library.

gaow commented 7 years ago

Yes, we brought it up once in #3. I have never had experience with it, though. I think we'd have to talk to someone at the Broad, or look into the developers' profiles, to decide whether it is something we can rely on completely.

BoPeng commented 7 years ago

It was my fault: I checked Hail but did not notice the matrix format, saw something like alpha/active development, and gave up. Now that gnomAD has been released, I guess it should be more or less ready for production.

jma7 commented 7 years ago

I went through their tutorial. Several questions:

  1. Hail can read VCF files from the same sample set that are split by chromosome all at once, but it seems VCF files from different sample sets will be stored in different VDS files? So if we want to get the genotypes of all samples, we need to check each VDS file?
  2. Since vat-dp is a web service, the results exported from Hail still need to be transformed into JSON or some other format.
  3. Hail is implemented for Python 2.7, and we are using Python 3 for VAT, so some of the functions do not work.
BoPeng commented 7 years ago
  1. This is OK because storing genotypes in different files keeps files small and allows potential parallel processing.
  2. Yes, but this is something I had in mind before the Elasticsearch idea. It might be less resource-intensive.
  3. It is not difficult to port 2.7 code to 3.x, especially if Hail is a thin wrapper around a C/C++ core.
jma7 commented 7 years ago

https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.split_multi This is Hail's way of handling multiallelic variants; is it acceptable? It seems there is no ID field for variants; I think we could add the variantID obtained from the variant server as an annotation to the variant and save it in the VDS file.

BoPeng commented 7 years ago

This is acceptable, although it does not solve the problem of removing one variant, and the genotype would be recognized as a regular heterozygote (which we differentiate with -1). Given that using -1 would significantly complicate the code and is still not perfect (because we still do not know where the other -1 is), let us adopt Hail's solution.

Yes, we can use any ID for now.

jma7 commented 7 years ago

I did some testing of Hail using the chr22 VCF file from the 1000 Genomes Project: there are about 1.1 million variants and 2,504 samples, and the VDS file is around 250 MB on disk. The queries we'd like to make are for genotypes by sample name or variantID. The variantID here is the hgvsID given by the variant server and annotated back into the variant annotation field of the VDS file. Based on the suggestions from the Hail team, import_vcf should only be used once at the beginning, as hc.import_vcf(filepath).write(vdspath); after that, read the VDS file with hc.read(vdspath) for all subsequent operations. This step improves performance a lot.

On my Mac mini with 8 GB memory and 8 cores:

Query by sample name takes about 30 seconds: vds.filter_samples_expr('s=="HG00096"').query_genotypes('json(gs.map(g=>{gt:g.gt,hgvsID:va.vID.hgvsID}).collect())')

Query by variant ID takes about 10 seconds: vds.filter_variants_expr('va.vID.hgvsID=="chr22:g.16211244G>T"').query_genotypes('json(gs.map(g=>{s:s.id,gt:g.gt}).collect())')

There is not much memory usage but a lot of disk IO; all cores on the desktop were used while running, so if more cores are available the running time should drop accordingly.
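
For reference, here is a minimal sketch of the workflow described above, using only the calls mentioned in this thread; the HailContext construction and the file paths are assumptions about the Hail 0.1 Python API, not tested code.

from hail import HailContext  # assumed Hail 0.1 entry point

hc = HailContext()

# One-time conversion: import the VCF and write it out as a VDS.
hc.import_vcf('chr22.1kg.vcf.bgz').write('chr22.1kg.vds')

# All later runs read the VDS directly, which is much faster than re-importing.
vds = hc.read('chr22.1kg.vds')

# Genotypes of one sample, reported with the hgvsID annotation (assumed stored at va.vID.hgvsID).
by_sample = (vds.filter_samples_expr('s=="HG00096"')
                .query_genotypes('json(gs.map(g=>{gt:g.gt,hgvsID:va.vID.hgvsID}).collect())'))

# Genotypes of all samples at one variant.
by_variant = (vds.filter_variants_expr('va.vID.hgvsID=="chr22:g.16211244G>T"')
                 .query_genotypes('json(gs.map(g=>{s:s.id,gt:g.gt}).collect())'))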

jma7 commented 7 years ago

A post about gvcf and Hail. http://discuss.hail.is/t/adding-samples-incrementally/85

BoPeng commented 7 years ago

So Hail is more limited than we originally expected.

jma7 commented 7 years ago

Hail is a project under active development; you can expect a lot of changes.

jma7 commented 7 years ago

About HDF5: I saved the VCF genotypes in a scipy sparse matrix and then wrote it into HDF5. Retrieval of genotypes by row is very fast: 10,000 variants across all samples in 2.5 seconds. Retrieval of a submatrix takes longer: 10,000 variants for 20 samples takes 9 seconds.
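
To illustrate why row retrieval is fast, here is a minimal sketch (not the actual test code) of storing a CSR matrix in HDF5 as its three component arrays and reading back a row range without loading the whole matrix; the h5py backend, file name, and dataset names are chosen for illustration.

import h5py
import numpy as np
from scipy import sparse

def write_csr(filename, m):
    # Store a CSR matrix (variants x samples) as its component arrays plus shape.
    m = sparse.csr_matrix(m)
    with h5py.File(filename, 'w') as f:
        f.create_dataset('data', data=m.data, compression='gzip')
        f.create_dataset('indices', data=m.indices, compression='gzip')
        f.create_dataset('indptr', data=m.indptr, compression='gzip')
        f.attrs['shape'] = m.shape

def read_rows(filename, start, stop):
    # Read rows [start, stop) only: slice indptr, then the matching data/indices ranges.
    with h5py.File(filename, 'r') as f:
        indptr = f['indptr'][start:stop + 1]
        lo, hi = indptr[0], indptr[-1]
        data = f['data'][lo:hi]
        indices = f['indices'][lo:hi]
        n_cols = f.attrs['shape'][1]
    return sparse.csr_matrix((data, indices, indptr - lo), shape=(stop - start, n_cols))

Retrieving a submatrix of a few samples is slower with this row-oriented layout because column selection still has to scan every requested row.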

Meanwhile, I found a Python package for dealing with VCF in HDF5 format; have you heard of it, or do you want to try it? https://github.com/alimanfoo/vcfnp

BoPeng commented 7 years ago

No, I have not heard of this package, but I would not hesitate to use it if it handles multi-allelic cases correctly.

gaow commented 7 years ago

@jma7 I used this package back in 2014. I think I had a benchmark somewhere that showed quite a big improvement over vtools. In fact, vcfnp was what motivated us to look into HDF5 as an option. I thought our main problem back then with sparse HDF5 was that we could not properly store genotype annotation information (GD, GQ, etc.), which is not sparse.

jma7 commented 7 years ago

vcfnp doesn't have the option to store genotypes as a sparse matrix in HDF5 either. We can use it to store variant information. It seems we have to store genotype annotation information as a dense matrix, since it is not sparse. Or maybe we just leave that information in the VCF file and get it some other way if it is not used frequently? How often do you need genotype annotation information? I didn't see those fields in the chr22 1000 Genomes VCF file.

BoPeng commented 7 years ago

It appears to me that support for gVCF is critically important and that we are going to store a lot more information for analysis, enough to reconstruct the gVCF file and perform joint calling. So I think the priority is to make sure that the library (VCF + HDF5) can store all information from a gVCF file. Otherwise we might have to use TileDB, which from what I read supports both sparse and dense arrays (whereas GenomicsDB only stores data sparsely).

jma7 commented 7 years ago

Based on what I have read, gVCF is one sample per file? Is it necessary to save gVCF into HDF5? If you want to analyze multiple gVCFs, you can merge them as gVCF files first, transform to VCF, and then do the analysis.

jma7 commented 7 years ago

Is gVCF only generated by GATK? 10K genomes doesn't provide gVCF files; do we need to generate those files ourselves?

http://lists.ensembl.org/pipermail/dev/2016-August/012079.html

BoPeng commented 7 years ago

gVCF can be multi-sample, but then it will have ten times more sites because each gVCF has different "sites". It would be much better to store such information in a sparse manner. One scenario we have discussed is that

  1. we store variants and non-variant sites in DP.
  2. each DP reports variants to VS.
  3. VAT gets a combined list of variants from VS.
  4. VAT requests ALL variants from each DP.
  5. each DP recalls variants from non-variant sites (if the information is available for a subset of samples) and returns genotypes for all variants (see the sketch below).
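
The following is a minimal sketch of what step 5 could look like for a single sample; the block representation and the function name are illustrative assumptions, not an agreed design.

# Hypothetical per-sample gVCF-like data: explicit variant calls plus
# reference ("non-variant") blocks covering the remaining sites.
variant_calls = {16050075: '0/1', 16051249: '1/1'}               # pos -> genotype
reference_blocks = [(16050000, 16050074), (16050076, 16052000)]  # closed intervals called ref/ref

def recall(positions):
    # Return a genotype for every requested position: an explicit call if one exists,
    # 0/0 if the position falls inside a reference block, otherwise missing ('./.').
    out = {}
    for pos in positions:
        if pos in variant_calls:
            out[pos] = variant_calls[pos]
        elif any(start <= pos <= end for start, end in reference_blocks):
            out[pos] = '0/0'
        else:
            out[pos] = './.'
    return out

# VAT asks this DP for genotypes on the combined variant list from VS.
print(recall([16050075, 16050100, 16053000]))
# {16050075: '0/1', 16050100: '0/0', 16053000: './.'}
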
gaow commented 7 years ago

Sorry got stuck in a meeting ... @jma7 yes I remember now. Back then we implemented a version of sparse HDF5 along the lines of some blog posts.

Or maybe just leave those information in the VCF file and get them through some other ways if not used frequently? How often do you need genotype annotation information?

Not very often, I agree, not after some initial QC.

But I should point out that the genotype matrix itself can also be dense in many cases -- for imputed data! All the genotype data I work with today are imputed and are not integers. But I'm not sure what is going to happen in the future. If we have whole-genome sequence for all our samples, there will be no need to impute. Still, it is helpful to keep in mind that we do not necessarily work with a sparse matrix.

We are applying for UK Biobank data for a project I'm working on. We have not got the data yet. But it has 500,000 samples on 50 traits. I'm eager to see what the data looks like. Hopefully it is sparse ...

gaow commented 7 years ago

Imputed data in VCF may look like:

0/1:0.92

where the genotype is the most likely genotype and the annotation is the allelic dosage. In a lot of applications we do want to use the dosage in place of the genotype.
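
For illustration, a minimal sketch of pulling the dosage out of such a sample field; the exact FORMAT layout (GT followed by a single dosage value) is an assumption based on the example above.

def parse_sample_field(field):
    # Split a 'GT:dosage' sample field such as '0/1:0.92' into (genotype, dosage).
    gt, dosage = field.split(':')
    return gt, float(dosage)

gt, ds = parse_sample_field('0/1:0.92')
# Downstream analyses can use ds (0.92) in place of the hard call gt ('0/1').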

gaow commented 7 years ago

Any new progress on this matter? I just want to add one more factor to consider: if we do want to come up with an offline version to replace the current variant tools, then the HDF5 format is very appealing because it has an R interface that makes integrating downstream analysis methods quite easy. This was an important aim in our grant. I just worry that if we come up with a solution that is very specific and only optimized for online data storage/distribution, it will limit its potential to be adopted by individual researchers. The PLINK binary format is a great example of this -- many other programs have adopted it. HDF5 would be even better because it is more generic.

Also, as I previously pointed out, genotype data are not sparse in many cases, so we may want to have versions for both. But we can surely get away with float16, which will reduce the size to 25% (compared to float64).
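
A minimal sketch of the dense float16 idea; the h5py backend, file name, and chunk shape are assumptions for illustration.

import h5py
import numpy as np

# Imputed dosages (variants x samples), generated here as float64.
dosages = np.random.uniform(0, 2, size=(1000, 500))

with h5py.File('dosages.h5', 'w') as f:
    # float16 uses 2 bytes per entry instead of 8, i.e. 25% of the float64 size.
    f.create_dataset('dosage', data=dosages.astype(np.float16),
                     chunks=(100, 500), compression='gzip')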

gaow commented 7 years ago

I'm asking because I'm getting to the point in my eQTL mapping method project of applying it to data. I envisage an SoS pipeline for my eQTL method and others that is reproducible, works on an HPC, and keeps all data in HDF5 (I'm going to write a converter myself) that loads quickly and allows other R-based software to operate on it easily. The eQTL analysis will involve range-based annotations (cisSNPs to genes), and functional annotations if I decide to apply some informed priors. It would be nice to contribute this product as part of the new VAT.

BoPeng commented 7 years ago

We are testing HDF5.

BoPeng commented 7 years ago

BTW, we have decided to fix "local" storage before we come up with something distributed, so the focus of development is moving to the VariantTools repository, where a new branch will be used for the HDF5 test/migration.

gaow commented 7 years ago

Oh OK, this is a great move! So I'll keep an eye there. In the meantime I'll perhaps do my project my way via the pandas interface to HDF5. I will use pandas.read_csv to load my genotype data by chunk and append to an HDF5 file, I guess -- unless the import part of the HDF5 branch will be good for alpha use very soon.
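
A minimal sketch of that chunked import, assuming a tab-delimited genotype matrix; the file names, chunk size, and compression settings are made up.

import pandas as pd

# Read the genotype matrix in chunks and append each chunk to an HDF5 store.
with pd.HDFStore('genotypes.h5', complevel=9, complib='zlib') as store:
    for chunk in pd.read_csv('genotypes.tsv', sep='\t', chunksize=100000):
        store.append('genotype', chunk, index=False)

# Later, load slices back without reading the whole table.
# df = pd.read_hdf('genotypes.h5', 'genotype', start=0, stop=10000)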

Frankly, I think fixing storage in a generic format is more appealing to the community, and I can make concrete contributions there. I assume that by porting the rare variant stuff over, plus some contributions hopefully from the Leal lab on newer methods, and mine on eQTL and GWAS, we should have something of decent popularity. We can always make a distributed version for the next proposal :)

BoPeng commented 7 years ago

The HDF5 backend will not be usable for a while because variant tools is large (which is now a burden?) and many commands handle genotype information. The plan is to first create a separate genotype storage class/module to separate this functionality from the core of vtools and form an API, then replace the implementation with HDF5 (and possibly something else).
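
A minimal sketch of what such a storage abstraction could look like; the class and method names are purely illustrative, not the planned vtools API.

from abc import ABC, abstractmethod

class GenotypeStore(ABC):
    # Hypothetical interface separating genotype storage from the rest of vtools.

    @abstractmethod
    def write_genotypes(self, sample, variant_ids, genotypes):
        """Store genotypes for one sample at the given variant IDs."""

    @abstractmethod
    def genotypes_by_sample(self, sample):
        """Return (variant_id, genotype) pairs for one sample."""

    @abstractmethod
    def genotypes_by_variant(self, variant_id):
        """Return (sample, genotype) pairs for one variant."""

# An implementation backed by the existing storage could satisfy this interface first;
# an HDF5-backed one could later be swapped in without touching the calling code.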

gaow commented 7 years ago

The plan is to first create a separate genotype storage class/module

This is good enough for me. I'll migrate my procedure over once it is in place.

BoPeng commented 7 years ago

OK. I cannot say I am wrapping up SoS now, but the last important piece is done (click the task ID to check the status of tasks, and I even have a nice CPU/MEM usage plot there :-), so the rest should be debugging and performance tuning. I should be able to spend more time on vtools soon.

BoPeng commented 7 years ago

BTW, you should then revisit your bugs and try to confirm/resolve them.

gaow commented 7 years ago

BTW, you should then revisit your bugs and try to confirm/resolve them.

I'm inclined to retire my association tests altogether and try to make a good interface with rvtests (Zhan et al. 2016). In fact, rvtests also borrows code from VAT. It currently has more tests than VAT, as we have not added any new ones to VAT since 2013. The only advantage of the VAT association testing code is that it is modular and others could potentially derive from it, but apparently it is not valued by others. Another type of modularity, i.e., connecting well with other dedicated tools for association testing, is perhaps the way to go. There is also less of a maintenance burden for us (I personally no longer do any rare variant association studies).

BoPeng commented 7 years ago

I was referring to your tickets on SoS, such as sos remove.

gaow commented 7 years ago

Oh yes, sorry, getting there eventually!

gaow commented 7 years ago

In case it is useful, here is a GData class I wrote in 2014 that explores storing phased genotype data as a dense or sparse HDF5 matrix.

'''
http://stackoverflow.com/questions/11129429/storing-numpy-sparse-matrix-in-hdf5-pytables/22589030#22589030
https://pytables.github.io/usersguide/tutorials.html
http://www.philippsinger.info/?p=464
https://pytables.github.io/usersguide/optimization.html
'''
import numpy as np
from scipy import sparse
import tables as tb
import re
import gzip
import os

class GData(dict):
    def __init__(self, data, name, msg = None, gtype = 'haplotype', close = True):
        self.__group = re.sub(r'[^a-zA-Z0-9_]', '_', name)
        self.__msg = msg
        self.__genotype_name = gtype
        self.__close = close
        self[self.__genotype_name] = [[]]
        try:
            if type(data) is dict:
                self.update(data)
            elif type(data) is str:
                # is file name
                self.__load(tb.open_file(data))
            else:
                # is file stream
                self.__load(data)
        except tb.exceptions.NoSuchNodeError:
            raise ValueError('Cannot find dataset {}!'.format(name))
        self.tb_filters = tb.Filters(complevel=9, complib='bzip2')

    def sink(self, filename):
        with tb.open_file(filename, 'a') as f:
            if not f.__contains__('/{}'.format(self.__group)):
                f.create_group("/", self.__group, self.__msg if self.__msg else 'variant & genotype data')
            else:
                if self.__msg:
                    eval('f.root.%s'%self.__group)._v_title = self.__msg
            for key in self:
                foo = self.__store_sparse_genotype if key == self.__genotype_name else self.__store_array
                foo(key, f)
            f.flush()

    def compress(self):
        if self[self.__genotype_name].__class__ != sparse.csr.csr_matrix:
            self[self.__genotype_name] = self.__compress(self[self.__genotype_name])

    def decompress(self, tolist = True):
        def convert(item):
            # implementation of sparse matrix to numpy 2D array or list
            # return item.todense().view(np.ndarray)
            if tolist:
                return item.todense().tolist()
            else:
                return item.todense()
        if self[self.__genotype_name].__class__ == sparse.csr.csr_matrix:
            self[self.__genotype_name] = convert(self[self.__genotype_name])

    def decompress_t(self, tolist = True):
        def convert(item):
            # implementation of sparse matrix to numpy 2D array or list
            # return item.todense().view(np.ndarray)
            if tolist:
                return item.todense().T.tolist()
            else:
                return item.todense().T
        if self[self.__genotype_name].__class__ == sparse.csr.csr_matrix:
            self[self.__genotype_name] = convert(self[self.__genotype_name])

    def __load(self, fstream):
        # load data to dictionary 'self'
        is_gt_loaded = False
        for element in fstream.list_nodes('/{}'.format(self.__group)):
            if element.name.startswith(self.__genotype_name):
                if not is_gt_loaded:
                    self[self.__genotype_name] = self.__load_sparse_genotype(fstream)
                    is_gt_loaded = True
                else:
                    continue
            else:
                self[element.name] = element[:]
        if self.__close:
            fstream.close()

    def __compress(self, item):
        return sparse.csr_matrix(np.array(item, dtype=np.int8))

    def __roll_back(self, group, name):
        try:
            n = getattr(group, name)
            n._f_remove()
        except AttributeError:
            pass

    def __store_array(self, name, fstream):
        element = eval('fstream.root.%s' % self.__group)
        arr = self[name]
        if type(arr) is list:
            arr = np.array(arr)
        self.__roll_back(element, name)
        #
        if arr.shape != (0,):
            ds = fstream.create_carray(element, name, tb.Atom.from_dtype(arr.dtype), arr.shape,
                                       filters = self.tb_filters)
            ds[:] = arr

    def __store_sparse_genotype(self, name, fstream):
        m = self[name]
        if m.__class__ != sparse.csr.csr_matrix:
            m = self.__compress(m)
        #
        attrs = ('data', 'indices', 'indptr', 'shape')
        element = eval('fstream.root.%s' % self.__group)
        for par in attrs:
            full_name = '%s_%s' % (self.__genotype_name, par)
            self.__roll_back(element, full_name)
            #
            arr = np.array(getattr(m, par))
            if arr.shape != (0,):
                ds = fstream.create_carray(element, full_name, tb.Atom.from_dtype(arr.dtype), arr.shape,
                                           filters = self.tb_filters)
                ds[:] = arr
            else:
                # data is empty, have to roll back
                for table in ['%s_%s' % (self.__genotype_name, x) for x in attrs]:
                    self.__roll_back(element, table)
                break

    def __load_sparse_genotype(self, fstream):
        element = eval('fstream.root.%s' % self.__group)
        pars = []
        for par in ('data', 'indices', 'indptr', 'shape'):
            pars.append(getattr(element, '%s_%s' % (self.__genotype_name, par)).read())
        m = sparse.csr_matrix(tuple(pars[:3]), shape=pars[3])
        return m

if __name__ == '__main__':
    # dat = {'haplotype':[1,0,0,0,1], 'variant':[2,3,4,5,6], 'info':[b'1',b'2',b'3',b'4',b'5']}
    # Use byte strings for text columns: PyTables carrays do not accept unicode atoms.
    dat = {'haplotype':[], 'variant':[], 'info':[b'1',b'2',b'3',b'4',b'5']}
    dat = GData(dat, 'dat')
    dat.sink('dat.h5')
    dat = GData('dat.h5', 'dat')
    print(dat)
    dat.decompress()
    print(dat)
gaow commented 7 years ago

I've been using plink 1.9 for almost 2 weeks now on a 635-sample * 40-million-variant dataset. I initially started this project with vtools, then switched to vcftools, which made me feel worse. Finally, reluctantly, I moved on to plink 1.9. I have to admit I'm very impressed by its speed and, in most cases, its memory usage. It is a huge improvement over plink 1.07. Surely it lacks annotation etc., but its extract feature (given variant IDs to select from) is extremely efficient. Importing from VCF to PLINK binary format was amazingly fast with a single thread.

Perhaps I'm just thinking aloud now ... shall we combine variant annotation tools with PLINK 1.9? We basically build an alternative to PLINK's bim text format for variant info, and let PLINK take it from there?

BoPeng commented 7 years ago

I have heard this, because some serious programming effort has been devoted to improving plink and resulted in plink 1.9, but I am not sure whether the storage model could be borrowed. Perhaps we should have a look at its code.

gaow commented 7 years ago

Can we take a hybrid, modular approach that utilizes SoS? For example, at the variant selection level we focus on improving our annotation etc. and on generating variant IDs for PLINK. Then, given a list of 36 million variant IDs in a text file, PLINK is able to subset those IDs from a 54-million-variant * 635-sample dataset, both genotypes and variants, in 1 minute and 46 seconds.
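
A minimal sketch of that hybrid flow, driving PLINK from Python; the file names are made up, while --bfile, --extract, --make-bed, and --out are standard PLINK 1.9 options.

import subprocess

# Step 1 (hypothetical): the annotation/selection step writes one variant ID per line.
with open('selected_ids.txt', 'w') as f:
    f.write('\n'.join(['rs123', 'rs456', '22:16050075:A:G']))

# Step 2: let PLINK 1.9 do the heavy lifting of subsetting genotypes and variants.
subprocess.check_call([
    'plink',
    '--bfile', 'cohort',             # cohort.bed/.bim/.fam
    '--extract', 'selected_ids.txt',
    '--make-bed',
    '--out', 'cohort.subset',
])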

If we can put together the power of vtools in annotation with the power of plink in data processing, even in this ugly way, there should be a lot of benefit already. We can then focus on making annotation better, which by itself can be a lot of work considering the scale of data nowadays.

We only have to think about how to overcome its limitations, i.e., what vtools can currently do that PLINK cannot. As long as a "clean" VCF file is provided (genotype fields with only genotype calls, not other scores), PLINK works quite well for me -- except that currently multi-allelic sites cannot be handled after VCF import, which I can easily hack around by using Python to make the variant IDs unique. However, in PLINK 2.0 even this issue will be resolved.