vatlab / VarStore

High Efficiency genotype data storage library
http://vatlab.github.io/VarStore/

Re-organization of vat-dp and the goals of this project. #15

Open BoPeng opened 7 years ago

BoPeng commented 7 years ago

As you have seen, the repo has been renamed to VarStore, a name that is not at all fancy and is already used by others, such as Variant Store. We can change it to something else if you have a better idea.

I have also revisited ga4gh. Despite its strange name, it is something we should follow so that our data storage model can exchange data with other storage mechanisms. ga4gh even has a compliance test suite that we can use to test compatibility.

The goal of this project is twofold:

  1. On the variant tools side, make vtools ga4gh compliant so that it can work with any ga4gh-compliant data storage model. We could cache data in whatever format, but the key is to allow variant tools to analyze web-based data.

  2. On the VarStore side, implement an HDF5 storage model aimed at highly efficient data storage and retrieval for association analysis.

gaow commented 7 years ago

Sounds great! But I'm wondering what use cases these efforts are trying to help with, and what will happen to the current version of vtools.

> but the key is to allow variant tools to analyze web-based data.

Or rather, web-based annotation databases? I think the current vtools is good for many users, but keeping local copies of annotation databases and having to update them frequently is the biggest headache.

> aimed at highly efficient data storage and retrieval for association analysis.

This is essentially saying we want to implement a good export interface that saves data to formats that other software, e.g. rvtests, can pick up (cleaned VCF files for rvtests), and specifically design a good way to interact with R-based methods. I don't think we have the manpower to keep our existing association testing module comparable to what rvtests can offer, and there may be more association methods to come. So I guess this is saying we essentially replace the SQLite-based genotype DB with whatever VarStore adopts, and make sure all features previously available from vtools still work here?

You see, my two comments are both about improving what vtools/vat already offers. Frankly, I'm not sure we want to go too far into generic "web-based" data analysis unless there is a motivating use case from either us or our collaborators, because I do not know what we are going to analyze ... but maybe you have something concrete in mind?

BoPeng commented 7 years ago

Essentially, I would like to have a storage model optimized for association analysis, which is the goal of this project. ga4gh, in turn, is a web interface that allows us to access large amounts of data. What I meant was to provide a ga4gh interface for our storage model, at least partially, so that VAT can download other data through ga4gh and save it in OUR storage model for efficient analysis.
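To make the "download through ga4gh, save in our storage model" flow concrete, here is a rough sketch and not actual VarStore or vtools code. It assumes the paged variant search of the GA4GH schema (request fields `variantSetId`, `referenceName`, `start`, `end`, `pageSize`, `pageToken`; response fields `variants` and `nextPageToken`); these field names should be checked against the schema version we target. `Fetch`, `LocalStore`, and `cache_region` are hypothetical names, and the dict inside `LocalStore` is only a placeholder for the HDF5 model.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical transport: takes a search request body and returns the
# decoded JSON response. A real one would POST to a GA4GH server.
Fetch = Callable[[dict], dict]


def search_variants(fetch: Fetch, variant_set_id: str, reference_name: str,
                    start: int, end: int, page_size: int = 100) -> Iterator[dict]:
    """Yield variants page by page until the server stops returning a token."""
    page_token: Optional[str] = None
    while True:
        request = {
            "variantSetId": variant_set_id,
            "referenceName": reference_name,
            "start": start,
            "end": end,
            "pageSize": page_size,
        }
        if page_token:
            request["pageToken"] = page_token
        response = fetch(request)
        yield from response.get("variants", [])
        page_token = response.get("nextPageToken")
        if not page_token:
            break


class LocalStore:
    """Placeholder for the local storage model (a dict here, not HDF5)."""

    def __init__(self) -> None:
        self.genotypes: Dict[Tuple, List[int]] = {}

    def add(self, variant: dict) -> None:
        key = (variant["referenceName"], variant["start"],
               variant["referenceBases"], tuple(variant["alternateBases"]))
        # One non-reference allele count per call (sample); entries that are
        # 0 or negative (reference or no-call, by assumption) are not counted.
        self.genotypes[key] = [
            sum(1 for allele in call["genotype"] if allele > 0)
            for call in variant.get("calls", [])
        ]


def cache_region(fetch: Fetch, store: LocalStore, variant_set_id: str,
                 reference_name: str, start: int, end: int) -> int:
    """Download one region and save it locally; return the variant count."""
    count = 0
    for variant in search_variants(fetch, variant_set_id, reference_name,
                                   start, end):
        store.add(variant)
        count += 1
    return count
```

Keeping the transport behind `fetch` means the same loop could be exercised against a canned response in tests and against a real server later, and swapping the dict for an HDF5-backed store would not change the download side.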

gaow commented 7 years ago

OK, then there are two essential elements: the need to "download" other data for VAT analysis, and the need to analyze data using our storage model directly. I was trying to argue that perhaps we cannot find a good use case for either of them: users of association studies often have to download data from dbGaP or just analyze their own data, and the tools we provide with VAT may soon become obsolete (if they are not already). The parts that I think many users will need and that will not become obsolete are the data processing utilities we offer and the annotations, which is what PLINK 2.0 cannot do. I just worry that we make great products yet cannot find anyone who needs them.

BoPeng commented 7 years ago

This depends on the fate of ga4gh, or of internet-based data sharing in general. We will surely support the traditional dbGaP-style data aggregation, but will have to keep an eye on ga4gh and try to be compatible with it in order to support such internet-based data analysis.