labsquare / CuteVCF

simple viewer for variant call format using htslib
GNU General Public License v3.0

Add Filtering options #27

Open dridk opened 7 years ago

dridk commented 7 years ago

You should be able to filter based on column name.

dridk commented 7 years ago

@Arkanosis How can I perform sqlite-like filtering on a C++ list?

Arkanosis commented 7 years ago

You mean like a SQL WHERE? You can't do that very efficiently (there's a reason why people use sqlite), but assuming you're working on small lists, std::copy_if() / std::remove_copy_if() and a custom predicate might do the trick (that's somewhat expensive, but compared to the cost of displaying the result in Qt, not that much).
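A minimal sketch of the `std::copy_if()` approach, to make the idea concrete. The `Variant` struct and the `filterVariants` helper are hypothetical names made up for this example; CuteVCF's actual data structures are not shown in this thread.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical row type, for illustration only.
struct Variant {
    std::string chrom;
    int pos;
    std::string ref;
    std::string alt;
    double qual;
};

// Copy every variant that satisfies `pred` into a new vector.
// This is a single linear pass over the input, so the cost is O(n) per filter.
template <typename Predicate>
std::vector<Variant> filterVariants(const std::vector<Variant>& rows, Predicate pred) {
    std::vector<Variant> out;
    std::copy_if(rows.begin(), rows.end(), std::back_inserter(out), pred);
    return out;
}

// Example predicate, roughly "WHERE chrom = 'chr1' AND qual >= 30".
std::vector<Variant> highQualityOnChr1(const std::vector<Variant>& rows) {
    return filterVariants(rows, [](const Variant& v) {
        return v.chrom == "chr1" && v.qual >= 30.0;
    });
}
```

Each "column" filter becomes a lambda over the row type, and several conditions can be combined in one predicate so the list is still scanned only once.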

dridk commented 7 years ago

Hmm... I mean a simple filter like Excel does. I think I would lose all the benefit of htslib by using sqlite, no?

Hmm... Open a file, import it into sqlite, run a query on a region... What do you suggest?

dridk commented 7 years ago

I think I like the idea of saving all variants as a sqlite file!

Arkanosis commented 7 years ago

It all depends on how big you expect the VCF files to be. For small files, linear filtering is probably cheap enough, but on big ones, I'm afraid it's going to be noticeably slow. sqlite with proper indexes might scale much better but there's an overhead at startup.

I'd suggest linear filtering for typical Excel-sized VCFs and indexed filtering for anything larger than that (sqlite being the most convenient approach I can think of).

Now, given it displays every single row of the VCF, I assume CuteVCF is more small-files oriented, isn't it?

dridk commented 7 years ago

CuteVCF should be able to manage big files. The Qt model system is really strong and can support a huge number of rows. If I exceed my memory, I can use pagination. So I will probably make CuteVCF a strong VCF viewer/filtering application that supports different kinds of annotation definitions. I think this will be really useful. Too many people use Excel for filtering.

By the way, @Arkanosis, how many annotation specifications do you know? I only know snpEff, which puts annotations in INFO fields as follows: ANN=A|324|234

Arkanosis commented 7 years ago

In that case, you'll probably want to use some indexed backend like sqlite (which handles offsets and limits for pagination, btw).

As for annotation specs, I'm only aware of those of SnpEff (e.g. EFF=A|324|234 and ANN=A|324|234) and VEP (e.g. CSQ=A|324|234). I've never heard of any other widely used composite INFO field.
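To illustrate how such a composite field could be unpacked for filtering, here is a rough sketch in plain C++. The `split` and `annotationFields` helpers are hypothetical, not part of htslib or CuteVCF, and the sketch assumes a single annotation per key; real files often carry several comma-separated entries, which would need one more split.

```cpp
#include <string>
#include <vector>

// Split `text` on a single-character delimiter, keeping empty fields.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = text.find(delim, start);
        if (end == std::string::npos) {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

// Return the '|'-separated sub-fields stored under `key` (e.g. "ANN", "EFF",
// "CSQ") in a raw INFO string such as "DP=10;ANN=A|324|234".
// Returns an empty vector when the key is absent.
std::vector<std::string> annotationFields(const std::string& info, const std::string& key) {
    const std::string prefix = key + "=";
    for (const std::string& entry : split(info, ';')) {
        if (entry.compare(0, prefix.size(), prefix) == 0) {
            return split(entry.substr(prefix.size()), '|');
        }
    }
    return {};
}
```

Since the three fields share the same `KEY=a|b|c` shape, one parser parameterized by key name may be enough; what each sub-field means is declared in the VCF header for that key.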

dridk commented 7 years ago

OK, I have two options.

In any case, I think I can avoid table joins by saving all variant data in one table.

dridk commented 7 years ago

After some reflection overnight, I propose the following idea: for each .vcf.gz, create a sqlite clone, .vcf.db. Sqlite supports queries across different databases, so I can easily imagine intersecting two databases if necessary.