plantbreeding / BrAPI

Repository for version control of the BrAPI specifications
https://brapi.org
MIT License

Filtering variantmatrix data by MAF range and max missing #546

Open khaled-alshamaa opened 2 years ago

khaled-alshamaa commented 2 years ago

Can the proposed /search/variantmatrix endpoint have basic filtering fields (e.g., minmaf, maxmaf, and missingData), just like the GA4GH /variants/search endpoint?

I can see the great advantage of this new "variantmatrix" endpoint, saving time and bandwidth when retrieving genotyping data. But we will lose a significant part of that advantage by forcing the clients to request the whole data set and do that filtering at their end!

I believe it would be wise to let the requester apply that kind of filtering before the requested data is packed and sent back via the API. That would serve the primary goal of having this "variantmatrix" endpoint in the first place, wouldn't it?
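To make the cost concrete, here is a minimal sketch of the filtering a client has to run today after downloading the full matrix. It assumes a hypothetical encoding that is not part of the thread: biallelic variants, diploid samples, each call stored as the alternate-allele count (0, 1 or 2) and `None` for a missing genotype. The function names and the `min_maf`, `max_maf` and `max_missing` arguments are illustrative only.

```python
from typing import List, Optional

# Hypothetical encoding (an assumption, not the BrAPI wire format):
# one row per variant, one column per sample, each call is the count of
# alternate alleles in a diploid genotype (0, 1, 2) or None when missing.
Calls = List[Optional[int]]


def maf(calls: Calls) -> Optional[float]:
    """Minor allele frequency over the non-missing calls of one variant."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return None  # nothing genotyped, MAF is undefined
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


def missing_rate(calls: Calls) -> float:
    """Fraction of samples with no genotype call for one variant."""
    if not calls:
        return 0.0
    return sum(c is None for c in calls) / len(calls)


def filter_variants(matrix: List[Calls], min_maf: float = 0.0,
                    max_maf: float = 0.5, max_missing: float = 1.0) -> List[int]:
    """Return the row indices of variants that pass all three thresholds."""
    kept = []
    for index, calls in enumerate(matrix):
        frequency = maf(calls)
        if frequency is None or missing_rate(calls) > max_missing:
            continue
        if min_maf <= frequency <= max_maf:
            kept.append(index)
    return kept


matrix = [
    [0, 0, 1, 2],        # MAF 0.375, no missing data
    [0, 0, 0, 0],        # monomorphic, MAF 0.0
    [None, None, 1, 0],  # MAF 0.25, half of the calls missing
    [1, 1, 1, 1],        # MAF 0.5, no missing data
]
passing = filter_variants(matrix, min_maf=0.05, max_maf=0.5, max_missing=0.2)
```

In this toy matrix only rows 0 and 3 pass, yet all four rows had to be transferred before the client could find that out.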

GuilhemSempere commented 2 years ago

Actually, the plain GA4GH specs do not support these filtering parameters: https://rest.ensembl.org/documentation/info/gavariants. They were added as an enrichment to the Gigwa V2 implementation because we wanted to make it API-driven without losing any of V1's functionality. Of course, I can only agree that these are pretty useful filters, although applying them can take a long time on large datasets.

patrick-koenig commented 2 years ago

Hello, this issue is somewhat related to #551 as calculated metrics/statistics like MAF, heterozygosity etc. can also be seen as variant-level metadata like the INFO column metadata of VCF. So we can discuss both topics in context in the meetings of the Genotyping Call Enhancement Working Group.

khaled-alshamaa commented 3 months ago

@GuilhemSempere I would like to see GIGWA support all of these (optional/extra) filtering parameters in the allele matrix search call: minMaf, maxMaf, minMissingData, maxMissingData, minHeZ, and maxHeZ, just as was done for the GA4GH /variants/search call.

@BrapiCoordinatorSelby If we agree on their usefulness from the application point of view, why not suggest having them in the BrAPI /search/allelematrix parameters?

@patrick-koenig I understand the service provider's argument that such parameters can put a huge computational load on the server side. But what is the alternative? The client application can do nothing but extract the whole dataset and then filter it on its side. That scenario implies higher data transfer costs and fewer requests the server can handle, because of the transfer time each one requires!

@BrapiCoordinatorSelby I believe adding these filtering parameters can be a win-win situation for both server and client implementation in most production environments.
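As a rough illustration of the proposal, the sketch below builds a request body that carries the six optional filters named above. These fields are only what this thread suggests; they are not part of the published BrAPI specification. The `build_allelematrix_search` helper, the `variantSetDbIds` base field as used here, and the assumption that every filter is a proportion between 0 and 1 are all hypothetical.

```python
import json
from typing import Any, Dict, List, Optional

# Filter names proposed in this thread; NOT part of the official BrAPI spec.
PROPOSED_FILTERS = ("minMaf", "maxMaf", "minMissingData",
                    "maxMissingData", "minHeZ", "maxHeZ")


def build_allelematrix_search(variant_set_db_ids: List[str],
                              **filters: Optional[float]) -> Dict[str, Any]:
    """Build a hypothetical search body, adding only the filters that are set."""
    body: Dict[str, Any] = {"variantSetDbIds": variant_set_db_ids}
    for name, value in filters.items():
        if name not in PROPOSED_FILTERS:
            raise ValueError(f"unknown filter: {name}")
        if value is None:
            continue  # optional filter left unset, so it is omitted
        if not 0.0 <= value <= 1.0:
            # Assumption: all six filters are expressed as proportions.
            raise ValueError(f"{name} must be between 0 and 1")
        body[name] = value
    # Reject ranges whose lower bound exceeds the upper bound.
    for low, high in (("minMaf", "maxMaf"),
                      ("minMissingData", "maxMissingData"),
                      ("minHeZ", "maxHeZ")):
        if low in body and high in body and body[low] > body[high]:
            raise ValueError(f"{low} must not exceed {high}")
    return body


request_body = build_allelematrix_search(
    ["vs1"], minMaf=0.05, maxMissingData=0.2, maxHeZ=None)
payload = json.dumps(request_body)  # what a client would send in the request
```

Because every filter is optional, a server that does not implement them could keep accepting the unfiltered request unchanged.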

GuilhemSempere commented 3 months ago

I don't see major problems implementing this in Gigwa's allelematrix call, but indeed we'd rather have this functionality well defined and made official before starting something. I think @patrick-koenig you are referring to pre-computed annotation fields based on the overall list of callSets, whereas @khaled-alshamaa is mentioning realtime calculation based on a selected list of material.
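The distinction between the two readings matters because the same variant can yield different statistics depending on which callSets are counted. The toy example below assumes the same hypothetical encoding of diploid alternate-allele counts; the callSet identifiers and values are invented for illustration.

```python
from typing import List, Optional


def maf(calls: List[Optional[int]]) -> Optional[float]:
    """Minor allele frequency over non-missing diploid alt-allele counts."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


# Invented calls for a single variant, keyed by a made-up callSet identifier.
variant_calls = {"cs1": 0, "cs2": 0, "cs3": 2, "cs4": 2, "cs5": 0, "cs6": 0}

# Reading 1: a pre-computed annotation over the overall list of callSets.
precomputed_maf = maf(list(variant_calls.values()))

# Reading 2: a real-time calculation restricted to the requested material.
selected = ["cs1", "cs2", "cs5", "cs6"]
realtime_maf = maf([variant_calls[cs] for cs in selected])
```

Here the variant has a MAF of one third over all six callSets but is monomorphic within the selected four, so a filter such as minMaf=0.05 would keep it under the first reading and drop it under the second.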