This sounds like a sensible plan.
Some considerations (which may or may not be relevant, and may or may not be a priority at the current stage but something to consider for future-proofing):
We use the variant index to get allele frequency information when importing studies from GWAS catalog, since not all studies include this information. The allele frequency is needed in fine-mapping, to determine the approximate Bayes Factors. It's also needed for colocalisation. Therefore, we currently drop all variants that don't have allele frequency data and aren't in the variant index. If we didn't have AF information for all variants, there are many GWAS that we wouldn't be able to import. (This doesn't affect some studies such as FinnGen and UK Biobank, where we always have AF info already.) https://github.com/opentargets/genetics-sumstat-data/blob/master/ingest/gwas_catalog/scripts/process.py#L125
Would it be feasible to have a comprehensive variant annotation table (e.g. all 800 M variants) that is used only when needed? For example, to impute missing allele frequencies when importing GWAS or other genetic studies. I can imagine that in other research scenarios it would be useful to be able to annotate any variant using e.g. BigQuery.
But for processing genetics portal data, you could subset to variants in the credible set / LD tables.
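If it helps the discussion, here is a rough PySpark sketch of that imputation idea; the table paths and column names (`variant_id`, `af`, `gnomad_af`) are hypothetical, not the actual ingest schema:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: study sumstats with a possibly-null allele frequency column,
# and a comprehensive (all ~800M variants) gnomAD annotation table.
sumstats = spark.read.parquet("gs://<bucket>/gwas_catalog_sumstats")
full_annotation = spark.read.parquet("gs://<bucket>/gnomad_full_annotation")

# Impute missing allele frequencies from the comprehensive table, instead of
# dropping variants that are absent from the (much smaller) variant index.
imputed = (
    sumstats
    .join(full_annotation.select("variant_id", "gnomad_af"), on="variant_id", how="left")
    .withColumn("af", f.coalesce(f.col("af"), f.col("gnomad_af")))
    .drop("gnomad_af")
)

# Variants still lacking AF after imputation would be dropped, as in the current process.
imputed = imputed.filter(f.col("af").isNotNull())
```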
I also question whether the PheWAS data should be subsetted even further. We have already reduced it to p < 0.005. But if we include only variants in credible sets, then you won't even be able to look up a variant with p = 1e-7 if it isn't in some study's genome-wide significant credible set.
Thanks for the pointer @Jeremy37. We were totally missing that dependency.
I agree, we can work out what the minimum set of information required is and provide this table to assist the fine-mapping and coloc computation. It shouldn't be a problem to have a parquet/BQ table available with the 800M variants. Alternatively, we could evaluate whether it is better to implement the step that fetches the relevant GnomAD AFs within the respective repos using Hail in GCP. Particularly if the VEP information is not required, none of these options should be a problem. @DSuveges keep this in mind.
Regarding the PheWAS, it's also something we were debating yesterday because there is a clear compromise there. Based on the plan described above, every PheWAS plot in the site will have at least 1 dot above the GWAS-significant threshold. It won't affect any of the other dots we currently present in the same plot if we keep using the same threshold (p < 0.005). But it won't be possible to explore the PheWAS plot for a variant without at least one GWAS-significant locus, even if the variant is close to the threshold in one or several studies. There is a separate debate about whether we should perform PheWAS analysis at all.
There is a performance benefit in the fact that we won't need to show as many PheWAS plots. In this case, the sumstats filtered table only goes down from 500M to 250M records. This confirms that the majority of associated phenotypes (significant or non-significant) are linked to a relatively small number of variants. By dropping 93% of the variants we only lose 50% of the associations at p < 0.005. Although it is a performance gain, it's within the same order of magnitude, so not really something @DSuveges and I weighed heavily when elaborating the proposal. There are intermediate solutions, like including variants in the variant index if they have at least one association at p < 1e-5 (sketched below). But I'm not sure there are enough scientific arguments to justify such a decision as studies become more and more powered.
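For illustration only, a minimal PySpark sketch of that intermediate option; the path and column names (`variant_id`, `pval`) are hypothetical and not the actual sumstats schema:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: the sumstats already filtered to p < 0.005.
sumstats = spark.read.parquet("gs://<bucket>/sumstats_filtered")

# Intermediate option: keep every variant that has at least one association
# at p < 1e-5 in any study, instead of only variants present in credible sets.
variants_to_keep = (
    sumstats
    .filter(f.col("pval") < 1e-5)
    .select("variant_id")
    .distinct()
)

phewas_subset = sumstats.join(variants_to_keep, on="variant_id", how="inner")
```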
I think it's important to raise and debate these points. This is a rescue plan including some sacrifices with the hope of making much bigger gains. It's good to red team this proposal and question all elements of it.
Thanks @Jeremy37 for pointing out the application of the variant index in that early step! In theory, it is not a huge issue to fetch the allele information directly from gnomAD whenever it is needed. I have explored the performance of generating the variant annotation table for the 5.3M variants we have (as described by @d0choa above). Joining the 5.3M with the 800M gnomAD table, plus doing all the formatting and lift-over, took ~1.5 hours on a small 2-node cluster (saved here). Hail is pretty powerful. However, I'm not sure how the process is structured; doing this for 1000s of studies might be complicated.
On a side note, only 51,377 (0.95%) of the 5.3M variants could not be found in gnomAD 3.
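For reference, a rough Hail sketch of the kind of join and lift-over described above; the gnomAD release path, field access and chain file are assumptions based on public Hail/gnomAD resources, not the exact script saved above:

```python
import hail as hl

hl.init()

# Assumed public gnomAD v3 sites table; the exact release/path may differ.
gnomad = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht"
)

# Our ~5.3M portal variants, keyed by (locus, alleles) on GRCh38 (hypothetical path).
portal_variants = hl.read_table("gs://<bucket>/portal_variant_list.ht")

# Keep only the annotations we need; the overall AF is the first element of `freq`.
gnomad_slim = gnomad.select(af=gnomad.freq[0].AF, rsid=gnomad.rsid)
annotated = portal_variants.annotate(**gnomad_slim[portal_variants.key])

# Lift over to GRCh37 for the parts of the pipeline that still expect b37 coordinates.
rg38 = hl.get_reference("GRCh38")
rg37 = hl.get_reference("GRCh37")
rg38.add_liftover("gs://hail-common/references/grch38_to_grch37.over.chain.gz", rg37)
annotated = annotated.annotate(locus_grch37=hl.liftover(annotated.locus, "GRCh37"))

annotated.write("gs://<bucket>/variant_annotation_5m.ht", overwrite=True)
```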
> As mentioned above, we are at the moment not providing a good atlas for variants
Would that not be a valuable goal for the roadmap, since it has a lot of downstream use cases and the union of gnomAD, UKB, ClinVar, etc. variants is not on its own large enough in number to be unmanageable yet (<1B)? I'm thinking of projects like MARRVEL [1] and VarSome [2], which sort of do that, but it seems like OT is in a great position to create something better, with comparable utility to the target index you all are already maintaining (as an outer join from multiple sources instead of a more limited, use-case-specific subset).
> Would it be feasible to have a comprehensive variant annotation table (e.g. all 800 M variants) that is used only when needed?
This might also be valuable in supporting future experiments to define what indirect disease association means (which @d0choa mentioned limiting support to). I can see that being hard to iterate on if there's always something like a multi-hour join via Hail required first. My two cents, fwiw.
> We were totally missing that dependency.
Do you think association with disease from co-occurrence of literature mentions is another source that might substantially shrink the gap between disease associated variants and canonical variants? I'm thinking of something like LitVar, and wondering if you all have considered anything like it given that you are tagging genes, diseases, and drugs in literature (variants seem a natural extension).
We are already addressing all this in https://github.com/opentargets/genetics_etl_python. Please keep in touch if you want to know more
tl;dr The current variant index has several shortcomings that affect our production infrastructure, our ability to update it to a newer GnomAD version and our ability to consider rare variants within the Genetics Portal. Here, we present a proposed plan to enhance the way we represent more complex and diverse disease-related variants in a scalable way. The plan includes some compromises that we will need to consider moving forward.
The current variant index is a process defined in the genetics-variant-annotation repository. At the moment, we pull all variants from GnomAD 2 and apply a MAF > 0.1% filter, resulting in a dataset of 72M+ "common" variants. We have a dataset usually referred to as `variant-annotation`, which includes richer information such as all variant consequences (VEP), and a post-processed dataset generally referred to as `variant-index` that stores less metadata for the purpose of the web application.
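For context, the current selection is conceptually just a frequency filter over the gnomAD sites table; a hedged Hail sketch (the gnomAD 2.1.1 path and the approximation of MAF by the overall alternate AF are assumptions, not the exact genetics-variant-annotation code):

```python
import hail as hl

hl.init()

# Assumed gnomAD 2.1.1 genomes sites table; the real pipeline may combine releases.
gnomad2 = hl.read_table(
    "gs://gcp-public-data--gnomad/release/2.1.1/ht/genomes/gnomad.genomes.r2.1.1.sites.ht"
)

# Keep "common" variants: overall alternate allele frequency above 0.1%
# (approximating the MAF > 0.1% criterion described above).
common = gnomad2.filter(gnomad2.freq[0].AF > 0.001)

# This is roughly the 72M+ variant universe behind variant-annotation / variant-index.
common.write("gs://<bucket>/variant_annotation_input.ht", overwrite=True)
```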
Current problems

Some problems associated with the current setup are:

- Even though the `variant-annotation` and `variant-index` datasets contain 72M+ variants, their size is still manageable. However, there is an explosion of data in the `v2g` and `v2g2d` datasets simply because we have so many variants.
- `v2g` is built so that it calculates all variant-to-gene information as long as we have some features describing the relationship. Because one of those features is VEP, this implies that we effectively store every variant-to-gene pair as long as the variant and gene are within a window. This results in 1,031,401,898 records. The `v2g2d` dataset is then a further explosion of `v2g`.
- Overall, these datasets demand a large amount of computation, egress, ingestion time, etc. The latest ingestion process in ES/CH took 17h, which is a real burden on any process improvement (https://github.com/opentargets/platform/issues/1961). There is also the financial cost of the large machines required to deal with this data.
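To make the "every pair within a window" point concrete, here is a schematic PySpark version of that step; inputs, column names and the window size are illustrative and not the actual v2g code:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Illustrative inputs: ~70M variants and ~20k genes with TSS coordinates.
variants = spark.read.parquet("gs://<bucket>/variant_index")  # chrom, pos, variant_id
genes = spark.read.parquet("gs://<bucket>/gene_index")        # chrom, tss, gene_id

window = 500_000  # hypothetical distance window

# Every variant is paired with every gene whose TSS falls within the window, so the
# output scales as (number of variants) x (genes per window), which is how the
# current ~1B-row v2g dataset arises.
v2g_pairs = variants.join(
    genes,
    (variants.chrom == genes.chrom) & (f.abs(variants.pos - genes.tss) <= window),
    "inner",
)
```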
Proposed plan

Next, we describe the series of actions we are planning to execute in order to tackle the current problems.
Universe of variants

This will be the main change. We are planning to reduce the set of variants in the `variant-annotation` / `variant-index` to only those variants with disease-associated information. The hope is that by better scoping the purpose of the genetics portal we will streamline the process by at least one order of magnitude.

This is an important change that implies that variants without any link to a disease/trait will not be available for search and will not have a corresponding page in the genetics portal. The understanding is that the Genetics Portal does not become a catalogue of canonical variants, but instead captures variants that are - directly or indirectly - associated with disease, independently of their allele frequency. As mentioned above, we are at the moment not providing a good atlas for variants, since we only have 70M out of the 800M variants that GnomAD currently catalogues.
The inclusion criterion we would apply is to include all lead or tag variants in the genetics portal. This is a composite of several tables covering:
The combined set of variants across these studies is calculated here and accounts for 5,388,925 unique variants (far fewer than the current 70M+).
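Schematically, this universe is just the distinct union of lead and tag variants across those tables; a PySpark sketch with assumed table and column names (`lead_variant_id`, `tag_variant_id`):

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical V2D inputs (top loci, credible sets / LD expansion, ...).
toploci = spark.read.parquet("gs://<bucket>/toploci")
credible_sets = spark.read.parquet("gs://<bucket>/credible_sets")

variant_universe = (
    toploci.select(f.col("lead_variant_id").alias("variant_id"))
    .union(credible_sets.select(f.col("lead_variant_id").alias("variant_id")))
    .union(credible_sets.select(f.col("tag_variant_id").alias("variant_id")))
    .distinct()
)

variant_universe.count()  # ~5.4M in the analysis referenced above
```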
After consideration, we decided not to include in the variant index those variants that are captured only by some of the V2G features (e.g. eQTL, pQTL, PCHi-C, etc.). We understood that GWAS-significant associations will still be accessible in the studies included in the Portal, and that the V2G data (e.g. eQTLs included at a lower threshold) is there to explain the V2G relationship for the relevant V2D signals.
Although we haven't quantified the impact this change will have on all datasets and processes, we expect that the reduction of the variant index from 70M to 5M will give us more than an order of magnitude of gains in the `v2g` dataset (currently 1B), the `v2g2d` dataset (currently 1.5B) and the sumstats filtered dataset (currently 600M).

For the first stage of implementation, we are not considering the inclusion of any new variants outside the scope of the current Genetics Portal (e.g. rare variants), but we expect this will be the case in the near future.
Annotations
In order to annotate the universe of variants, we will (left) join our variants to GnomAD and bring in the same metadata we currently have. Left-joining implies that we could start having variants that are not present in GnomAD and therefore lack some of the metadata. This can potentially affect downstream processes (e.g. web application, L2G), which we will need to review.
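A minimal sketch of that left join under hypothetical column names; the point is only that variants absent from GnomAD survive the join with null metadata, which downstream consumers need to tolerate:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: the disease-associated variant universe and the gnomAD annotations.
variant_universe = spark.read.parquet("gs://<bucket>/variant_universe")
gnomad_annotation = spark.read.parquet("gs://<bucket>/gnomad_annotation")  # variant_id, af, rsid, vep, ...

# Left join: every portal variant is kept, gnomAD metadata becomes nullable.
annotated = variant_universe.join(gnomad_annotation, on="variant_id", how="left")

# Rows like these are the ones that could affect the web application and L2G features.
missing_metadata = annotated.filter(f.col("af").isNull())
```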
As another layer of annotation, we want to include extra information about the consequences of coding variants. We are in conversations with the OTAR2048 team to bring in UniProt positions, AA changes and - eventually - some extra annotation on the conservation or the consequence of the mutation in the protein. This could further assist downstream processes such as L2G.
Dependencies
There are currently some dependencies on the `variant-annotation` dataset in the V2D pipeline that we will need to resolve. At the moment, it's used to map variants to rsIDs.

For further iterations
There are other improvements that we have not yet discussed and that will probably require further scoping down the road. These include the inclusion of alternative variant identifiers (e.g. GA4GH) and structural variants.
@DSuveges and @JarrodBaker will be leading the implementation plan