This sounds like a sensible plan.
Some considerations (which may or may not be relevant, and may or may not be a priority at the current stage but something to consider for future-proofing):
We use the variant index to get allele frequency information when importing studies from GWAS catalog, since not all studies include this information. The allele frequency is needed in fine-mapping, to determine the approximate Bayes Factors. It's also needed for colocalisation. Therefore, we currently drop all variants that don't have allele frequency data and aren't in the variant index. If we didn't have AF information for all variants, there are many GWAS that we wouldn't be able to import. (This doesn't affect some studies such as FinnGen and UK Biobank, where we always have AF info already.) https://github.com/opentargets/genetics-sumstat-data/blob/master/ingest/gwas_catalog/scripts/process.py#L125
Would it be feasible to have a comprehensive variant annotation table (e.g. all 800 M variants) that is used only when needed? For example, to impute missing allele frequencies when importing GWAS or other genetic studies. I can imagine that in other research scenarios it would be useful to be able to annotate any variant using e.g. BigQuery.
But for processing genetics portal data, you could subset to variants in the credible set / LD tables.
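If it helps the discussion, here is a rough PySpark sketch of that imputation idea; the table paths and column names (`variant_id`, `af`, `gnomad_af`) are hypothetical, not the actual ingest schema:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: study sumstats with a possibly-null allele frequency column,
# and a comprehensive (all ~800M variants) gnomAD annotation table.
sumstats = spark.read.parquet("gs://<bucket>/gwas_catalog_sumstats")
full_annotation = spark.read.parquet("gs://<bucket>/gnomad_full_annotation")

# Impute missing allele frequencies from the comprehensive table, instead of
# dropping variants that are absent from the (much smaller) variant index.
imputed = (
    sumstats
    .join(full_annotation.select("variant_id", "gnomad_af"), on="variant_id", how="left")
    .withColumn("af", f.coalesce(f.col("af"), f.col("gnomad_af")))
    .drop("gnomad_af")
)

# Variants still lacking AF after imputation would be dropped, as in the current process.
imputed = imputed.filter(f.col("af").isNotNull())
```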
I also question whether the PheWAS data should be subsetted even further. We have already reduced it to p < 0.005. But if we include only variants in credible sets, then you won't even be able to look up a variant with p = 1e-7 if it isn't in some study's genome-wide significant credible set.
Thanks for the pointer @Jeremy37. We were totally missing that dependency.
I agree, we can work out what the minimum set of information required is and provide this table to assist the fine-mapping and coloc computation. It shouldn't be a problem to have a parquet/BQ table available with the 800M variants. Alternatively, we could evaluate whether it is better to implement the step that fetches the relevant GnomAD AFs within the respective repos using Hail in GCP. Particularly if the VEP information is not required, none of these options should be a problem. @DSuveges keep this in mind.
Regarding the PheWAS, it's also something we were debating yesterday because there is a clear compromise there. Based on the plan described above, every PheWAS plot in the site will have at least 1 dot above the GWAS-significant threshold. It won't affect any of the other dots we currently present in the same plot if we keep using the same threshold (p < 0.005). But it won't be possible to explore the PheWAS plot for a variant without at least one GWAS-significant locus, even if the variant is close to the threshold in one or several studies. There is a separate debate about whether we should perform PheWAS analysis at all.
There is a performance benefit in the fact that we won't need to show as many PheWAS plots. In this case, the sumstats filtered table only goes down from 500M to 250M records. This confirms that the majority of associated phenotypes (significant or non-significant) are linked to a relatively small number of variants. By dropping 93% of the variants we only lose 50% of the associations at p < 0.005. Although it is a performance gain, it's within the same order of magnitude, so not really something @DSuveges and I weighed heavily when elaborating the proposal. There are intermediate solutions, like including variants in the variant index if they have at least one association at p < 1e-5 (sketched below). But I'm not sure there are enough scientific arguments to justify such a decision as studies become more and more powered.
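For illustration only, a minimal PySpark sketch of that intermediate option; the path and column names (`variant_id`, `pval`) are hypothetical and not the actual sumstats schema:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: the sumstats already filtered to p < 0.005.
sumstats = spark.read.parquet("gs://<bucket>/sumstats_filtered")

# Intermediate option: keep every variant that has at least one association
# at p < 1e-5 in any study, instead of only variants present in credible sets.
variants_to_keep = (
    sumstats
    .filter(f.col("pval") < 1e-5)
    .select("variant_id")
    .distinct()
)

phewas_subset = sumstats.join(variants_to_keep, on="variant_id", how="inner")
```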
I think it's important to raise and debate these points. This is a rescue plan including some sacrifices with the hope of making much bigger gains. It's good to red team this proposal and question all elements of it.
Thanks @Jeremy37 for pointing out the application of the variant index in that early step! In theory, it is not a huge issue to fetch the allele information directly from gnomAD whenever it is needed. I have explored the performance of generating the variant annotation table for the 5.3M variants we have (as described by @d0choa above). Joining the 5.3M with the 800M gnomAD table, plus doing all the formatting and lift-over, took ~1.5 hours on a small 2-node cluster (saved here). Hail is pretty powerful. However, I'm not sure how the process is structured; doing this for 1000s of studies might be complicated.
On a side note, only 51,377 (0.95%) of the 5.3M variants could not be found in gnomAD 3.
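For reference, a rough Hail sketch of the kind of join and lift-over described above; the gnomAD release path, field access and chain file are assumptions based on public Hail/gnomAD resources, not the exact script saved above:

```python
import hail as hl

hl.init()

# Assumed public gnomAD v3 sites table; the exact release/path may differ.
gnomad = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht"
)

# Our ~5.3M portal variants, keyed by (locus, alleles) on GRCh38 (hypothetical path).
portal_variants = hl.read_table("gs://<bucket>/portal_variant_list.ht")

# Keep only the annotations we need; the overall AF is the first element of `freq`.
gnomad_slim = gnomad.select(af=gnomad.freq[0].AF, rsid=gnomad.rsid)
annotated = portal_variants.annotate(**gnomad_slim[portal_variants.key])

# Lift over to GRCh37 for the parts of the pipeline that still expect b37 coordinates.
rg38 = hl.get_reference("GRCh38")
rg37 = hl.get_reference("GRCh37")
rg38.add_liftover("gs://hail-common/references/grch38_to_grch37.over.chain.gz", rg37)
annotated = annotated.annotate(locus_grch37=hl.liftover(annotated.locus, "GRCh37"))

annotated.write("gs://<bucket>/variant_annotation_5m.ht", overwrite=True)
```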
> As mentioned above, we are at the moment not providing a good atlas for variants
Would that not be a valuable goal for the roadmap, since it has a lot of downstream use cases and the union of gnomAD, UKB, ClinVar, etc. variants is not on its own large enough in number to be unmanageable yet (<1B)? I'm thinking of projects like MARRVEL [1] and VarSome [2], which sort of do that, but it seems like OT is in a great position to create something better, with comparable utility to the target index you all are already maintaining (as an outer join from multiple sources instead of a more limited, use-case-specific subset).
> Would it be feasible to have a comprehensive variant annotation table (e.g. all 800 M variants) that is used only when needed?
This might also be valuable in supporting future experiments to define what indirect disease association means (which @d0choa mentioned limiting support to). I can see that being hard to iterate on if there's always something like a multi-hour join via Hail required first. My two cents, fwiw.
> We were totally missing that dependency.
Do you think association with disease from co-occurrence of literature mentions is another source that might substantially shrink the gap between disease associated variants and canonical variants? I'm thinking of something like LitVar, and wondering if you all have considered anything like it given that you are tagging genes, diseases, and drugs in literature (variants seem a natural extension).
We are already addressing all this in https://github.com/opentargets/genetics_etl_python. Please keep in touch if you want to know more
tl;dr The current variant index has several shortcomings that affect our production infrastructure, our ability to update it to a newer GnomAD version and our ability to consider rare variants within the Genetics Portal. Here, we present a proposed plan to enhance the way we represent more complex and diverse disease-related variants in a scalable way. The plan includes some compromises that we will need to consider moving forward.
The current variant index is a process defined in the genetics-variant-annotation repository. At the moment, we pull all variants from GnomAD 2 and apply a MAF > 0.1% filter, resulting in a dataset of 72M+ "common" variants. We have a dataset usually referred to as `variant-annotation`, which includes richer information such as all variant consequences (VEP), and a post-processed dataset generally referred to as `variant-index` that stores less metadata for the purpose of the web application.
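For context, the current selection is conceptually just a frequency filter over the gnomAD sites table; a hedged Hail sketch (the gnomAD 2.1.1 path and the approximation of MAF by the overall alternate AF are assumptions, not the exact genetics-variant-annotation code):

```python
import hail as hl

hl.init()

# Assumed gnomAD 2.1.1 genomes sites table; the real pipeline may combine releases.
gnomad2 = hl.read_table(
    "gs://gcp-public-data--gnomad/release/2.1.1/ht/genomes/gnomad.genomes.r2.1.1.sites.ht"
)

# Keep "common" variants: overall alternate allele frequency above 0.1%
# (approximating the MAF > 0.1% criterion described above).
common = gnomad2.filter(gnomad2.freq[0].AF > 0.001)

# This is roughly the 72M+ variant universe behind variant-annotation / variant-index.
common.write("gs://<bucket>/variant_annotation_input.ht", overwrite=True)
```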
Current problems

Some problems associated with the current setup are:

- Even though the `variant-annotation` and `variant-index` datasets contain 72M+ variants, their size is still manageable. However, there is an explosion of data in the `v2g` and `v2g2d` datasets simply because we have so many variants.
- `v2g` is built so that it calculates all variant-to-gene information as long as we have some features describing the relationship. Because one of those features is VEP, this implies that we effectively store every variant-to-gene pair as long as the variant and gene are within a window. This results in 1,031,401,898 records. The `v2g2d` dataset is then a further explosion of `v2g`.
- Overall, these datasets demand a large amount of computation, egress, ingestion time, etc. The latest ingestion process in ES/CH took 17h, which is a real burden on any process improvement (https://github.com/opentargets/platform/issues/1961). There is also the financial cost of the large machines required to deal with this data.
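To make the "every pair within a window" point concrete, here is a schematic PySpark version of that step; inputs, column names and the window size are illustrative and not the actual v2g code:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Illustrative inputs: ~70M variants and ~20k genes with TSS coordinates.
variants = spark.read.parquet("gs://<bucket>/variant_index")  # chrom, pos, variant_id
genes = spark.read.parquet("gs://<bucket>/gene_index")        # chrom, tss, gene_id

window = 500_000  # hypothetical distance window

# Every variant is paired with every gene whose TSS falls within the window, so the
# output scales as (number of variants) x (genes per window), which is how the
# current ~1B-row v2g dataset arises.
v2g_pairs = variants.join(
    genes,
    (variants.chrom == genes.chrom) & (f.abs(variants.pos - genes.tss) <= window),
    "inner",
)
```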
Proposed plan

Next, we describe the series of actions we are planning to execute in order to tackle the current problems.
Universe of variants

This will be the main change. We are planning to reduce the set of variants in the `variant-annotation` / `variant-index` to only those variants with disease-associated information. The hope is that by better scoping the purpose of the genetics portal we will streamline the process by at least one order of magnitude.

This is an important change that implies that variants without any link to a disease/trait will not be available for search and will not have a corresponding page in the genetics portal. The understanding is that the Genetics Portal does not become a catalogue of canonical variants, but instead captures variants that are - directly or indirectly - associated with disease, independently of their allele frequency. As mentioned above, we are at the moment not providing a good atlas for variants, since we only have 70M out of the 800M variants that GnomAD currently catalogues.
The inclusion criterion we would apply is to include all lead or tag variants in the genetics portal. This is a composite of several tables covering:
The combined set of variants across these studies is calculated here and accounts for 5,388,925 unique variants (far fewer than the current 70M+).
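Schematically, this universe is just the distinct union of lead and tag variants across those tables; a PySpark sketch with assumed table and column names (`lead_variant_id`, `tag_variant_id`):

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical V2D inputs (top loci, credible sets / LD expansion, ...).
toploci = spark.read.parquet("gs://<bucket>/toploci")
credible_sets = spark.read.parquet("gs://<bucket>/credible_sets")

variant_universe = (
    toploci.select(f.col("lead_variant_id").alias("variant_id"))
    .union(credible_sets.select(f.col("lead_variant_id").alias("variant_id")))
    .union(credible_sets.select(f.col("tag_variant_id").alias("variant_id")))
    .distinct()
)

variant_universe.count()  # ~5.4M in the analysis referenced above
```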
After consideration, we decided not to include in the variant index those variants that are captured only by some of the V2G features (e.g. eQTL, pQTL, PCHi-C, etc.). We understood that GWAS-significant associations will still be accessible in the studies included in the Portal, and that the V2G data (e.g. eQTLs included at a lower threshold) is there to explain the V2G relationship for the relevant V2D signals.
Although we haven't quantified the impact this change will have on all datasets and processes, we expect that the reduction of the variant index from 70M to 5M will give us more than an order of magnitude of gains in the `v2g` dataset (currently 1B), the `v2g2d` dataset (currently 1.5B) and the sumstats filtered dataset (currently 600M).

For the first stage of implementation, we are not considering the inclusion of any new variants outside the scope of the current Genetics Portal (e.g. rare variants), but we expect this will be the case in the near future.
Annotations
In order to annotate the universe of variants, we will (left) join our variants to GnomAD and bring in the same metadata we currently have. Left-joining implies that we could start having variants that are not present in GnomAD and therefore lack some of the metadata. This can potentially affect downstream processes (e.g. web application, L2G), which we will need to review.
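A minimal sketch of that left join under hypothetical column names; the point is only that variants absent from GnomAD survive the join with null metadata, which downstream consumers need to tolerate:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: the disease-associated variant universe and the gnomAD annotations.
variant_universe = spark.read.parquet("gs://<bucket>/variant_universe")
gnomad_annotation = spark.read.parquet("gs://<bucket>/gnomad_annotation")  # variant_id, af, rsid, vep, ...

# Left join: every portal variant is kept, gnomAD metadata becomes nullable.
annotated = variant_universe.join(gnomad_annotation, on="variant_id", how="left")

# Rows like these are the ones that could affect the web application and L2G features.
missing_metadata = annotated.filter(f.col("af").isNull())
```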
As another layer of annotation, we want to include extra information about the consequences of coding variants. We are in conversations with the OTAR2048 team to bring in UniProt positions, AA changes and - eventually - some extra annotation on the conservation or the consequence of the mutation in the protein. This could further assist downstream processes such as L2G.
Dependencies
There are currently some dependencies on the `variant-annotation` dataset in the V2D pipeline that we will need to resolve. At the moment, it's used to map variants to rsIDs.

For further iterations
There are other improvements that we have not yet discussed and that will probably require further scoping down the road. These include the inclusion of alternative variant identifiers (e.g. GA4GH) and structural variants.
@DSuveges and @JarrodBaker will be leading the implementation plan