tripal / tripal_genetic

Extends core Tripal with support focused on genotypic data, genetic markers, sequence variants, etc.
GNU General Public License v2.0

Genotypes API: genotype_call table management #1

Open laceysanderson opened 1 year ago

laceysanderson commented 1 year ago

There was a discussion in Slack that a general API that could handle management of materialized views related to the genotype_call table would be useful.

In the discussion I suggested that an API to manage the following could reduce duplicated effort and allow us all to benefit from optimizations for large data sets.

In order to support the very different needs of different tools using the genotype_call table, this API would provide a means for tools to describe the materialized view name, columns, composition, queries, indices, optimization approaches, etc. to be used by the API. It is understood that a single best-practice materialized view for this type of data is not possible: with large datasets it is important to cater to the specific composition of the data and the needs of the tool in order to be performant.

This API is NOT trying to force us all to use the same materialized views or even optimization approaches. Rather it is trying to provide all tools with a set of optimization approaches which can be selected from to support each tool optimally.
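
To make that concrete, the kind of description a tool might hand to such an API could boil down to something like the sketch below. Every table, column and index name here is hypothetical; the point is only to show the pieces the API would need to know about (table definition, populating query, indices), not to prescribe a layout.

    -- Hypothetical sketch only: names and columns are placeholders, and Chado
    -- is assumed to live in the "chado" PostgreSQL schema.
    CREATE TABLE chado.mview_example_genotype_summary (
      variant_id   bigint NOT NULL,  -- feature_id of the sequence variant
      marker_id    bigint NOT NULL,  -- feature_id of the marker assaying it
      stock_id     bigint NOT NULL,  -- sample the call was made on
      project_id   bigint NOT NULL,  -- project the call belongs to
      allele_call  text              -- genotype.description for the call
    );

    -- Indices the tool would ask the API to (re)create after each sync.
    CREATE INDEX mview_example_gs_variant_idx
      ON chado.mview_example_genotype_summary (variant_id);
    CREATE INDEX mview_example_gs_stock_idx
      ON chado.mview_example_genotype_summary (stock_id, project_id);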

laceysanderson commented 1 year ago

Can each group please post a detailed comment on this issue using the following template:

## Group/Tool information

> Please provide details about the specific group or tools that use the approach 
> that will be described below. This should include the name(s) of the group, tool, 
> developers and any known implementing Tripal sites. If available, please include 
> links to the tool website, documentation and/or git repository.

## Materialized Views

> Please fill out the following for each materialized view you use. If you use 
> partitions, you can use regular expressions or tokens which allow you to describe 
> the materialized view template used for a set of partitions.

### Materialized View Name:

#### Columns (include name, data type, brief description):
 - 

#### Indices (include columns covered):
- 

#### Query / Approach to Populate:

## Partitioning

> A description of any current partitioning you do and/or of any partitioning you are 
> considering or think may be helpful for your data sets.

## Optimizations

> A description of things you have done or are trying to do to optimize the 
> materialized view sync process and/or the querying process in general.

## Scripts / Features / General discussion

> This is a free text section for you to provide any thoughts, talking points, feature 
> needs/wants, scripts, concerns, etc.

laceysanderson commented 1 year ago

## Group/Tool information

> Please provide details about the specific group or tools that use the approach that will be described below. This should include the name(s) of the group, tool, developers and any known implementing Tripal sites. If available, please include links to the tool website, documentation and/or git repository.

- Group: University of Saskatchewan, Pulse Bioinformatics
- Tool(s): ND Genotypes (Genotype Matrix + Fields), Genotypes Loader
- Developers: Lacey-Anne Sanderson, Carolyn Caron
- Tripal Sites: KnowPulse

## Materialized Views

> Please fill out the following for each materialized view you use. If you use partitions, you can use regular expressions or tokens which allow you to describe the materialized view template used for a set of partitions.

### Materialized View Name: mview_ndg_[genus]_variants

#### Columns (include name, data type, brief description):

#### Indices (include columns covered):

#### Query / Approach to Populate:

Data is broken into 50 million record chunks and copied into a file using the following query. A Linux command is then used on the file to collapse it to unique variants; this is done because it is faster than ordering and grouping in PostgreSQL. The collapsed file is then copied into the mview table. Before the sync, the mview is truncated and its indices dropped; after the sync, the indices are re-created. (A rough sketch of this flow follows the query below.)

      SELECT
        call.variant_id,
        v.name as variant_name,
        (SELECT value FROM chado.featureprop WHERE feature_id=v.feature_id AND type_id=:variantprop_type) as variant_type,
        loc.srcfeature_id as srcfeature_id,
        b.name as srcfeature_name,
        loc.fmin,
        loc.fmax,
        json_build_object('strand', loc.strand, 'featureloc_id', loc.featureloc_id) as meta_data
      FROM {$ndg_calls_name} call
        LEFT JOIN {feature} v ON v.feature_id=call.variant_id
        LEFT JOIN {featureloc} loc ON loc.feature_id=call.variant_id
        LEFT JOIN {feature} b ON b.feature_id=loc.srcfeature_id
      WHERE call.variant_id BETWEEN :min AND :max
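
For anyone not familiar with the copy-out / collapse / copy-in trick referenced above, a rough sketch of the flow follows. The file path, chunk bounds, mview table name and the exact collapse command are all placeholders here, not the real loader code (the real loader chunks on variant_id and selects the full column list shown above).

    -- Sketch only: server-side COPY is shown for brevity; psql's \copy works
    -- the same way from the client side without superuser file access.

    -- 1. Dump one ~50 million record chunk of the query above to a flat file.
    COPY (
      SELECT call.variant_id, v.name AS variant_name
      FROM chado.genotype_call call
        LEFT JOIN chado.feature v ON v.feature_id = call.variant_id
      WHERE call.variant_id BETWEEN 1 AND 50000000
    ) TO '/tmp/variants_chunk_1.tsv';

    -- 2. Collapse the file to unique variants outside PostgreSQL, e.g.
    --    sort -u /tmp/variants_chunk_1.tsv > /tmp/variants_chunk_1.uniq.tsv
    --    which is faster than ORDER BY / GROUP BY at this scale.

    -- 3. Load the collapsed file into the (already truncated) mview table.
    COPY chado.mview_ndg_example_variants (variant_id, variant_name)
      FROM '/tmp/variants_chunk_1.uniq.tsv';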

### Materialized View Name: mview_ndg_[genus]_calls

#### Columns (include name, data type, brief description):

#### Indices (include columns covered):

#### Query / Approach to Populate:

We chunk the dataset into hopefully reasonable pieces for updating the materialized view. However, we would need to avoid sorting first since that has a large performance hit. Our solution is to break the data set up by the allele/genotype of each call. While this makes no sense from a biological standpoint, it is at least something we can assume is present (since calls are genotypes ;-) ). This approach will suffer if there are a large number of alleles (i.e. MNPs or indels), especially if the overall dataset is small. That said, it is well suited to SNP data, which is the bulk of our data currently.

We use a similar approach of copying to a file and then back in for performance reasons; there is no processing of the resulting file in this case. We also truncate and drop indices to start, then re-create the indices after (a sketch of that pattern follows the query below).

    SELECT
      gc.variant_id,
      gc.marker_id,
      m.name as marker_name,
      mt.value as marker_type,
      gc.stock_id,
      s.name as stock_name,
      g.stock_id as germplasm_id,
      g.name as germplasm_name,
      gc.project_id,
      gc.genotype_id,
      a.description as allele_call,
      gc.meta_data
    FROM {genotype_call} gc
      LEFT JOIN {feature} m ON m.feature_id=gc.marker_id
      LEFT JOIN {featureprop} mt ON mt.feature_id=m.feature_id AND mt.type_id=:markerprop_type
      LEFT JOIN {stock} s ON s.stock_id=gc.stock_id
      LEFT JOIN {organism} o ON o.organism_id=s.organism_id
      LEFT JOIN {stock}_relationship sr ON sr.".$sample_pos."=s.stock_id AND sr.type_id=:stockrel_type
      LEFT JOIN {stock} g ON g.stock_id=sr.".$germ_pos."
      LEFT JOIN {genotype} a ON a.genotype_id=gc.genotype_id
    WHERE gc.genotype_id=:id AND LOWER(o.genus)=LOWER(:partition)
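
For completeness, a sketch of the truncate / drop-index / re-index pattern mentioned above; the table and index names are placeholders and the column list is trimmed.

    -- Before the sync: empty the mview and drop its indices so bulk COPY is fast.
    TRUNCATE chado.mview_ndg_example_calls;
    DROP INDEX IF EXISTS chado.mview_ndg_example_calls_variant_idx;

    -- ... bulk COPY the chunked query results into the table here ...

    -- After the sync: rebuild the indices in one pass and refresh planner stats.
    CREATE INDEX mview_ndg_example_calls_variant_idx
      ON chado.mview_ndg_example_calls (variant_id);
    ANALYZE chado.mview_ndg_example_calls;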

## Partitioning

> A description of any current partitioning you do and/or of any partitioning you are considering or think may be helpful for your data sets.

We currently partition by genus but are considering restricting it further to species and/or genome assembly :thinking:. We're not sure on this front :-)
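
If we do end up splitting further than genus, one option worth weighing (purely a sketch with placeholder names, and not something nd_genotypes does today) is PostgreSQL declarative LIST partitioning keyed on organism, rather than maintaining separate per-genus tables by hand.

    -- Sketch only: a single logical mview with one partition per species.
    CREATE TABLE chado.mview_example_calls (
      organism_id  bigint NOT NULL,
      variant_id   bigint NOT NULL,
      stock_id     bigint NOT NULL,
      allele_call  text
    ) PARTITION BY LIST (organism_id);

    -- One partition per species (1 is a placeholder organism_id).
    CREATE TABLE chado.mview_example_calls_species1
      PARTITION OF chado.mview_example_calls FOR VALUES IN (1);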

## Optimizations

> A description of things you have done or are trying to do to optimize the materialized view sync process and/or the querying process in general.

## Scripts / Features / General discussion

> This is a free text section for you to provide any thoughts, talking points, feature needs/wants, scripts, concerns, etc.

We already have this well handled by our module nd_genotypes but are happy to expand the implementation during the Tripal 4 upgrade to help other groups as well. We're also interested to see what approaches other groups have taken to optimize retrieval.