tripal / tripal_genetic

Extends core Tripal with support focused on genotypic data, genetic markers, sequence variants, etc.
GNU General Public License v2.0

Genotypes API: genotype_call table management #1

Open laceysanderson opened 1 year ago

laceysanderson commented 1 year ago

There was a discussion in Slack that a general API that could handle management of materialized views related to the genotype_call table would be useful.

In the discussion I suggested that an API to manage the following could reduce duplicated effort and allow us all to benefit from optimizations for large data sets.

In order to support the very different needs of different tools using the genotype_call table, this API would provide a means for tools to describe the materialized view name, columns, composition, queries, indices, optimization approaches, etc. to be used by the API. It is understood that a single best-practice materialized view for this type of data is not possible: with large datasets it is important to cater to the specific composition of the data and the needs of the tool in order to be performant.

This API is NOT trying to force us all to use the same materialized views or even optimization approaches. Rather it is trying to provide all tools with a set of optimization approaches which can be selected from to support each tool optimally.
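
To make that concrete, the kind of description a tool might hand to such an API could boil down to something like the sketch below. Every table, column and index name here is hypothetical; the point is only to show the pieces the API would need to know about (table definition, populating query, indices), not to prescribe a layout.

    -- Hypothetical sketch only: names and columns are placeholders, and Chado
    -- is assumed to live in the "chado" PostgreSQL schema.
    CREATE TABLE chado.mview_example_genotype_summary (
      variant_id   bigint NOT NULL,  -- feature_id of the sequence variant
      marker_id    bigint NOT NULL,  -- feature_id of the marker assaying it
      stock_id     bigint NOT NULL,  -- sample the call was made on
      project_id   bigint NOT NULL,  -- project the call belongs to
      allele_call  text              -- genotype.description for the call
    );

    -- Indices the tool would ask the API to (re)create after each sync.
    CREATE INDEX mview_example_gs_variant_idx
      ON chado.mview_example_genotype_summary (variant_id);
    CREATE INDEX mview_example_gs_stock_idx
      ON chado.mview_example_genotype_summary (stock_id, project_id);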

laceysanderson commented 1 year ago

Can each group please post a detailed comment on this issue using the following template:

## Group/Tool information

> Please provide details about the specific group or tools that use the approach 
> that will be described below. This should include the name(s) of the group, tool, 
> developers and any known implementing Tripal sites. If available, please include 
> links to the tool website, documentation and/or git repository.

## Materialized Views

> Please fill out the following for each materialized view you use. If you use 
> partitions, you can use regular expressions or tokens which allow you to describe 
> the materialized view template used for a set of partitions.

### Materialized View Name:

#### Columns (include name, data type, brief description):
 - 

#### Indices (include columns covered):
- 

#### Query / Approach to Populate:

## Partitioning

> A description of any current partitioning you do and/or of any partitioning you are 
> considering or think may be helpful for your data sets.

## Optimizations

> A description of things you have done or are trying to do to optimize the 
> materialized view sync process and/or the querying process in general.

## Scripts / Features / General discussion

> This is a free text section for you to provide any thoughts, talking points, feature 
> needs/wants, scripts, concerns, etc.

laceysanderson commented 1 year ago

## Group/Tool information

> Please provide details about the specific group or tools that use the approach that will be described below. This should include the name(s) of the group, tool, developers and any known implementing Tripal sites. If available, please include links to the tool website, documentation and/or git repository.

- Group: University of Saskatchewan, Pulse Bioinformatics
- Tool(s): ND Genotypes (Genotype Matrix + Fields), Genotypes Loader
- Developers: Lacey-Anne Sanderson, Carolyn Caron
- Tripal Sites: KnowPulse

## Materialized Views

> Please fill out the following for each materialized view you use. If you use partitions, you can use regular expressions or tokens which allow you to describe the materialized view template used for a set of partitions.

### Materialized View Name: mview_ndg_[genus]_variants

#### Columns (include name, data type, brief description):

#### Indices (include columns covered):

#### Query / Approach to Populate:

Data is broken into 50 million record chunks and copied into a file using the following query. A Linux command is then used on the file to collapse it to unique variants; this is done because it is faster than ordering and grouping in PostgreSQL. The collapsed file is then copied into the mview table. Before the sync, the mview is truncated and its indices dropped; after the sync, the indices are re-created. (A rough sketch of this flow follows the query below.)

      SELECT
        call.variant_id,
        v.name as variant_name,
        (SELECT value FROM chado.featureprop WHERE feature_id=v.feature_id AND type_id=:variantprop_type) as variant_type,
        loc.srcfeature_id as srcfeature_id,
        b.name as srcfeature_name,
        loc.fmin,
        loc.fmax,
        json_build_object('strand', loc.strand, 'featureloc_id', loc.featureloc_id) as meta_data
      FROM {$ndg_calls_name} call
        LEFT JOIN {feature} v ON v.feature_id=call.variant_id
        LEFT JOIN {featureloc} loc ON loc.feature_id=call.variant_id
        LEFT JOIN {feature} b ON b.feature_id=loc.srcfeature_id
      WHERE call.variant_id BETWEEN :min AND :max
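
For anyone not familiar with the copy-out / collapse / copy-in trick referenced above, a rough sketch of the flow follows. The file path, chunk bounds, mview table name and the exact collapse command are all placeholders here, not the real loader code (the real loader chunks on variant_id and selects the full column list shown above).

    -- Sketch only: server-side COPY is shown for brevity; psql's \copy works
    -- the same way from the client side without superuser file access.

    -- 1. Dump one ~50 million record chunk of the query above to a flat file.
    COPY (
      SELECT call.variant_id, v.name AS variant_name
      FROM chado.genotype_call call
        LEFT JOIN chado.feature v ON v.feature_id = call.variant_id
      WHERE call.variant_id BETWEEN 1 AND 50000000
    ) TO '/tmp/variants_chunk_1.tsv';

    -- 2. Collapse the file to unique variants outside PostgreSQL, e.g.
    --    sort -u /tmp/variants_chunk_1.tsv > /tmp/variants_chunk_1.uniq.tsv
    --    which is faster than ORDER BY / GROUP BY at this scale.

    -- 3. Load the collapsed file into the (already truncated) mview table.
    COPY chado.mview_ndg_example_variants (variant_id, variant_name)
      FROM '/tmp/variants_chunk_1.uniq.tsv';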

### Materialized View Name: mview_ndg_[genus]_calls

#### Columns (include name, data type, brief description):

#### Indices (include columns covered):

#### Query / Approach to Populate:

We chunk the dataset into hopefully reasonable pieces for updating the materialized view. However, we would need to avoid sorting first since that has a large performance hit. Our solution is to break the data set up by the allele/genotype of each call. While this makes no sense from a biological standpoint, it is at least something we can assume is present (since calls are genotypes ;-) ). This approach will suffer if there are a large number of alleles (i.e. MNPs or indels), especially if the overall dataset is small. That said, it is well suited to SNP data, which is the bulk of our data currently.

We use a similar approach of copying to a file and then back in for performance reasons; there is no processing of the resulting file in this case. We also truncate and drop indices to start, then re-create the indices after (a sketch of that pattern follows the query below).

    SELECT
      gc.variant_id,
      gc.marker_id,
      m.name as marker_name,
      mt.value as marker_type,
      gc.stock_id,
      s.name as stock_name,
      g.stock_id as germplasm_id,
      g.name as germplasm_name,
      gc.project_id,
      gc.genotype_id,
      a.description as allele_call,
      gc.meta_data
    FROM {genotype_call} gc
      LEFT JOIN {feature} m ON m.feature_id=gc.marker_id
      LEFT JOIN {featureprop} mt ON mt.feature_id=m.feature_id AND mt.type_id=:markerprop_type
      LEFT JOIN {stock} s ON s.stock_id=gc.stock_id
      LEFT JOIN {organism} o ON o.organism_id=s.organism_id
      LEFT JOIN {stock}_relationship sr ON sr.".$sample_pos."=s.stock_id AND sr.type_id=:stockrel_type
      LEFT JOIN {stock} g ON g.stock_id=sr.".$germ_pos."
      LEFT JOIN {genotype} a ON a.genotype_id=gc.genotype_id
    WHERE gc.genotype_id=:id AND LOWER(o.genus)=LOWER(:partition)
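
For completeness, a sketch of the truncate / drop-index / re-index pattern mentioned above; the table and index names are placeholders and the column list is trimmed.

    -- Before the sync: empty the mview and drop its indices so bulk COPY is fast.
    TRUNCATE chado.mview_ndg_example_calls;
    DROP INDEX IF EXISTS chado.mview_ndg_example_calls_variant_idx;

    -- ... bulk COPY the chunked query results into the table here ...

    -- After the sync: rebuild the indices in one pass and refresh planner stats.
    CREATE INDEX mview_ndg_example_calls_variant_idx
      ON chado.mview_ndg_example_calls (variant_id);
    ANALYZE chado.mview_ndg_example_calls;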

## Partitioning

> A description of any current partitioning you do and/or of any partitioning you are considering or think may be helpful for your data sets.

We currently partition by genus but are considering restricting it further to species and/or genome assembly :thinking:. We're not sure on this front :-)
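
If we do end up splitting further than genus, one option worth weighing (purely a sketch with placeholder names, and not something nd_genotypes does today) is PostgreSQL declarative LIST partitioning keyed on organism, rather than maintaining separate per-genus tables by hand.

    -- Sketch only: a single logical mview with one partition per species.
    CREATE TABLE chado.mview_example_calls (
      organism_id  bigint NOT NULL,
      variant_id   bigint NOT NULL,
      stock_id     bigint NOT NULL,
      allele_call  text
    ) PARTITION BY LIST (organism_id);

    -- One partition per species (1 is a placeholder organism_id).
    CREATE TABLE chado.mview_example_calls_species1
      PARTITION OF chado.mview_example_calls FOR VALUES IN (1);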

## Optimizations

> A description of things you have done or are trying to do to optimize the materialized view sync process and/or the querying process in general.

## Scripts / Features / General discussion

> This is a free text section for you to provide any thoughts, talking points, feature needs/wants, scripts, concerns, etc.

We already have this well handled by our module nd_genotypes but are happy to expand the implementation during the Tripal 4 upgrade to help other groups as well. We're also interested to see what approaches other groups have taken to optimize retrieval.