google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Issue with DeepTrio (1.1.0) representation of hemizygous X-chromosome variants in male #518

Closed. Phillip-a-richmond closed this issue 1 year ago.

Phillip-a-richmond commented 2 years ago

Hello,

I noticed an issue with your tool DeepTrio regarding the representation of hemizygous variants in the non-pseudoautosomal region (non-PAR) of the X chromosome. This may already be fixed in 1.3; if so, ignore this, but if not, here is what I observed.

Note: this is a simulated pathogenic variant from bamsurgeon, but the VCF representation is the focus of this report.

Let's start with an IGV snapshot of the variant: [IGV screenshot of the variant]

Now, I'll go into the representation from the DeepVariant --> GVCF --> GLnexus pipeline:

DeepVariant Pipeline:

# Load singularity
module load singularity
BIN_VERSION="1.1.0"

# Load env for bcftools
ANNOTATEVARIANTS_INSTALL=/mnt/common/WASSERMAN_SOFTWARE/AnnotateVariants/
source $ANNOTATEVARIANTS_INSTALL/opt/miniconda3/etc/profile.d/conda.sh
conda activate $ANNOTATEVARIANTS_INSTALL/opt/AnnotateVariantsEnvironment

# Pull the image; if you already have it, this step is skipped
singularity pull docker://google/deepvariant:"${BIN_VERSION}"

# Number of threads
NSLOTS=$SLURM_CPUS_PER_TASK

# Go to the submission directory (where the sbatch was entered)
cd $SLURM_SUBMIT_DIR
WORKING_DIR=/mnt/scratch/Public/TRAINING/GenomeAnalysisModule/StudentSpaces/Old/test/CaseAnalysis

## Set working space
mkdir -p $WORKING_DIR
cd $WORKING_DIR

#### GRCh38 #### 
echo "GRCh38 genome"
GENOME=GRCh38
FASTA_DIR=/mnt/common/DATABASES/REFERENCES/GRCh38/GENOME/
FASTA_FILE=GRCh38-lite.fa

SEQ_TYPE=WGS
BAM_DIR=$WORKING_DIR
OUTPUT_DIR=$WORKING_DIR # OUTPUT_DIR was not defined in the original script; assuming outputs go alongside the BAMs
FAMILY_ID=Case1
PROBAND_ID=Case1_proband
MOTHER_ID=Case1_mother
FATHER_ID=Case1_father
SIBLING_ID=.
PED=$FAMILY_ID.ped

MOTHER_PRESENT=true
FATHER_PRESENT=true
SIBLING_PRESENT=false

PROBAND_BAM=${PROBAND_ID}.sorted.bam
FATHER_BAM=${FATHER_ID}.sorted.bam
MOTHER_BAM=${MOTHER_ID}.sorted.bam
SIBLING_BAM=${SIBLING_ID}.sorted.bam

PROBAND_VCF=${PROBAND_ID}.vcf.gz
FATHER_VCF=${FATHER_ID}.vcf.gz
MOTHER_VCF=${MOTHER_ID}.vcf.gz
SIBLING_VCF=${SIBLING_ID}.vcf.gz

PROBAND_GVCF=${PROBAND_ID}.gvcf.gz
FATHER_GVCF=${FATHER_ID}.gvcf.gz
MOTHER_GVCF=${MOTHER_ID}.gvcf.gz
SIBLING_GVCF=${SIBLING_ID}.gvcf.gz

# Now use the booleans to decide whether to run DeepVariant over each sample
# We always run over the proband

## Proband 
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    -B "${BAM_DIR}":"/bamdir" \
    -B "${FASTA_DIR}":"/genomedir" \
    -B "${OUTPUT_DIR}":"/output" \
    docker://google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/genomedir/$FASTA_FILE" \
  --reads="/bamdir/$PROBAND_BAM" \
  --output_vcf="/output/$PROBAND_VCF" \
  --output_gvcf="/output/$PROBAND_GVCF" \
  --num_shards=$NSLOTS

# Sibling
if [ "$SIBLING_PRESENT" = true ] ; then

    singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
        -B "${BAM_DIR}":"/bamdir" \
        -B "${FASTA_DIR}":"/genomedir" \
        -B "${OUTPUT_DIR}":"/output" \
        docker://google/deepvariant:"${BIN_VERSION}" \
      /opt/deepvariant/bin/run_deepvariant \
      --model_type=WGS \
      --ref="/genomedir/$FASTA_FILE" \
      --reads="/bamdir/$SIBLING_BAM" \
      --output_vcf="/output/$SIBLING_VCF" \
      --output_gvcf="/output/$SIBLING_GVCF" \
      --num_shards=$NSLOTS 
fi

# Mother
if [ "$MOTHER_PRESENT" = true ] ; then

    singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
        -B "${BAM_DIR}":"/bamdir" \
        -B "${FASTA_DIR}":"/genomedir" \
        -B "${OUTPUT_DIR}":"/output" \
        docker://google/deepvariant:"${BIN_VERSION}" \
      /opt/deepvariant/bin/run_deepvariant \
      --model_type=WGS \
      --ref="/genomedir/$FASTA_FILE" \
      --reads="/bamdir/$MOTHER_BAM" \
      --output_vcf="/output/$MOTHER_VCF" \
      --output_gvcf="/output/$MOTHER_GVCF" \
      --num_shards=$NSLOTS 
fi

# Father
if [ "$FATHER_PRESENT" = true ] ; then

    singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
        -B "${BAM_DIR}":"/bamdir" \
        -B "${FASTA_DIR}":"/genomedir" \
        -B "${OUTPUT_DIR}":"/output" \
        docker://google/deepvariant:"${BIN_VERSION}" \
      /opt/deepvariant/bin/run_deepvariant \
      --model_type=WGS \
      --ref="/genomedir/$FASTA_FILE" \
      --reads="/bamdir/$FATHER_BAM" \
      --output_vcf="/output/$FATHER_VCF" \
      --output_gvcf="/output/$FATHER_GVCF" \
      --num_shards=$NSLOTS 
fi

#GLNexus
rm -rf $WORKING_DIR/${FAMILY_ID}_GLNexus.db/
/mnt/common/Precision/GLNexus/glnexus_cli -c DeepVariant${SEQ_TYPE} \
    -d $WORKING_DIR/${FAMILY_ID}_GLNexus.db/ \
        --threads $NSLOTS \
        $WORKING_DIR/*gvcf.gz \
        > ${FAMILY_ID}.glnexus.merged.bcf

bcftools view ${FAMILY_ID}.glnexus.merged.bcf | bgzip -c > ${FAMILY_ID}.merged.vcf.gz

Variant:

VCF header:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##GLnexusVersion=v1.3.1-0-g0e1c9c9
##GLnexusConfigName=DeepVariantWGS
##GLnexusConfigCRC32C=706912162
##GLnexusConfig={unifier_config: {drop_filtered: false, min_allele_copy_number: 1, min_AQ1: 10, min_AQ2: 10, min_GQ: 0, max_alleles_per_site: 32, monoallelic_sites_for_lost_alleles: true, preference: common}, genotyper_config: {revise_genotypes: true, min_assumed_allele_frequency: 9.99999975e-05, required_dp: 0, allow_partial_data: true, allele_dp_format: AD, ref_dp_format: MIN_DP, output_residuals: false, more_PL: true, squeeze: false, trim_uncalled_alleles: true, top_two_half_calls: false, output_format: BCF, liftover_fields: [{orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}, {orig_names: [AD], name: AD, description: "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}, {orig_names: [GQ], name: GQ, description: "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}, {orig_names: [PL], name: PL, description: "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype Likelihoods\">", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}]}}
##bcftools_viewVersion=1.10.2+htslib-1.10.2
##bcftools_viewCommand=view Case1.glnexus.merged.bcf; Date=Tue Feb 15 12:15:20 2022

Variant line

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  father  mother  proband
X   48684399    X_48684399_C_A  C   A   61  .   AF=0.5;AQ=61    GT:DP:AD:GQ:PL:RNC  0/0:22:22,0:50:0,75,749:..  0/1:37:19,18:54:54,0,64:..  1/1:18:0,18:52:61,55,0:..

DeepTrio

Now, with the DeepTrio -> GVCF -> GLNexus pipeline:

# Load singularity
module load singularity
BIN_VERSION="1.1.0"

# Load env for bcftools
ANNOTATEVARIANTS_INSTALL=/mnt/common/WASSERMAN_SOFTWARE/AnnotateVariants/
source $ANNOTATEVARIANTS_INSTALL/opt/miniconda3/etc/profile.d/conda.sh
conda activate $ANNOTATEVARIANTS_INSTALL/opt/AnnotateVariantsEnvironment

# Pull the image; if you already have it, this step is skipped
export SINGULARITY_CACHEDIR=$PWD
singularity pull docker://google/deepvariant:deeptrio-"${BIN_VERSION}"

# Number of threads
NSLOTS=$SLURM_CPUS_PER_TASK

# Go to the submission directory (where the sbatch was entered)
cd $SLURM_SUBMIT_DIR
WORKING_DIR=/mnt/scratch/Public/TRAINING/GenomeAnalysisModule/StudentSpaces/Old/test/CaseAnalysis/

## Set working space
mkdir -p $WORKING_DIR
cd $WORKING_DIR

#### GRCh38 #### 
echo "GRCh38 genome"
GENOME=GRCh38
FASTA_DIR=/mnt/common/DATABASES/REFERENCES/GRCh38/GENOME/
FASTA_FILE=GRCh38-lite.fa

SEQ_TYPE=WGS
BAM_DIR=$WORKING_DIR
OUTPUT_DIR=$WORKING_DIR # OUTPUT_DIR was not defined in the original script; assuming outputs go alongside the BAMs
Case_ID=Case1
FAMILY_ID=$Case_ID
PROBAND_ID=${Case_ID}_proband
MOTHER_ID=${Case_ID}_mother
FATHER_ID=${Case_ID}_father

PROBAND_BAM=${PROBAND_ID}.sorted.bam
FATHER_BAM=${FATHER_ID}.sorted.bam
MOTHER_BAM=${MOTHER_ID}.sorted.bam

PROBAND_VCF=${PROBAND_ID}.vcf.gz
FATHER_VCF=${FATHER_ID}.vcf.gz
MOTHER_VCF=${MOTHER_ID}.vcf.gz

PROBAND_GVCF=${PROBAND_ID}.gvcf.gz
FATHER_GVCF=${FATHER_ID}.gvcf.gz
MOTHER_GVCF=${MOTHER_ID}.gvcf.gz

# Run singularity
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    -B "${BAM_DIR}":"/bamdir" \
    -B "${FASTA_DIR}":"/genomedir" \
    -B "${OUTPUT_DIR}":"/output" \
    docker://google/deepvariant:deeptrio-"${BIN_VERSION}" \
    /opt/deepvariant/bin/deeptrio/run_deeptrio \
    --model_type=$SEQ_TYPE \
    --ref="/genomedir/$FASTA_FILE" \
    --reads_child="/bamdir/$PROBAND_BAM" \
    --reads_parent1="/bamdir/$FATHER_BAM" \
    --reads_parent2="/bamdir/$MOTHER_BAM" \
    --output_vcf_child="/output/$PROBAND_VCF" \
    --output_vcf_parent1="/output/$FATHER_VCF" \
    --output_vcf_parent2="/output/$MOTHER_VCF" \
    --sample_name_child="${PROBAND_ID}" \
    --sample_name_parent1="${FATHER_ID}" \
    --sample_name_parent2="${MOTHER_ID}" \
    --num_shards=$NSLOTS \
    --intermediate_results_dir="/output/intermediate_results_dir" \
    --output_gvcf_child="/output/$PROBAND_GVCF" \
    --output_gvcf_parent1="/output/$FATHER_GVCF" \
    --output_gvcf_parent2="/output/$MOTHER_GVCF" 

#GLNexus
/mnt/common/Precision/GLNexus/glnexus_cli -c DeepVariant${SEQ_TYPE} \
        $PROBAND_GVCF \
        $FATHER_GVCF \
        $MOTHER_GVCF \
        --threads $NSLOTS \
        > ${FAMILY_ID}.glnexus.merged.bcf

bcftools view ${FAMILY_ID}.glnexus.merged.bcf | bgzip -c > ${FAMILY_ID}.merged.vcf.gz

Variant

Header

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##GLnexusVersion=v1.3.1-0-g0e1c9c9
##GLnexusConfigName=DeepVariantWGS
##GLnexusConfigCRC32C=706912162
##GLnexusConfig={unifier_config: {drop_filtered: false, min_allele_copy_number: 1, min_AQ1: 10, min_AQ2: 10, min_GQ: 0, max_alleles_per_site: 32, monoallelic_sites_for_lost_alleles: true, preference: common}, genotyper_config: {revise_genotypes: true, min_assumed_allele_frequency: 9.99999975e-05, required_dp: 0, allow_partial_data: true, allele_dp_format: AD, ref_dp_format: MIN_DP, output_residuals: false, more_PL: true, squeeze: false, trim_uncalled_alleles: true, top_two_half_calls: false, output_format: BCF, liftover_fields: [{orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}, {orig_names: [AD], name: AD, description: "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}, {orig_names: [GQ], name: GQ, description: "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}, {orig_names: [PL], name: PL, description: "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype Likelihoods\">", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}]}}
##bcftools_viewVersion=1.10.2+htslib-1.10.2
##bcftools_viewCommand=view Case1.glnexus.merged.bcf; Date=Sun Jan 30 20:56:13 2022

Variant line

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Case1_father    Case1_mother    Case1_proband
X   48684399    X_48684399_C_A  C   A   45  .   AF=0.333333;AQ=45   GT:DP:AD:GQ:PL:RNC  0/0:22:22,0:50:0,75,749:..  0/1:37:19,18:45:45,0,54:..  0/1:18:0,18:4:33,0,1:..

Here the male proband, fully hemizygous for the variant, is represented as 0/1: 0/1:18:0,18:4:33,0,1:..
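For reference, a quick way to pull the per-sample genotypes at this site out of each merged VCF (a sketch; both pipelines above write Case1.merged.vcf.gz, so the names below assume the two call sets were renamed to keep them apart, and that each file is tabix-indexed):

# Print per-sample genotypes at the site for each pipeline's merged VCF
# (create indexes first with: bcftools index -t <file>)
for VCF in Case1.deepvariant.merged.vcf.gz Case1.deeptrio.merged.vcf.gz; do
    bcftools query -r X:48684399 -f '%CHROM\t%POS[\t%SAMPLE=%GT]\n' "$VCF"
done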

Why it matters

Now, this is a BIG problem, because downstream tools like Exomiser consume this output for variant prioritization in rare disease cases. So take a look at this.

Exomiser HTML output for DeepVariant, where the variant is represented as 1/1. The gene is the top-ranked hit for this simulated rare disease case. Good:

[Exomiser HTML screenshot: DeepVariant pipeline, gene ranked first]

Exomiser HTML output for DeepTrio, where the variant is represented as 0/1. Notice that the gene is now not even in the top 10 candidates, because Exomiser has interpreted this variant incorrectly (likely assuming the site lies in a PAR region on X). The variant score dropped to 0.00 because the assumption about the GT string was violated:

[Exomiser HTML screenshot: DeepTrio pipeline, gene absent from the top 10]

I can send the HTML outputs, VCFs, or even CRAMs if you are interested by email or other file transfer.

Anyways, going to try with 1.3 while I wait for an answer here.

Cheers, Phil

AndrewCarroll commented 2 years ago

Hi @Phillip-a-richmond

Thank you for the report. You are correct that the out-of-the-box behavior of DeepTrio can currently be sub-optimal for the sex chromosomes in male samples. We've benchmarked some strategies for dealing with this.

In the short term, our recommendation is to run separate calling on the non-PAR regions of chrX and chrY, where only the mother is provided as the parent when calling the son's chrX, and (less importantly, as it is unclear whether chrY is affected) only the father is provided when calling the son's chrY.

In the documentation, this is expressed in the following statement: "Also please note: for the non-PAR regions of the sex chromosomes (X and Y), we recommend running these providing only the parent who contributed the child's chromosome (e.g. for chromosomeX, only the mother and son samples and for chromosomeY only the father and son samples)."
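For concreteness, a rough sketch of that recommendation, reusing the DeepTrio invocation from the pipeline above: call the son with only the mother as parent, restricted to non-PAR X. The --regions value below uses approximate GRCh38 PAR boundaries on a contig named X (to match the GRCh38-lite reference above); adjust it for your reference.

# Duo call (mother + son) restricted to non-PAR X; a sketch only.
# X:2781480-155701382 approximates GRCh38 chrX minus PAR1/PAR2.
singularity run -B "${BAM_DIR}":"/bamdir" \
    -B "${FASTA_DIR}":"/genomedir" \
    -B "${OUTPUT_DIR}":"/output" \
    docker://google/deepvariant:deeptrio-"${BIN_VERSION}" \
    /opt/deepvariant/bin/deeptrio/run_deeptrio \
    --model_type=WGS \
    --ref="/genomedir/$FASTA_FILE" \
    --reads_child="/bamdir/$PROBAND_BAM" \
    --reads_parent1="/bamdir/$MOTHER_BAM" \
    --output_vcf_child="/output/${PROBAND_ID}.chrX.vcf.gz" \
    --output_vcf_parent1="/output/${MOTHER_ID}.chrX.vcf.gz" \
    --output_gvcf_child="/output/${PROBAND_ID}.chrX.gvcf.gz" \
    --output_gvcf_parent1="/output/${MOTHER_ID}.chrX.gvcf.gz" \
    --sample_name_child="${PROBAND_ID}" \
    --sample_name_parent1="${MOTHER_ID}" \
    --regions "X:2781480-155701382" \
    --num_shards=$NSLOTS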

The DeepTrio manuscript has benchmarks for this strategy in the following section (near the end of results):

The Genome in a Bottle truth sets do not contain chromosomeX or chromosomeY variants in a male individual. As a result, DeepTrio has never been trained with hemizygous sites. Because we train DeepTrio to perform duo calling, it is likely that DeepTrio would call variants on chromosomeX similar to how it would call a duo sample. To assess this, we ran DeepVariant and DeepTrio on chromosomeX of the son (HG002) and measured the number of heterozygous variant calls in the non-PAR regions of chromosomeX.

For DeepVariant, 4.45% (455/101866) of calls in non-PAR regions of chromosomeX are heterozygous. In DeepTrio, 24.5% (21633/88314) are heterozygous. This substantial difference suggests that applying DeepTrio directly to chromosomeX in male samples is problematic. Since chromosomeX in males is inherited from the mother, we performed calling on chromosomeX with only the mother provided as the parent. This reduced heterozygous calls to 3.37% (3518/104427), which is better than in the DeepVariant case. For male samples, this recommends that variant calling should be run with both parents on the autosomal and PAR regions using a BED file to restrict location, and additional variant calling should be performed using only the mother’s file provided as parent for the non-PAR regions of chromosomeX, and only the father’s provided for the non-PAR regions of chromosomeY.
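(As a sanity check, the heterozygous fraction quoted above can be approximated with a small sketch like this; the file name is hypothetical, the GRCh38 non-PAR interval is approximate, and the contig may be named X or chrX depending on the reference.)

# Tally genotype strings for non-PAR chrX calls;
# het fraction = count of 0/1 divided by the total.
bcftools query -r chrX:2781480-155701382 -f '[%GT]\n' HG002.chrX.vcf.gz \
    | sort | uniq -c | sort -rn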

This experiment indicates that allowing the model to infer a hemizygous chromosome through coverage and explicitly training for hemizygous variants is an opportunity for improvement, both for DeepVariant and DeepTrio.

Over the long term:

We anticipate that the T2T consortium will shortly be making available complete assemblies of chrX and chrY for the HG002 sample. This should allow us to make training labels for chrX and chrY in the hemizygous case, and we have a few internal ideas about how to use this to make models that will generally take this into account.

I would be curious to hear your feedback about whether the short-term solution is acceptable, whether you find it insufficient, or whether you would recommend other approaches for us to address the issue.

Thank you, Andrew

AndrewCarroll commented 2 years ago

Hi @Phillip-a-richmond

As an additional follow-up, I am wondering if my answer is off the mark: I have discussed this with respect to calling accuracy, while you may prefer fixes to the VCF representation. Is your main request that we have the ability to output a hemizygous representation in the VCF? This is something we are starting to think about for VCF and gVCF as we prepare to work with the T2T assemblies.

We anticipate that the sex-aware model would call these variants as 1/1 (since we will use that output class for hemizygous variants). We also anticipate that providing only the mother for chrX will result in a 1/1 call. However, we won't represent this specifically as hemizygous.

In order to resolve your issue, do we need to solve this hemizygous representation problem specifically, or will producing more 1/1 calls in the manner described in my prior comment be sufficient?

Phillip-a-richmond commented 2 years ago

Hi Andrew,

Thanks for your response. There are a few items here which I'll address.

  1. Is your main request that we have the ability to output a hemizygous representation of the VCF? This is something we are starting to think about for VCF and gVCF as we prepare to work with the T2T assemblies.

Yes, proper representation on the X chromosome is important, and it has caused this problem for downstream analyses of pathogenic hemizygous variants. The same would likely be true for de novo hemizygous variants in the male non-PAR X. If this were fixed sooner rather than later, it would increase confidence in DeepTrio. Until it is fixed, I'll be shifting back to DeepVariant gVCF -> GLNexus for all rare disease trio samples.

  2. In order to resolve your issue, do we need to solve this hemizygous representation problem specifically, or will producing more 1/1 calls in the manner described in my prior comment be sufficient?

If a variant is present in 100% of the reads, representing it as 1/1 would be ideal, as this is what downstream tools expect. The expectation for a male is that the majority of variants on X are 1/1; it would be dangerous if a female sample had a majority of 1/1 calls.

  3. Also please note: for the non-PAR regions of the sex chromosomes (X and Y), we recommend running these providing only the parent who contributed the child's chromosome (e.g. for chromosomeX, only the mother and son samples and for chromosomeY only the father and son samples)

Sounds like this involves a lot of mangling of samples and VCF cutting/manipulation... and no, I'm not thrilled about having this pushed on the user. Your recommendation would be:

  1. Run DeepTrio on the trio.
  2. Use BCFtools to cut out chrX + chrY.
  3. Rerun DeepTrio as a duo, once with the mother and once with the father, keeping the Y variants from the paternal analysis and the X variants from the maternal analysis. (What happens with a female child? No change?)
  4. Now you've got VCFs with mismatched sample columns (mom-proband, dad-proband, mom-dad-proband), so you can't just concat them together...
  5. Mangle the 2-3 VCFs together.

Essentially you're asking the user to run DeepTrio more than twice for it to work on a trio with a male proband, including manual VCF mangling on the user side. It would be much appreciated if this process could be internalized and parameterized within the run_deeptrio command. Alternatively, if this is your best practice, then providing code chunks to this effect would be appreciated.
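To make the pain concrete, here is roughly what steps 2, 4, and 5 could look like with bcftools (a sketch with hypothetical file names; note that it forces sample-column harmonization, crudely done here by keeping only the proband column, which throws away the parental genotypes):

# Step 2: drop X/Y from the trio call (contigs are named X/Y in GRCh38-lite)
bcftools view -t ^X,Y -Oz -o trio.autosomes.vcf.gz Case1.merged.vcf.gz

# Steps 4-5: bcftools concat needs identical sample columns, so subset
# every piece to the proband before concatenating (losing mom/dad columns)
for F in trio.autosomes duoX.mother duoY.father; do
    bcftools view -s Case1_proband -Oz -o ${F}.proband.vcf.gz ${F}.vcf.gz
    bcftools index -t ${F}.proband.vcf.gz
done
bcftools concat -a -Oz -o Case1.stitched.vcf.gz \
    trio.autosomes.proband.vcf.gz duoX.mother.proband.vcf.gz duoY.father.proband.vcf.gz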

At the end of the day, I as a user would prefer to run 1-2 commands to get a merged trio VCF. If you provided a code chunk to run the existing tool in the configuration you describe, then I'd be happy to insert it and test it on my cases.

Thanks, Phil

AndrewCarroll commented 2 years ago

Hi @Phillip-a-richmond

I understand and agree that this workflow is cumbersome and far from ideal. Here is the course of action that we propose to take:

Within the next 2 weeks, we anticipate that we'll likely have GIAB labels for X and Y. The first course of action we'll take is to incorporate those and attempt to train a new model. Depending on how promising this looks, we may ask if you are interested in testing a Docker image with this model and providing feedback on it.

I can't guarantee that this will work, but I think it has reasonable odds. If it does not, we do have other ideas for X and Y calling, but they would take a bit more time.

I will plan to reach back out in 2 weeks with an update about the labels and a refined estimate for when we might have the model.

Thank you, Andrew

Phillip-a-richmond commented 2 years ago

Yes, this sounds like a great plan. Happy to try out your Docker beta version to see if it fixes this hemizygous issue.

Have a good weekend, Phil

AndrewCarroll commented 2 years ago

Hi @Phillip-a-richmond

Just as an update on this issue, we have the truth sets and have trained some experimental models. We're still in the process of refining these models, as well as how to use the T2T truth sets. So we do have progress, but it's still going to take time before we're ready with something for you to look at.

Phillip-a-richmond commented 2 years ago

Thanks @AndrewCarroll! Happy to test beta versions when you produce them.

pichuan commented 2 years ago

Hi @Phillip-a-richmond, I want to give an update on this issue:

As part of working on v1.4.0, we did some experiments on this, which is still work in progress.

If you can reach out to me and @AndrewCarroll (you can email me at pichuan@google.com), we can follow up on an experimental model for you to test, if you're still interested.

pichuan commented 1 year ago

Hi @Phillip-a-richmond and everyone else who might be following this thread. One update:

One post-processing trick that you can apply now: take the genotype probability distribution, drop the 0/1 class, and renormalize the other two. From there, you can decide whether the call should be 0/0 or 1/1. (This idea came from our collaborator @doron-st at Ultima Genomics. They verified this on a dataset and showed good precision. They did this on DeepVariant, not DeepTrio, but I think the same trick can be applied.)
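As a toy illustration of the trick on the proband PL from the DeepTrio output above (a standalone sketch, not Doron's script):

# PL=33,0,1 encodes phred-scaled likelihoods for 0/0, 0/1, 1/1.
# Drop the 0/1 class and renormalize over the two homozygous classes.
echo "33,0,1" | awk -F',' '{
    p00 = 10^(-$1 / 10)          # relative likelihood of 0/0
    p11 = 10^(-$3 / 10)          # relative likelihood of 1/1
    post = p11 / (p00 + p11)     # renormalized P(1/1)
    gt = (post > 0.5) ? "1/1" : "0/0"
    print gt, "P(1/1) =", post
}'
# prints: 1/1 P(1/1) = 0.999369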

Our team is currently considering building this into an option in postprocess_variants, so that we can do this re-normalization ourselves.

(Another more principled approach will be to build this knowledge into the modeling part. We're also considering that, but that will be an even longer-term solution.)

Thanks @Phillip-a-richmond again for reporting this. We have filed an internal issue to track this work. I'll close this bug for now.

pichuan commented 1 year ago

Following up on my previous comment: @doron-st has shared his script here: https://github.com/Ultimagen/VariantCalling/blob/c2634d8c08db4473e61b786eed06afc2bb8fccd1/ugvc/pipelines/convert_haploid_regions.py (Thanks Doron!)

nurmians commented 1 year ago

Any updates on this? The link for the script is broken now.

pgrosu commented 1 year ago

@nurmians I'm still looking through the code and the release notes, still reasoning through whether it might have been implemented via some alternate logic. In any case, here's a current link to the script for converting genotypes of specified regions to haploid calls:

https://github.com/Ultimagen/VariantCalling/blob/master/ugvc/pipelines/convert_haploid_regions.py

pichuan commented 1 year ago

Hi @nurmians, the link that @pgrosu posted seems like the right one from Ultima!

Our team is working on incorporating this as part of postprocess_variants. The official version will be out in the next release (before the end of 2023). You can get a preview of the two flags we'll be adding: https://github.com/google/deepvariant/blob/dev/deepvariant/postprocess_variants.py#L170-L188

(Note that we don't officially support code in the "dev" branch. If you want official documentation and usage, please wait for the upcoming release. But feel free to take a look at the code in the meantime.)
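Based on those dev-branch lines, usage might look roughly like the sketch below. The flag names (--haploid_contigs and --par_regions_bed) are unofficial until the release, and the PAR BED path is a placeholder.

# Sketch only: treat X/Y as haploid for a male sample, with the
# pseudoautosomal regions carved back out as diploid via a BED file.
/opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="/genomedir/GRCh38-lite.fa" \
    --reads="/bamdir/Case1_proband.sorted.bam" \
    --output_vcf="/output/Case1_proband.vcf.gz" \
    --haploid_contigs="X,Y" \
    --par_regions_bed="/genomedir/GRCh38_PAR.bed" \
    --num_shards=$NSLOTS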

pgrosu commented 1 year ago

Hi Pi-Chuan,

I was thinking a bit about the nice upcoming changes. It will probably be easier to consolidate the two proposed flags into a single parameter such as --non_recombinant_regions (or --hemizygous_regions), since all the other regions go through recombination. The PAR regions are already covered by DeepVariant's default behavior, which assumes diploid, recombining chromosomes, so you wouldn't need the --par_regions_bed flag.

~p

AndrewCarroll commented 1 year ago

Hi @pgrosu

Thank you for the suggestion. The exact way to manifest the flags came up in internal discussion. I actually proposed basically the approach you detail here (I wanted to call the parameter --haploid_regions).

The counter-argument was that people who have BED files for what they want to call more often already have these files in the form of whole chromosomes plus a separate file of the PAR regions. It is possible to generate the --haploid_regions or --hemizygous_regions file using a bedtools subtract command, but that requires one extra step, and users who want to document or publish what they have done end up with a non-standard file.
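(For reference, that extra step would be a one-liner along these lines; the file names are hypothetical.)

# Derive a haploid/hemizygous-regions BED: whole X/Y intervals minus the PARs
bedtools subtract -a chrXY.bed -b GRCh38_PAR.bed > haploid_regions.bed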

This argument persuaded me to drop my suggestion and endorse the solution you currently see at L170-L188. However, I also did not see a large relative advantage for either approach from a user perspective, so I did not advocate particularly forcefully. I do like that the current arrangement will be able to take some fairly standard community files.

If you think the difference for a user will be more strongly in preference of the single flag, we'd be open to the discussion, but hopefully the rationale for the current flag choice makes a bit more sense.

pgrosu commented 1 year ago

Hi Andrew,

I think we are in perfect agreement. Supporting what the community wants is my preferred approach as well.

I was only suggesting it after having performed a reduction upon the functional set of DeepVariant's design.

Just in case the community requires more variety of labeled inputs, having a parameter processor might help in mapping between DeepVariant's core operation and different parameter sets. This would create flexibility for both the goals of the design team and community.

For example, the design team might want flexibility in playing around with different transformations of read signals and parameter search spaces - which might change over time or with different data sets - without worrying that the parameter list might require updating.

Let's take the following simplified model optimization:

[diagram: simplified model optimization loop]

The team would let this model optimize itself to determine the parameter set under which it performs ideally. Metadata about the training would then be saved along with the optimized model, and users could query the model's metadata via some program, which would provide information about the training data and parameter limits for different model types. Such metadata would be automatically populated through the model training step (where I'm including the validation and tuning steps). Based on that, a default set of parameters can be auto-generated for general users, and advanced users can then map their custom parameters however they want. It basically makes the training turn-key and the usage automatic. Such training could even be remotely initiated through a browser, and different users could make some of their models available to others if they want, helping out the community and growing it at the same time.

This would also make adopting new models much faster, including cases where a large set of specialized ensemble models might need to be checked against.

Best regards, Paul