openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Add support for RefSeqFE and intergenic mappings to RefSeqGenes #600

Open Peter-J-Freeman opened 2 months ago

Peter-J-Freeman commented 2 months ago

Is your feature request related to a problem? Please describe. RefSeqFEs are now available

Describe the solution you'd like We need to be able to map to RefSeqGenes in intergenic (between transcript regions) to make these mappings

Peter-J-Freeman commented 2 months ago

The code does exist and works at least for RefSeqGenes in the currend DB build. I am making a build with RefSeqFE alignments now

Peter-J-Freeman commented 2 months ago
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000017.11:g.50201640='
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "flag": "intergenic",
    "intergenic_variant_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "LRG_1:g.5000=",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "NG_007400.1:g.5000=",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000017.10:g.48279001=",
                "vcf": {
                    "alt": "C",
                    "chr": "17",
                    "pos": "48279001",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000017.11:g.50201640=",
                "vcf": {
                    "alt": "C",
                    "chr": "17",
                    "pos": "50201640",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000017.10:g.48279001=",
                "vcf": {
                    "alt": "C",
                    "chr": "chr17",
                    "pos": "48279001",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000017.11:g.50201640=",
                "vcf": {
                    "alt": "C",
                    "chr": "chr17",
                    "pos": "50201640",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_1.xml",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_007400.1"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NC_000017.11:g.50201640=",
        "transcript_description": "",
        "validation_warnings": [
            "No transcripts found that fully overlap the described variation in the genomic sequence"
        ],
        "variant_exonic_positions": null
    },
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev539+g8a3b4db.d20240412",
        "vvdb_version": "vvdb_2024_4",
        "vvseqrepo_db": "VV_SR_2024_01/master",
        "vvta_version": "vvta_2024_01"
    }
}

Note, the position is mapped to the RefSeqGene NG_ as hoped :)

Peter-J-Freeman commented 2 months ago

We can also add additional annotation of the RefSeqGene. Foe example we couls populate the

        "gene_ids": {},
        "gene_symbol": "",

method. Look up mane transcript from Gene Symbol and pull back the stable gene id info.

If not a Gene record (i.e. is a functional element)

Pull back the information we have gathered from the FE record

Example for COL1A1 is

        "gene_ids": {
            "ccds_ids": [
                "CCDS11561"
            ],
            "ensembl_gene_id": "ENSG00000108821",
            "entrez_gene_id": "1277",
            "hgnc_id": "HGNC:2197",
            "omim_id": [
                "120150"
            ],
            "ucsc_id": "uc002iqm.4"
        },
        "gene_symbol": "COL1A1",

For a functional element it will be something like

        "gene_ids": {
            "ccds_ids": [
                ""
            ],
            "ensembl_gene_id": "",
            "entrez_gene_id": "127270644",
            "hgnc_id": "",
            "omim_id": [
                ""
            ],
            "ucsc_id": ""
        },
        "gene_symbol": "H3K27ac hESC enhancer chr1:161101327-161102150 (GRCh37/hg19 assembly coordinates)",

The gene info is much less because we are using the Entrez data and these do not necessarily map to other sources, unlike HGNC:IDs and I have to say that I do like that the associated name in the symbol field will define the mapping as GRCh37 where relevant. OK, it is not exactly a Symbol but there are no symbols and this is the least destructive option we have available.

Peter-J-Freeman commented 2 months ago

@John-F-Wagstaff @leicray @ifokkema . Any comments please?

Peter-J-Freeman commented 2 months ago

We could pull in the gene symbol i.e. For NG_162137 it is LOC129929080, but this will take more code changes and does it matter enough???

Or event the full gene name which would be Homo sapiens ATAC-STARR-seq lymphoblastoid active region 11 (LOC129929080) on chromosome 1

But since we link the NG_ record, is it worth the effort?

Peter-J-Freeman commented 2 months ago

I think for consistancy we need the gene symbols. but I am taking the easy path for now do the symbol for these FE regions will be akin to

"gene_symbol": Homo sapiens ATAC-STARR-seq lymphoblastoid active region 11 (LOC129929080) on chromosome 1

of for GRCh37

"gene_symbol": Homo sapiens H3K4me1 hESC enhancer GRCh37_chr1:172707-173207 (LOC127266739) on chromosome 1

so it is very clear they are not defined HGNC gene symbols

ifokkema commented 2 months ago

I'm still catching up with my emails. I'm not fully following what this is for; is there an issue or discussion page or feature request or something else to help me understand what this is related to?

Peter-J-Freeman commented 2 months ago

It expands the discussion in the HGVS nomenclatue committee.

Why would we use c. descriptions that go beyond the transcript boundary?

Because they hit promoters / enhancers etc.

This is bad practice because promoters and enhancers are not part of transcript records and have clearly defined and annotated reference sequences of their own. Sure of there is a RefSeqGene record, this can be used but for the intergenic promoters and enhancers there are now FE records, as I stated yesterday

It just so happens that VV can support these with very few code changes, so we can put together very easily outputs that provide you with the

NG_ description for the functional element (e.g. a promoter) A gene ID A gene symbol / description

As you know, I like to do things correctly and sensibly (conservatively) so want to model this before the next committee meeting and show you the error of your ways ;P

leicray commented 2 months ago

Everything that you might need to know about accessing the RefSeqFE data in various formats can be found in:

34876495_Supplemental_Table_S1.xlsx

leicray commented 2 months ago

To add to the reasons why RefSeqFE records need to be supported, consider what happens when a genomic variant is not found to be spanned by a transcript. At present, VV simply reports that that is the case and the user might suppose that the variant is not significant in the context of their disease of interest.

However, the variant might lie within a Functional Element and a significant sequence alteration (such as a large deletion) within the element might have a detrimental effect on the expression of one or more genes.

Peter-J-Freeman commented 2 months ago

Once the database is built I will add some examples. Getting there now. Just takes ages to spot errors since ther are >140,000 records to handle. But the code seems to be in place, so once built I think supporting these may be pretty simple. Fingers crossed.

ifokkema commented 2 months ago

Please correct my thinking, guys; as far as I understand it:

Peter-J-Freeman commented 2 months ago

@ifokkema Open a feature request. What we can do is add the functionality and provide warnings saying that the descriptions are not (and in my opinion should not be) HGVS compliant, but can be useful as annotative descriptions. @John-F-Wagstaff will work it into the roadmap.

Peter-J-Freeman commented 2 months ago

Always worth meetign half way :)

ifokkema commented 2 months ago

@ifokkema Open a feature request.

It's already there :grin: And added to the roadmap! Another thought is to open a GitHub project for this... :thinking: Oh, never mind me messing around interfering with your way of organizing things 😂

What we can do is add the functionality and provide warnings saying that the descriptions are not (and in my opinion should not be) HGVS compliant, but can be useful as annotative descriptions.

That also makes sense! But I'm totally fine building my own non-HGVS from it, by simply taking data from the API endpoint.

Always worth meetign half way :)

:smile:

Peter-J-Freeman commented 2 months ago

@John-F-Wagstaff. The code for VV is now in place, but there are a lot of RefSeqGene records now missing from SeqRepo, so getting errors like

[{'hgvs_refseqgene': 'NG_032957.1:g.4T=', 'gene': 'LOC101056699', 'valid': "Failed to fetch NG_032957.1 from SeqRepo (/Users/user/variantvalidator_data/seqdata/VV_SR_2024_01/master) ('Alias NG_032957.1 (namespace: None)')"}]

We need to add all records from

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/

Not sure where the RefSeqGene FASTA sequences are sourced from which go into SeqRepo

John-F-Wagstaff commented 2 months ago

As far as I know all RefSeqGene sequences come from

https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/

Previously we loaded the set of sequences that caused transcript records to be correctly loaded into VVTA, so we now should just need to load all of the .fna.gz files from this folder into seqrepo.

I have set this running on my end, I will upload a seqrepo_2024_04.tar version when it is finished, but if you want to you should be able to load them locally for testing.

Was there anything actually from https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/ missing from your logs? That would be a different problem.

edit: the seqrepo command for loading files locally is seqrepo --root-directory <dir> load -n vvta <seq_file> with <dir> and <seq_file> swapped out as needed.

Peter-J-Freeman commented 2 months ago

It's the RefSeq FE dataset that seems to be absent in SeqRepo. This might well be them

https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/biological_region/

The prefix is NG_

John-F-Wagstaff commented 2 months ago

OK, that looks more like it, I have deleted the old version and started a re-build with this, but it won't solve everything.

I just went to check if NG_032957.1 got loaded into the (fatter base RSG) dataset and found it had not, so I searched RefSeq for it and got https://www.ncbi.nlm.nih.gov/nuccore/407892488 which is not a genomic sequence or a biological region but a psudo-gene transcript, an inferred i.e. predicted psudo-gene even, it is not in the standard Ref seq inputs and does not exist in https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/biological_region/ either, at least if grep on the unpacked form of human.biological_region.fna.gz is to be believed.

I think this might not be something we want to load, at all, I guess it ends up in the full alignment set we have to load now as the RSG only one got discontinued? If so we may need extra filtering on that input which is unpleasant, do you have any thoughts?

Peter-J-Freeman commented 2 months ago

NG032957.1 seems to be a Pseudogene and not a Pseudogene transcript, so we should be able to load it, but agreed, I do not see why we would want to load pseudogenes. Also, as you said, it is not in the functional elements fna file, so where would it be??? So, I guess we just load the biological regions data and filter out those that have NG prefixes but are not in SeqRepo.

Peter-J-Freeman commented 2 months ago

I don;t think VVTA needs to be addrwssed since there is no alignment data required that relates to transcripts. All alignment data are used from validator db

John-F-Wagstaff commented 2 months ago

Well that (filtering out those that have NG_ prefixes but are not in SeqRepo) would work, and simplifies things, for now, so long as we know what we want to add to SeqRepo at least.

Do you want the full RSG dataset i.e. including those that don't have transcripts loaded into VVTA?

Either way on the RSG front, do you otherwise just want the contests of https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/biological_region/human.biological_region.fna.gz ?

If you are OK with the human.biological_region.fna (+/- the full RSG extras) then although the update is still processing, if you give the OK, I should be able to upload it tonight, assuming it does not take too long to load, or tomorrow before lunch at the latest.