konradjk / exac_browser

Browser for ExAC consortium data
http://exac.broadinstitute.org
MIT License

COL3A1 annotation is incomplete #126

Closed: pbyers43 closed this issue 9 years ago

pbyers43 commented 9 years ago

I received a note indicating that the full COL3A1 sequence was there, but it isn't. The annotation in this database ends at the end of exon 48, but the gene has 51 exons. See the EVS annotation for the coordinates.

konradjk commented 9 years ago

Can you email me? Looks to be fixed on my end (last variant in the gene is 2:189876484 which is in the 51st exon), but send me details and we can try to figure out the issue offline.

pbyers43 commented 9 years ago

Konrad,

I located the apparent problem.

189872768 p.Val1142Asp PASS missense * 1 120290 0.000008313
2:189872776 A / G (rs147706051) http://exac.broadinstitute.org/variant/2-189872776-A-G 2 189872776 PASS splice acceptor * 2 119866 0.00001669
2:189872787 T / C http://exac.broadinstitute.org/variant/2-189872787-T-C 2 189872787 p.Pro845Pro

The box doesn't come out quite right, but I think you can see that the variant at 189872768 is listed as p.Val1142Asp and the one at 189872787 is listed as p.Pro845Pro, so the numbering jumps back to a previous exon. I think if you fix the numbering from that point down so that it continues from p.Val1142Asp, the numbering will be correct. The last variant, which does map to the last exon when you go to the assigned UCSC site, should be at position 1462, not 1159.
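One way to spot this kind of jump automatically is to walk the variants in genomic order and flag any place where the protein position goes backwards. The sketch below is only illustrative: `protein_position` and `numbering_breaks` are hypothetical helpers, not part of the ExAC browser code, and it assumes a plus-strand gene and three-letter HGVS protein strings like the ones quoted above.

```python
import re

# Matches the residue number in a three-letter HGVS protein string, e.g. "p.Val1142Asp".
_P_POS = re.compile(r"^p\.[A-Za-z]{3}(\d+)")


def protein_position(hgvs_p):
    """Return the residue number from an HGVS p. string, or None if absent."""
    if not hgvs_p:
        return None
    match = _P_POS.match(hgvs_p)
    return int(match.group(1)) if match else None


def numbering_breaks(rows):
    """Flag consecutive annotated variants whose protein position decreases.

    rows: iterable of (genomic_position, hgvs_p_or_None).
    Assumes a plus-strand gene, so protein positions should not decrease
    as the genomic position increases.
    """
    breaks = []
    previous = None
    for pos, hgvs_p in sorted(rows, key=lambda row: row[0]):
        residue = protein_position(hgvs_p)
        if residue is None:
            continue  # e.g. splice variants with no p. consequence
        if previous is not None and residue < previous[1]:
            breaks.append((previous, (pos, residue)))
        previous = (pos, residue)
    return breaks


# The three rows quoted above; the splice acceptor has no p. consequence.
rows = [
    (189872768, "p.Val1142Asp"),
    (189872776, None),
    (189872787, "p.Pro845Pro"),
]
print(numbering_breaks(rows))
```

A break like this does not say which transcript is right, only that neighbouring rows were probably numbered against different transcripts.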

Hope this helps,

Peter

Peter H. Byers, MD Departments of Pathology and Medicine (Medical Genetics) Box 357470 University of Washington Seattle, WA 98195-7470 Phone: 206-543-4206 FAX: 206-616-1899 Collagen Diagnostic Laboratory: http://www.pathology.washington.edu/clinical/collagen/

Privileged, confidential or patient identifiable information may be contained in this message. This information is meant only for the use of the intended recipients. If you are not the intended recipient, or if the message has been addressed to you in error, do not read, disclose, reproduce, distribute, disseminate or otherwise use this transmission. Instead, please notify the sender by reply e-mail, and then destroy all copies of the message and any attachments.

konradjk commented 9 years ago

I figured out a way to fix most of these by defaulting to the canonical transcript if two transcripts have an equal annotation. This will still be an issue if a variant is missense for a shorter transcript but synonymous for a longer one, but it's the best way I can think of for getting across the worst annotation in the table where possible.

pbyers43 commented 9 years ago

Dear Konrad,

Thanks for taking the time to look at this. I see that there are two major transcripts in Ensembl: 201 and 001. The latter is full length, with all the coding exons, and the former (201) has 9 or 10 exons deleted, which seems to correspond to one of the transcripts that you have. I am not sure of the origin of the shorter one in Ensembl, but I think it is a known mutant gene from an individual with Ehlers-Danlos syndrome type IV and is clearly a disease-causing transcript. The 001 transcript is the full-length one and clearly should become the standard wild-type size that you use for referencing variants. I think you are on the way, but the COL3A1 file still contains the strangeness about which we corresponded previously. Would it help to actually talk about it, or is this enough?

Peter

konradjk commented 9 years ago

Sorry, I forgot to note that it's fixed in the code, but for the website, this will actually be fixed in the next release (requires a server update, so I'll do it at an off-hour).

pbyers43 commented 9 years ago

Thanks. I'll look again later in the week.

Peter

pbyers43 commented 9 years ago

Konrad

I looked at COL3A1 again today and it remains the same. I think that the only solution is to use the single longest reference file to align.

A second question. Is there a way to have the c. location in addition to the p. specification? It is a big help in some situations, especially if it includes orientation of the intronic sequence variants.

Regards,

Peter

pbyers43 commented 9 years ago

It appears that more than one reference sequence was used to annotate COL4A5. This is similar to the problem encountered in COL3A1. It would appear that NM_033380 should be used as the reference sequence. Currently, several appear to be in use.

Peter

konradjk commented 9 years ago

Sorry, finally got around to getting the update on the server: should be fixed now, let me know if it's not.

pbyers43 commented 9 years ago

Konrad,

COL3A1 now looks right. Thanks.

We have been sequencing some of the COL4 genes and I looked at COL4A5 and it has the same problem as the old COL3A1. I sent a note about it yesterday. In this case it looks as if there may have been multiple transcripts used for alignment. I suggested one to be used.

Peter

konradjk commented 9 years ago

I believe COL4A5 is behaving as expected (as described below), let me know if this is not the case.

Just to clarify what is expected: on the gene page, for each variant, we report the worst annotation for any transcript. If the worst annotation is on multiple transcripts, we prefer to report the HGVS consequence for the canonical transcript (as defined by Ensembl v75); if the worst annotation is not on the canonical transcript, we choose a random transcript to report the HGVS consequence.

This is the default behavior which we can't really override for specific genes, but we are working on generating a set of "clinically-relevant transcripts" that we will highlight on the gene page somehow.
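The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the browser's actual code: the `SEVERITY` ranking is an assumed, abbreviated ordering, and the transcript names and HGVS strings in the example are hypothetical placeholders.

```python
import random

# Assumed, abbreviated severity ranking (most severe first). The real
# ordering used by the browser is not given in this thread.
SEVERITY = [
    "splice_acceptor_variant",
    "stop_gained",
    "missense_variant",
    "synonymous_variant",
]
RANK = {consequence: i for i, consequence in enumerate(SEVERITY)}


def pick_annotation(annotations, rng=random):
    """Pick the annotation to display for one variant.

    annotations: list of dicts with keys "transcript", "consequence",
    "hgvs" and "canonical" (bool), one per transcript.
    """
    worst = min(RANK[a["consequence"]] for a in annotations)
    tied = [a for a in annotations if RANK[a["consequence"]] == worst]
    canonical = [a for a in tied if a["canonical"]]
    if canonical:
        return canonical[0]  # prefer the canonical transcript on a tie
    return rng.choice(tied)  # otherwise any transcript with the worst annotation


# Hypothetical example: same consequence on both transcripts.
equal = [
    {"transcript": "long", "consequence": "synonymous_variant",
     "hgvs": "p.Pro1148Pro", "canonical": True},
    {"transcript": "short", "consequence": "synonymous_variant",
     "hgvs": "p.Pro845Pro", "canonical": False},
]
print(pick_annotation(equal)["transcript"])

# Hypothetical example: worse consequence on the non-canonical transcript.
mixed = [
    {"transcript": "long", "consequence": "synonymous_variant",
     "hgvs": "p.Pro1148Pro", "canonical": True},
    {"transcript": "short", "consequence": "missense_variant",
     "hgvs": "p.Pro845Leu", "canonical": False},
]
print(pick_annotation(mixed)["transcript"])
```

The second example shows why a gene table can still mix numbering from different transcripts: whenever a non-canonical transcript carries the worse consequence, its HGVS string is the one reported.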

pbyers43 commented 9 years ago

Dear Konrad,

If you are like me, and probably many others who are involved in diagnostic labs, we want to know whether the variant that we see in sequencing clinical samples has been seen before, and at what frequency. From that and additional information about the nature of the changes, we make an assessment of pathogenicity. We generally use the Reference Gene sequence as the base for our annotation and its tie to the NM_ orientation for the coding sequence.

The problem encountered in the COL3A1 sequence, which you have now repaired, was that it used two different sequences as reference: one would have been the RefSeq, and the other was, in fact, acquired from a patient with a 10-exon deletion.

The COL4A5 sequence has a number of sequences noted in the list on the front page, and rather than choosing the RefSeq one among those, or the longest of the sequences, the listing of variants hits and misses throughout without giving the identity of the sequence to which the different elements are referred. This makes it difficult to go through. The EVS solved this problem at one point by indicating the two reference sequences for the same nucleotide when they used more than one. It was useful to an extent, but since we generally use the longest RefSeq as the annotation tool, that is probably the best choice.

Another question, while you're at it: in the current iteration you have two points of reference, the genomic nucleotide and the p. position. Are you planning to add the coding sequence reference point (c.XXX) in the next iteration? It is often a very useful place setting to have.

Many thanks,

Peter
