The hg38 genome lacks mitochondrial gene information

fo40225 commented 1 year ago

When choosing Human (GRCh38/hg38) or Human (hg38 1kg/GATK), chrM does not display any RefGene-related information. Only NC_012920 has gene information.

jrobinso commented 12 months ago

The default annotation set we use, ncbiRefSeq.txt, does not have annotations for the mitochondria. However, annotations are available in other sets.

The following file has annotations for chrM. You can download it and load it from the file menu ("File > Load from File..."), or load it directly from the link without downloading ("File > Load from URL"): https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
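To check whether a downloaded annotation file actually contains chrM genes before loading it, a minimal sketch along these lines may help. It assumes the UCSC database dump layout for refGene.txt.gz (tab-separated genePred with a leading bin column, so name, chrom and name2 sit at indexes 1, 2 and 12). The function names are hypothetical and not part of IGV or any UCSC tool.

```python
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class GenePredRow(NamedTuple):
    """Subset of the genePred columns needed to list genes per chromosome."""
    name: str      # transcript or protein accession
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    name2: str     # gene symbol


def parse_genepred(lines: Iterable[str]) -> Iterator[GenePredRow]:
    """Parse tab-separated genePred rows that start with a bin column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or truncated rows
        yield GenePredRow(
            name=fields[1],
            chrom=fields[2],
            strand=fields[3],
            tx_start=int(fields[4]),
            tx_end=int(fields[5]),
            name2=fields[12],
        )


def genes_on(lines: Iterable[str], chrom: str = "chrM") -> List[str]:
    """Return the sorted, de-duplicated gene symbols annotated on one chromosome."""
    return sorted({row.name2 for row in parse_genepred(lines) if row.chrom == chrom})


def genes_in_file(path: str, chrom: str = "chrM") -> List[str]:
    """Convenience wrapper for a local gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        return genes_on(handle, chrom)
```

After downloading the file, `genes_in_file("refGene.txt.gz")` would list the gene symbols on chrM; an empty list means that annotation set has nothing for the mitochondrion.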

maximilianh commented 12 months ago

Excellent question.

The reason NCBI has no chrM annotations in the ncbiRefSeq GFF file, and why they're not considered RefSeq genes, is that these genes are not part of RefSeq. They come from a different part of NCBI (bacteria databases/pipelines) and have no RefSeq accessions, only special YP_ accessions. You can see that here: https://www.ncbi.nlm.nih.gov/gene/4535 - no RefSeq accession.

We could revisit this. I've always thought that this makes no sense for users. We could also have the NCBI RefSeq Other track as a default track. We could also reach out to NCBI for an opinion.

jrobinso commented 12 months ago

@maximilianh I could add the ability to load a different annotation set for chrM / MT. It's a bit of a hack, but this is an important special case.

maximilianh commented 12 months ago

The "different annotation set" is our "RefSeq Other" track, so you can see these annotations, you just have to switch it on.

We haven't changed this (that is, "forced" the transcripts into the "RefSeq curated" track) because we try not to change annotations as they were provided by the data provider, and one annotation track should be a consistent set of annotations with, e.g., the same outlinks and the same identifiers. In this case, the chrM annotations are in the GFF file but are not part of RefSeq, so, for example, the annotations don't have accessions and the outlinks back to NCBI won't work: https://www.ncbi.nlm.nih.gov/nuccore/?term=YP_003024026.1+AND+srcdb_refseq%5BPROP%5D

I'll discuss some more with my colleagues; we usually talk things over many times before we come to a good solution. If you have suggestions or thoughts, don't hesitate to send them to us. It would be good to be consistent between the browsers on important genes like this. We could also ask RefSeq to make these genes "real" genes and at least give us a way to link to them.

maximilianh commented 12 months ago

This is a bug on our side. Either related to a file format change by NCBI at some point or something that sprang loose during release. We have code to catch the YP_ genes and put them into the curated track but it didn't work.

maximilianh commented 12 months ago

It turns out that RefSeq changed the format of the GFF file 2-3 years ago and we didn't notice, since the genes were still there, just in the wrong track.

I think I've fixed this now. Can you look at https://genome-test.gi.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chrM and see if the genes that you're looking for are there? We can then re-release the track, and IGV can import the new file.

Another user just noticed on our mailing list that the IGH genes are not in the curated track either. They're similar, have no RefSeq accession, but also, they're not real transcripts. So these are also in the "other" track now, but maybe we should move them?

Same for pseudogenes and TCR segments. These are also in the "other" track.

jrobinso commented 11 months ago

@maximilianh I can't speak for the original poster here but that looks good. IGV loads directly from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz, so if/when that gets updated it should work automatically.

If there is a bigBed version of this file, we could load from that, which would be faster.

maximilianh commented 11 months ago

No, refGene is UCSC's realignment of these sequences, which is helpful when there are indels between RefSeq and the genome. NCBI's alignment should be better for most people, and especially for mapping variants around; see our FAQ https://genome.ucsc.edu/FAQ/FAQgenes.html#duplicates. We recommend that you use NCBI's alignment, so everyone is using the same "official" mapping to the human genome. We provide that at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz and the format should be identical to refGene.txt.gz

We can totally provide the same data as a bigGenePred file. We tried to be as backwards compatible as possible with refGene so providing a bigGenePred file did not occur to me. We will look into providing it in both formats.

BTW Older versions of the data can be found at http://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/ncbiRefSeq/

jrobinso commented 11 months ago

Sorry, I was mistaken; we do use http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz. I will switch to a bigGenePred when available.

jrobinso commented 11 months ago

@maximilianh It occurs to me you already provide a BigGenePred file in GenArk, several in fact. RefSeq All, Curated, Predicted, etc. Do any of these directly correspond to ncbiRefSeq.txt.gz?

maximilianh commented 2 months ago

Sorry, I fixed this last year but forgot to reply. The main problem is solved: on chrM, the gene annotations are now in the curated subset, and have been for a year or so. They are in the same file as all other genes, even though they don't have proper RefSeq accessions. We now provide the YP_ protein accessions in lieu of RefSeq nucleotide accessions in that file, which means that outlinks will work.
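One way to confirm the fix on a local copy of ncbiRefSeq.txt.gz is to tally the accession prefixes found on chrM and look for YP_ entries. The sketch below is only an illustration. It assumes the database dump layout with a leading bin column (accession at index 1, chromosome at index 2), and `accession_prefixes` is a hypothetical helper name.

```python
import gzip
from collections import Counter
from typing import Iterable


def accession_prefixes(lines: Iterable[str], chrom: str = "chrM") -> Counter:
    """Count accession prefixes (e.g. "NM_", "YP_") for rows on one chromosome."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[2] != chrom:
            continue
        accession = fields[1]
        # Keep everything up to and including the first underscore;
        # accessions without an underscore are counted under their full name.
        if "_" in accession:
            prefix = accession.split("_", 1)[0] + "_"
        else:
            prefix = accession
        counts[prefix] += 1
    return counts


def accession_prefixes_in_file(path: str, chrom: str = "chrM") -> Counter:
    """Same tally for a local gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        return accession_prefixes(handle, chrom)
```

If the curated file includes the mitochondrial genes as described, the tally for chrM should contain a non-zero count under the "YP_" key.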

On your questions:

"I will switch to a bigGenePred when available."

Looks like we haven't made the switch ourselves yet on hg38. But I changed the main .txt.gz file, so there is no need to switch.

"It occurs to me you already provide a BigGenePred file in GenArk, several in fact. RefSeq All, Curated, Predicted, etc. Do any of these directly correspond to ncbiRefSeq.txt.gz?"

Yes, all GenArk RefSeq assemblies (those with GCF_ IDs) have a bigGenePred ncbiRefSeq file, e.g. https://hgdownload.soe.ucsc.edu/hubs/GCF/003/597/395/GCF_003597395.1/bbi/GCF_003597395.1_ASM359739v1.ncbiRefSeq.bb. There is also a "predicted" file. For most assemblies, the "predicted" file is probably what you want to show by default, as I assume there is almost no curation by RefSeq outside of the typical model organisms: https://hgdownload.soe.ucsc.edu/hubs/GCF/003/597/395/GCF_003597395.1/bbi/
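For anyone scripting against these files, the example URL above suggests a predictable path layout: the nine accession digits split into three directories, followed by the full accession and a bbi folder. The following is a rough sketch that rebuilds the URL from an accession and assembly name. The layout is inferred from that single example and is not a documented guarantee, and `genark_bigbed_url` is a hypothetical helper.

```python
import re

# Base path taken from the example URL in this thread.
GENARK_BASE = "https://hgdownload.soe.ucsc.edu/hubs"


def genark_bigbed_url(accession: str, assembly_name: str, track: str = "ncbiRefSeq") -> str:
    """Build a GenArk bigBed URL, assuming the layout seen in the example URL.

    accession:     RefSeq assembly accession, e.g. "GCF_003597395.1"
    assembly_name: assembly name as used in the file name, e.g. "ASM359739v1"
    track:         file stem before ".bb" (only "ncbiRefSeq" is confirmed in the thread)
    """
    match = re.fullmatch(r"(GCF)_(\d{3})(\d{3})(\d{3})\.\d+", accession)
    if match is None:
        raise ValueError(f"not a GCF_ assembly accession: {accession!r}")
    prefix, part1, part2, part3 = match.groups()
    return (
        f"{GENARK_BASE}/{prefix}/{part1}/{part2}/{part3}/"
        f"{accession}/bbi/{accession}_{assembly_name}.{track}.bb"
    )
```

Calling `genark_bigbed_url("GCF_003597395.1", "ASM359739v1")` reproduces the example link; for other assemblies it is worth checking the bbi directory listing, since not every track file necessarily exists for every assembly.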

igvteam / igv

The hg38 genome lacks mitochondrial gene information #1439