phenopolis / phenopolis_genomics_browser

Python API and React frontend for the Phenopolis Genomics Browser
https://dev-live.phenopolis.org
MIT License
CACNA1F gene appears twice in suggestions when searching #318

Closed pontikos closed 3 years ago

pontikos commented 3 years ago

Gene CACNA1F appears twice in the search suggestions. One of the two entries is empty: it has no variants. I am assuming this is a backend issue, with the gene entered twice in the db, or is it some other problem?

alanwilter commented 3 years ago

It's an interesting case; it's not duplicated:

phenopolis_dev_db=> select * from ensembl.gene where hgnc_symbol ~ 'CACNA1F';
-[ RECORD 1 ]--------------+-------------------------------------------------------------
identifier                 | 2326
ensembl_gene_id            | ENSG00000102001
version                    | 8
description                | calcium channel, voltage-dependent, L type, alpha 1F subunit
chromosome                 | X
start                      | 49061523
end                        | 49089833
strand                     | -1
band                       | p11.23
biotype                    | protein_coding
hgnc_id                    | 1393.0
hgnc_symbol                | CACNA1F
percentage_gene_gc_content | 52.87
assembly                   | GRCh37
-[ RECORD 2 ]--------------+-------------------------------------------------------------
identifier                 | 22617
ensembl_gene_id            | ENSG00000269057
version                    | 1
description                | calcium channel, voltage-dependent, L type, alpha 1F subunit
chromosome                 | HG1436_HG1432_PATCH
start                      | 49064462
end                        | 49092770
strand                     | -1
band                       | p11.23
biotype                    | protein_coding
hgnc_id                    | 1393.0
hgnc_symbol                | CACNA1F
percentage_gene_gc_content | 52.86
assembly                   | GRCh37
-[ RECORD 3 ]--------------+-------------------------------------------------------------
identifier                 | 44900
ensembl_gene_id            | ENSG00000102001
version                    | 13
description                | calcium voltage-gated channel subunit alpha1 F
chromosome                 | X
start                      | 49205063
end                        | 49233371
strand                     | -1
band                       | p11.23
biotype                    | protein_coding
hgnc_id                    | HGNC:1393
hgnc_symbol                | CACNA1F
percentage_gene_gc_content | 52.86
assembly                   | GRCh38

However, there may be something strange about it. I can't find ENSG00000269057 (rec 2) anywhere in UniProt, not even in UniParc, although Ensembl says it's linked to UniProt O60840. Also note the chromosome | HG1436_HG1432_PATCH for rec 2. I know we are filtering out GRCh38 (so rec 3 should not be seen in our app) but perhaps we should also filter out entries without a clear chromosome. What do you think?

For the moment I will nag the UniProt guys about this.

alanwilter commented 3 years ago

You also spotted another problem: I need to refactor our code to stop looking into the public.genes table and use only the ensembl.gene table (new schema). But the "duplication" is in both.

alanwilter commented 3 years ago

And actually, I found out that we are not filtering out GRCh38; it's just that we're still searching in the old schema, so with the new schema we'd see two or more entries for everything. I'm going to rectify that and leave the assembly (GRCh38 or GRCh37) as a variable or parameter, since I guess one day we'll move to GRCh38.
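A minimal sketch of what "assembly as a parameter" could look like. It uses an in-memory SQLite table named gene as a stand-in for ensembl.gene, with only a few of the columns shown above; the function name find_genes and the SQLite setup are assumptions for illustration, not the actual Phenopolis code.

```python
import sqlite3


def find_genes(conn, symbol, assembly="GRCh37"):
    """Return (identifier, ensembl_gene_id, chromosome) rows for one assembly.

    The assembly is a bound query parameter, so moving to GRCh38 later
    means changing the default (or a config value), not the query text.
    """
    cur = conn.execute(
        "SELECT identifier, ensembl_gene_id, chromosome FROM gene "
        "WHERE hgnc_symbol = ? AND assembly = ? ORDER BY identifier",
        (symbol, assembly),
    )
    return cur.fetchall()


# Stand-in table holding the three CACNA1F records from the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (identifier INTEGER, ensembl_gene_id TEXT, "
    "chromosome TEXT, hgnc_symbol TEXT, assembly TEXT)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [
        (2326, "ENSG00000102001", "X", "CACNA1F", "GRCh37"),
        (22617, "ENSG00000269057", "HG1436_HG1432_PATCH", "CACNA1F", "GRCh37"),
        (44900, "ENSG00000102001", "X", "CACNA1F", "GRCh38"),
    ],
)

grch37_rows = find_genes(conn, "CACNA1F")
grch38_rows = find_genes(conn, "CACNA1F", assembly="GRCh38")
```

With the default assembly, rec 1 and rec 2 still both come back, so the assembly filter alone does not remove the patch entry.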

pontikos commented 3 years ago

Thanks Alan. Exactly, we should be filtering out non-chromosomes (HG1436_HG1432_PATCH) and anything not on build 37 (GRCh38).

So only keep chromosomes 1-22, X, Y and GRCh37.

Can we also have an extra filter so that we don't return genes which have no variants associated?
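The chromosome and build rule above could be sketched as a simple predicate. The names CANONICAL_CHROMOSOMES and is_searchable are hypothetical, the records are cut-down copies of the three rows from the query earlier in the thread, and the "no variants" filter is not covered here.

```python
# Chromosomes 1-22 plus X and Y; patch scaffolds fall outside this set.
CANONICAL_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_searchable(gene, assembly="GRCh37"):
    """Keep a gene only if it is on the wanted build and a canonical chromosome."""
    return (
        gene["assembly"] == assembly
        and gene["chromosome"] in CANONICAL_CHROMOSOMES
    )


genes = [
    {"identifier": 2326, "chromosome": "X", "assembly": "GRCh37"},
    {"identifier": 22617, "chromosome": "HG1436_HG1432_PATCH", "assembly": "GRCh37"},
    {"identifier": 44900, "chromosome": "X", "assembly": "GRCh38"},
]

kept = [g["identifier"] for g in genes if is_searchable(g)]
```

For CACNA1F this leaves only identifier 2326, i.e. a single suggestion.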

alanwilter commented 3 years ago

And what about 'MT'? I will start with basic filters. I even thought we could use 'SwissProt', I mean, only keep genes that are linked to a reviewed UniProt entry.

pontikos commented 3 years ago

I don't think we have any genetic variants on MT, so we can ignore it for now.

alanwilter commented 3 years ago

ok

pontikos commented 3 years ago

This relates to issue #328: Please remove genes from search if they are not on chromosomes 1...22, X, Y