pantherdb / pango

BSD 3-Clause "New" or "Revised" License

Normalize gene symbol and name fields #31

Closed dustine32 closed 1 year ago

dustine32 commented 1 year ago

Remove the gene_symbol and gene_name fields from annotations.json. These fields should live only in gene_info.json.

The use case is when gene symbols or names differ between the two sources they are pulled from: the upstream annotation GAFs and gene.dat. Some examples:

The result of these "annotation says vs. gene_info says" disagreements is that duplicate gene entries appear in the data: [image]

New code should always attempt to rescue blank values of these two fields by scavenging either the GAF annotations or gene.dat for any non-blank value.
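
A rough sketch of what this could look like, not the actual pango code. It assumes annotations are a list of dicts carrying a "gene" ID key, and that gene_info is a dict keyed by the same ID; both shapes, the "gene" key, and the function names are hypothetical. It also assumes a non-blank gene_info value (from gene.dat) wins over the GAF-derived value, which this issue does not specify.

```python
from typing import Any, Dict, List, Optional, Tuple

# The two fields that should live only in gene_info.json.
FIELDS = ("gene_symbol", "gene_name")


def first_non_blank(*values: Optional[str]) -> str:
    """Return the first value that is a non-empty string after stripping, else ''."""
    for value in values:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return ""


def normalize(
    annotations: List[Dict[str, Any]],
    gene_info: Dict[str, Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], Dict[str, Dict[str, Any]]]:
    """Return (annotations without symbol/name, gene_info with blanks rescued).

    Hypothetical shapes: each annotation has a "gene" ID key, and
    gene_info maps that ID to a dict holding the two fields.
    """
    # Collect the first non-blank symbol/name seen per gene in the annotations.
    from_annotations: Dict[str, Dict[str, str]] = {}
    for ann in annotations:
        seen = from_annotations.setdefault(ann["gene"], {})
        for field in FIELDS:
            seen[field] = first_non_blank(seen.get(field), ann.get(field))

    # One entry per gene ID, so the two sources can no longer disagree.
    merged: Dict[str, Dict[str, Any]] = {}
    for gene_id in dict.fromkeys([*gene_info, *from_annotations]):
        info = dict(gene_info.get(gene_id, {}))
        rescued = from_annotations.get(gene_id, {})
        for field in FIELDS:
            # Assumption: gene_info wins when both sources are non-blank.
            info[field] = first_non_blank(info.get(field), rescued.get(field))
        merged[gene_id] = info

    # Drop the two fields from every annotation record.
    stripped = [
        {key: value for key, value in ann.items() if key not in FIELDS}
        for ann in annotations
    ]
    return stripped, merged
```

Keying the merged output on gene ID alone is what should prevent the duplicate entries: symbol and name become attributes of a gene rather than part of its identity.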