related-sciences / ensembl-genes

Extract the Ensembl genes catalog to simple tables

Separate description source from gene description text #11

Closed dhimmel closed 2 years ago

dhimmel commented 3 years ago

Example gene descriptions by species:

Notice the trailing bracketed source information like "[Source:HGNC Symbol;Acc:HGNC:11858]". It would be nice to split this source information into its own column, so that the actual description text can be isolated.

Question: is the source string always going to be in the format of [Source:SOURCE;Acc:CURIE] for all species and descriptions?

dhimmel commented 2 years ago

I'm looking to extract the gene_description source information in SQL, but when I use the MySQL REGEXP_SUBSTR function, I get the error:

ProgrammingError: (mysql.connector.errors.ProgrammingError) 1370 (42000): execute command denied to user 'anonymous'@'%' for routine 'homo_sapiens_core_105_38.REGEXP_SUBSTR'

Also, I don't think REGEXP_SUBSTR supports extracting matched groups.

Based on these issues, it seems like we should parse the description in Python instead.
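For illustration, here is a minimal sketch of what Python-side parsing could look like, using named regex groups to pull out the pieces that REGEXP_SUBSTR could not. It assumes the suffix always follows the `[Source:SOURCE;Acc:CURIE]` shape asked about above, which is not yet confirmed for all species. The names `parse_gene_description`, `ParsedDescription`, `source_db`, and `source_id` are hypothetical and not taken from the repository.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the source info is a trailing "[Source:SOURCE;Acc:CURIE]" suffix.
_SOURCE_PATTERN = re.compile(
    r"^(?P<description>.*?)\s*"
    r"\[Source:(?P<source_db>[^;\]]+);Acc:(?P<source_id>[^\]]+)\]\s*$",
    flags=re.DOTALL,
)


class ParsedDescription(NamedTuple):
    """Hypothetical container for the split-out fields."""

    description: Optional[str]
    source_db: Optional[str]
    source_id: Optional[str]


def parse_gene_description(text: Optional[str]) -> ParsedDescription:
    """Split a raw gene description into text, source database, and accession."""
    if text is None:
        # Null descriptions pass through as all-None.
        return ParsedDescription(None, None, None)
    match = _SOURCE_PATTERN.match(text)
    if match is None:
        # No recognizable source suffix: keep the text, leave source fields empty.
        return ParsedDescription(text.strip(), None, None)
    return ParsedDescription(
        match["description"], match["source_db"], match["source_id"]
    )
```

Calling `parse_gene_description("placeholder text [Source:HGNC Symbol;Acc:HGNC:11858]")` would give a description of `placeholder text`, a source of `HGNC Symbol`, and an accession of `HGNC:11858`. The description text in that call is made up; only the bracketed suffix comes from this issue.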

dhimmel commented 2 years ago

Noting that not all descriptions have source information. Here are some examples without:

There are also cases where gene_description is null.
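To get a feel for how common each of these cases is, something like the following sketch could tally descriptions into three buckets: with a source suffix, without one, and null. It again assumes the `[Source:SOURCE;Acc:CURIE]` shape. The function names and bucket labels are hypothetical, chosen only for this example.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Assumption: a sourced description ends with "[Source:SOURCE;Acc:CURIE]".
_HAS_SOURCE = re.compile(r"\[Source:[^;\]]+;Acc:[^\]]+\]\s*$")


def classify_description(text: Optional[str]) -> str:
    """Return a bucket label for one raw gene description."""
    if text is None:
        return "null"
    if _HAS_SOURCE.search(text):
        return "with_source"
    return "without_source"


def tally_descriptions(descriptions: Iterable[Optional[str]]) -> Counter:
    """Count how many descriptions fall into each bucket."""
    return Counter(classify_description(d) for d in descriptions)
```

Running `tally_descriptions` over the gene_description column for a species would show whether the unsourced and null cases are rare exceptions or a sizable share of the catalog.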

dhimmel commented 2 years ago

Rerunning 105 exports in https://github.com/related-sciences/ensembl-genes/actions/runs/1564648697 to include gene description updates.