Closed by dhimmel 2 years ago
I'm looking to extract the gene_description source information in SQL, but when I use the MySQL REGEXP_SUBSTR function, I get the error:
ProgrammingError: (mysql.connector.errors.ProgrammingError) 1370 (42000): execute command denied to user 'anonymous'@'%' for routine 'homo_sapiens_core_105_38.REGEXP_SUBSTR'
Also, I don't think REGEXP_SUBSTR supports extracting matched groups.
Based on these issues, it seems like we should parse the description in Python instead.
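A minimal sketch of what parsing in Python could look like, assuming the source information is a trailing bracket of the form [Source:SOURCE;Acc:CURIE]. The function name, the tuple fields, and the sample description text are hypothetical and not taken from the ensembl-genes codebase. Descriptions without a source, and null descriptions, pass through with empty source fields.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: free text followed by a trailing "[Source:<source>;Acc:<accession>]".
_SOURCE_PATTERN = re.compile(
    r"^(?P<text>.*?)\s*\[Source:(?P<source>[^;\]]+);Acc:(?P<accession>[^\]]+)\]\s*$",
    flags=re.DOTALL,
)


class ParsedDescription(NamedTuple):
    text: Optional[str]
    source: Optional[str]
    accession: Optional[str]


def parse_gene_description(description: Optional[str]) -> ParsedDescription:
    """Split a gene description into its text, source, and accession parts."""
    if description is None:
        # Null descriptions stay null in every field.
        return ParsedDescription(None, None, None)
    match = _SOURCE_PATTERN.match(description)
    if match is None:
        # No trailing source bracket: keep the description, leave source fields empty.
        return ParsedDescription(description.strip(), None, None)
    return ParsedDescription(match["text"], match["source"], match["accession"])


# The bracketed suffix is the one quoted in this issue; the leading text is made up.
parsed = parse_gene_description(
    "hypothetical gene name [Source:HGNC Symbol;Acc:HGNC:11858]"
)
```

Named groups keep the three parts explicit, which sidesteps the missing group extraction in REGEXP_SUBSTR.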
Noting that not all descriptions have source information. Here are some examples without:
There are also cases where gene_description is null.
Rerunning 105 exports in https://github.com/related-sciences/ensembl-genes/actions/runs/1564648697 to include gene description updates.
Example gene descriptions by species:
Notice the trailing bracketed source information like "[Source:HGNC Symbol;Acc:HGNC:11858]". It would be nice to split this source information into its own column, so that the actual description can be isolated.
Question: is the source string always going to be in the format of [Source:SOURCE;Acc:CURIE] for all species and descriptions?
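One way to probe this question empirically is to scan the exported descriptions for any that mention "[Source:" but do not end in the expected bracket. The sketch below assumes the descriptions are available as a plain Python iterable; the function name and the sample inputs are hypothetical, apart from the bracketed suffix quoted in this issue.

```python
import re
from typing import Iterable, List, Optional

# Assumed strict format: a trailing "[Source:<source>;Acc:<accession>]".
_STRICT_SUFFIX = re.compile(r"\[Source:[^;\]]+;Acc:[^\]]+\]\s*$")


def find_nonconforming(descriptions: Iterable[Optional[str]]) -> List[str]:
    """Return descriptions that mention a source but do not end in the strict format."""
    return [
        description
        for description in descriptions
        if description is not None
        and "[Source:" in description
        and _STRICT_SUFFIX.search(description) is None
    ]


# Made-up inputs: one conforming, one malformed, one null, one without a source.
samples = [
    "a [Source:HGNC Symbol;Acc:HGNC:11858]",
    "b [Source:Something]",
    None,
    "c",
]
nonconforming = find_nonconforming(samples)
```

An empty result across all species would support the assumption; any hits would show which variants the parser needs to handle.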