opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

ETL Input - Target step, Gene Ontology RNA Lookup dataset data model and content refactoring proposal #3078

Open mbdebian opened 10 months ago

mbdebian commented 10 months ago

Background

The Gene Ontology RNA Lookup dataset is collected by PIS according to this configuration:

- uri: https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/ensembl.tsv
  output_filename: ensembl.tsv
  path: target-inputs/go

It is part of the input for the target step, as per this ETL configuration block:

gene-ontology-rna-lookup {
    format = "csv"
    path = ${common.input}"/target-inputs/go/ensembl.tsv"
    options = [
        {k: "sep", v: "\\t"}
    ]
}

A data sample taken from the file looks like this:

URS0000000055 ENSEMBL ENST00000585414 9606 lncRNA ENSG00000226803.9
URS00000000C9 ENSEMBL ENST00000514011 9606 lncRNA ENSG00000248309.9
URS00000000FD ENSEMBL ENST00000448543 9606 lncRNA ENSG00000234279.2
URS0000000344 ENSEMBL ENST00000633884 9606 lncRNA ENSG00000282594.1
URS0000000351 ENSEMBL ENST00000452009 9606 lncRNA ENSG00000235427.1
URS00000005D1 ENSEMBL ENST00000563639 9606 lncRNA ENSG00000260457.2
URS000000074D ENSEMBL ENSOCUT00000033131 9986 tRNA ENSOCUG00000029106.1
URS0000000787 ENSEMBL ENST00000452952 9606 lncRNA ENSG00000206142.9
URS0000000AA1 ENSEMBL ENST00000615750 9606 lncRNA ENSG00000277089.4
URS0000000AA1 ENSEMBL ENST00000633983 9606 lncRNA ENSG00000282721.1
URS0000000C0D ENSEMBL ENST00000582841 9606 lncRNA ENSG00000265443.1
URS0000000CF3 ENSEMBL ENST00000414886 9606 lncRNA ENSG00000226856.9
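
To illustrate what consuming this file looks like today, here is a minimal Python sketch (not the ETL's actual code, which reads the file through the configuration block above). It assumes the fields are tab-separated, as the `sep` option indicates, and uses two rows copied from the sample. The function name `read_headerless` is hypothetical.

```python
import csv
import io

# Two rows copied from the sample above, with tabs restored between fields
# (assumption: the real file is tab-separated, as the ETL "sep" option says).
SAMPLE = (
    "URS0000000055\tENSEMBL\tENST00000585414\t9606\tlncRNA\tENSG00000226803.9\n"
    "URS000000074D\tENSEMBL\tENSOCUT00000033131\t9986\ttRNA\tENSOCUG00000029106.1\n"
)


def read_headerless(text):
    """Parse headerless TSV rows; fields are only reachable by position."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = read_headerless(SAMPLE)

# Without a heading line, the reader has to know out of band which
# position holds which value (here: first and last field of each row).
for row in rows:
    print(row[0], row[5])
```

With no heading line, every consumer has to hard-code the column positions or assign names itself.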

Data model refactoring proposal

The data is a plain TSV file that lacks a heading line. Judging by how the ETL uses the file, it would be useful to add the following heading line to its content:

rnaCentralId database externalId ncbiTaxonId rnaType ensemblId

Given that the file is just a way of grouping that mapping information, that it is plain TSV with no additional metadata, and regardless of whether its content is documented externally, I think including that heading line would be useful.
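
As a rough sketch of the proposal, the following Python snippet prepends the heading line to the TSV content and then reads the rows by column name. The helper names `add_header` and `read_with_header` are hypothetical, and the tab separator is an assumption based on the `sep` option in the ETL configuration. Where in PIS or the ETL this step would actually live is left open.

```python
import csv
import io

# Heading line proposed in this issue, one name per column.
HEADER = [
    "rnaCentralId",
    "database",
    "externalId",
    "ncbiTaxonId",
    "rnaType",
    "ensemblId",
]


def add_header(text, header=HEADER):
    """Return the TSV text with the tab-separated heading line prepended."""
    return "\t".join(header) + "\n" + text


def read_with_header(text):
    """Parse TSV text whose first line is the heading line."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# One row copied from the data sample, with tabs restored between fields.
SAMPLE = "URS0000000AA1\tENSEMBL\tENST00000615750\t9606\tlncRNA\tENSG00000277089.4\n"

records = read_with_header(add_header(SAMPLE))

# Fields can now be addressed by name instead of by position.
print(records[0]["rnaCentralId"], records[0]["ensemblId"])
```

If the file carried this heading line, a CSV reader on the ETL side would also need to be told to treat the first line as a header rather than as data.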