malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Different nomenclature in genome_features between Ag and Af #499

Closed jonbrenas closed 7 months ago

jonbrenas commented 7 months ago

Genome features for Ag come from VectorBase and those for Af come from VEuPathDB, and the two databases use slightly different nomenclatures for genome features. In particular, VectorBase has a feature type called 'gene' while VEuPathDB doesn't: it uses 'protein_coding_gene' instead. This is significant because, among other things, the function _gene_cnv (which is part of anopheles.py and thus shared between Ag and Af) looks for 'gene' features and can't find any in the Af genome features data frame. Hence, it fails completely.

alimanfoo commented 7 months ago

Thanks Jon. Let me know if you'd like to have a go at fixing.

jonbrenas commented 7 months ago

I'm going to give it a try.

alimanfoo commented 7 months ago

Cool thanks.

Btw there is already an attribute available, ._gff_gene_type, which is set to the correct value for Ag3 ("gene") and Af1 ("protein_coding_gene").

So within the _gene_cnv() method in the AnophelesDataResource class, it should be possible to replace:

df_genes = df_genome_features.query("type == 'gene'")

...with something like:

df_genes = df_genome_features.query(f"type == '{self._gff_gene_type}'")
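
As a rough illustration of why the parameterised query helps, here is a toy sketch. It is not the actual malariagen_data code: ToyResource, select_genes, make_features and the feature IDs are made up for the example, and only the _gff_gene_type attribute name and the two query strings come from this thread.

```python
import pandas as pd


def make_features(gene_type):
    # Toy stand-in for a genome features data frame; the IDs are invented.
    return pd.DataFrame(
        {
            "ID": ["feature_1", "feature_2", "feature_3"],
            "type": [gene_type, "mRNA", gene_type],
        }
    )


class ToyResource:
    """Hypothetical stand-in for a species-specific data resource."""

    def __init__(self, gff_gene_type):
        # Assumed to hold "gene" for Ag3 and "protein_coding_gene" for Af1.
        self._gff_gene_type = gff_gene_type

    def select_genes(self, df_genome_features):
        # Filter on the per-species gene type instead of a hard-coded 'gene'.
        return df_genome_features.query(f"type == '{self._gff_gene_type}'")


ag = ToyResource("gene")
af = ToyResource("protein_coding_gene")

# Af-style features: the hard-coded query matches nothing,
# while the parameterised query finds both gene rows.
df_af = make_features("protein_coding_gene")
hardcoded = df_af.query("type == 'gene'")
parameterised = af.select_genes(df_af)
print(len(hardcoded), len(parameterised))
```

With the Af-style toy data, the hard-coded query returns an empty data frame, which matches the failure described above, while the parameterised query returns the gene rows for either species.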

jonbrenas commented 7 months ago

My long term solution was to create such an attribute. Glad it already exists!