✨ Support `taxa="plants"` in `EnsemblGene`

mossjacob commented 1 week ago

Some use cases, for example adding plants, requires adding a keyword argument specifying the kingdom in EnsemblGene.

sunnyosun commented 1 week ago

Could you check if this is the correct db? https://github.com/laminlabs/bionty/blob/e75518cc5474084c15cd45a39aac0b89ee7db4cb/bionty/base/entities/_gene.py#L104

I think for plants it's a different db.

mossjacob commented 1 week ago

I think it should be the same database (although it is not working). See https://plants.ensembl.org/info/data/mysql.html

Ensembl Genomes databases from all five divisions are located on the same server

and

The following conventions apply: core databases - core__

But I can't locate, for example, a database such as

arabidopsis_thaliana_core_57_10

Which should be there as it is a model organism..

sunnyosun commented 1 week ago

Could you try mysql+mysqldb://anonymous:@ensembldb.ensembl.org/plants/{self._organism.core_db}?

mossjacob commented 1 week ago

also an Unknown database, unfortunately!

mossjacob commented 1 week ago

This is particularly odd: loading from mysql+mysqldb://anonymous:@ensembldb.ensembl.org works for items in

https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/vertebrates/mysql/

but not for items in

https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/plants/mysql

mossjacob commented 1 week ago

Found the fix. Needed to add the right port for some reason:

mysql+mysqldb://anonymous:@ensembldb.ensembl.org:4157/arabidopsis_thaliana_core_60_113_11

It seems the plant and the vertebrates databases use different ports. I will alter this PR to include a check.

sunnyosun commented 1 week ago

What about using mysql-eg-publicsql.ebi.ac.uk instead of ensembldb.ensembl.org? It's the first option here: https://plants.ensembl.org/info/data/mysql.html

mossjacob commented 1 week ago

Reference for the port fix

mossjacob commented 1 week ago

I think this is still incomplete, actually--loading in the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. Unsure if I should create a new Gene model, maybe PlantGene with a different _ontology_id_field or whether you think there is a better way to resolve this?

Zethson commented 1 week ago

@mossjacob I'll look into this. I'll report back

sunnyosun commented 1 week ago

I think this is still incomplete, actually--loading in the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. Unsure if I should create a new Gene model, maybe PlantGene with a different _ontology_id_field or whether you think there is a better way to resolve this?

Ah, this is similar to the yeast case, could you take a look here? https://bionty-assets-gczz.netlify.app/ingest/gene-ensembl-release-112#saccharomyces-cerevisiae

mossjacob commented 1 week ago

Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:

gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)

it looks for the field ensembl_gene_id which doesn't exist for these tables, in line 241 _from_values.py of lamindb;

result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)

Zethson commented 6 days ago

Dear @mossjacob,

sorry, I'm still catching up.

Please don't be confused by the failing CI. This is an issue with forked repositories not having access to our secrets. It's on my TODO list to changes.
I pushed a few changes to your PR that refactor it a bit and also fix a few issues with the download of the Gene ontologies. I am surprised that you got it to run with the current code because with the latest versions of the dependencies it didn't work for me anymore.
I created a PR that demonstrates the usage https://github.com/laminlabs/bionty-assets/pull/41 - I guess this is also where you ended up. See: https://6735e98c1a52cf5bf1340a87--bionty-assets-gczz.netlify.app/ingest/plant-gene-ensembl-release-57

After lunch, I will look into your last issue. I'll report back!

Zethson commented 6 days ago

Concerning

Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:
gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)
it looks for the field ensembl_gene_id which doesn't exist for these tables, in line 241 _from_values.py of lamindb;

result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)

@sunnyosun made me aware that this is a current limitation of from_source that does not support stable_id. I'll make an issue for this. What works for saccharomyces cerevisiae (which is similar to yours as you noted above) is the following:

!lamin init --storage run-tests --schema bionty

import lamindb as ln
import bionty as bt

# The instance is empty. Therefore, we add saccharomyces cerevisiae
bt.Organism.from_source(ontology_id="NCBITaxon:559292").save()

# Save all gene records to the instance
genes = bt.Gene.from_values(bt.Gene.public(organism="saccharomyces cerevisiae").df()["stable_id"],
                                    field="stable_id",
                                    organism="saccharomyces cerevisiae"
                                    )
ln.save(genes)

# Look at our new genes
bt.Gene.df()

Does this help you? Edit: Sorry for closing - I fat fingered the wrong button.

mossjacob commented 6 days ago

Hi! Thank you for this! I will try this out at some point today or tomorrow. For now I was using:

prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id

which I know is not ideal!

sunnyosun commented 6 days ago

Hi! Thank you for this! I will try this out at some point today or tomorrow. For now I was using:
prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id
which I know is not ideal!

It's super cool that you figured this out even though _ontology_id_field is not user-facing at all!

Then no need to try from_values, we'll make a proper fix for import_from_source!

mossjacob commented 6 days ago

I enjoy a good debug :) Thanks!

Zethson commented 6 days ago

Okay so apparently Pandas 2.2 is not compatible with sqlalchemy 1.4 which I still had on my PC. I reverted the changes now that I made earlier to the SQL statements that fixed that.

I'll make the CI run on this PR soon and then we can consider merging this.

Would you like us to also add some plant organisms genes such as arabidopsis thaliana to Bionty so that it works out of the box for you?

falexwolf commented 6 days ago

sqlalchemy < 2 is no good to use anymore! 😇 😆

falexwolf commented 6 days ago

Impressively low-level contributions! @mossjacob 😄

mossjacob commented 5 days ago

Thanks everyone. Re adding to bionty-assets, while that would be nice, I envisage using quite a few different species so adding all to bionty-assets may be overkill at this point?

In this PR there's a for loop for adding multiple organisms, and I'm also using the code below to download on the fly:

def verify_organism_exists(organism, version="release-57"):
    if bt.Source().filter(organism=organism).count() == 0:
        # Try syncing
        bt.core.sync_all_sources_to_latest()
        if bt.Source().filter(organism=organism).count() == 0:
            # If the source still does not exist, then download it.
            print("Organism does not exist in bionty.")
            print(f"Attempting to download {organism}...")
            ensembl_gene = EnsemblGene(organism=organism, version=version, kingdom="plants")
            print("URL:", ensembl_gene._url)
            df = ensembl_gene.download_df()
            df["description"] = df["description"].str.replace(r"\[.*?\]", "", regex=True)
            filename = f"df_{organism}__ensembl__{version}__Gene.parquet"
            df.to_parquet(filename)
            print(f"Downloaded {organism} to {filename}")
            raise ValueError(f"Add '{filename}' to sources_local.yaml and run bt.core.sync_all_sources_to_latest()")

With the change I made in this other PR, the URL to the local parquet file created can be added to sources_local.yaml and synced.

[edit] the code has been updated

mossjacob commented 5 days ago

Slightly altered the way gene tables are onboarded: the check for the ensembl_gene_id column to consist only of ENS-prefixed IDs is quite strong; for example, for rice (Oryza sativa), some IDs are prefixed by ENS (seems to be mostly RNA) and protein-coding genes are prefixed by "Os". Without this change, all genes are removed from the df in the else clause.

Zethson commented 5 days ago

Great @mossjacob! Thank you very much for your enthusiasm and contributions.

I had renamed kingdom to taxa. Is that fine with you or would you prefer kingdom?
There's a couple of follow up issues here that I would like to tackle in the future. 2.1 Making this less Ensembl focused because the class is even named like that. We can generalize this better. This includes https://github.com/laminlabs/bionty/issues/160 2.2 Currently the code is weirdly mixing organism and taxa. We were overloading the organism parameter to handle both but this doesn't really make sense. I would like to decouple that more clearly to get rid of the tiny hack that I introduced in this PR.

I am ready to merge the PR now unless you want to keep building here? We'll also merge your sister PR for local parquet files then.

mossjacob commented 1 day ago

Hi @Zethson , I am also ready for this to be merged in now! I still have to write a test for the local parquet file PR though. Many thanks

laminlabs / bionty

✨ Support `taxa="plants"` in `EnsemblGene` #153