Closed mossjacob closed 1 day ago
Could you check if this is the correct db? https://github.com/laminlabs/bionty/blob/e75518cc5474084c15cd45a39aac0b89ee7db4cb/bionty/base/entities/_gene.py#L104
I think for plants it's a different db.
I think it should be the same database (although it is not working). See https://plants.ensembl.org/info/data/mysql.html
Ensembl Genomes databases from all five divisions are located on the same server
and
The following conventions apply: core databases -
core _ _
But I can't locate, for example, a database such as arabidopsis_thaliana_core_57_10, which should be there as it is a model organism.
Could you try mysql+mysqldb://anonymous:@ensembldb.ensembl.org/plants/{self._organism.core_db}?
That also gives an Unknown database error, unfortunately!
This is particularly odd: loading from mysql+mysqldb://anonymous:@ensembldb.ensembl.org works for items in
https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/vertebrates/mysql/
but not for items in
https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/plants/mysql
Found the fix. Needed to add the right port for some reason:
mysql+mysqldb://anonymous:@ensembldb.ensembl.org:4157/arabidopsis_thaliana_core_60_113_11
It seems the plant and vertebrate databases use different ports. I will alter this PR to include a check.
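A minimal sketch of what such a port check could look like. The helper name ensembl_db_url and the DIVISION_PORTS mapping are hypothetical and not part of bionty. Port 4157 for plants is the one found above; port 3306 (the MySQL default) for vertebrates is an assumption.

```python
# Hypothetical helper, not part of bionty: picks the MySQL port by Ensembl division.
ENSEMBL_HOST = "ensembldb.ensembl.org"

# 4157 is the port that worked for plants in this thread;
# 3306 (the MySQL default) for vertebrates is an assumption.
DIVISION_PORTS = {"vertebrates": 3306, "plants": 4157}


def ensembl_db_url(core_db: str, division: str = "vertebrates") -> str:
    """Build a SQLAlchemy URL for an Ensembl core database."""
    try:
        port = DIVISION_PORTS[division]
    except KeyError:
        raise ValueError(f"Unknown Ensembl division: {division!r}") from None
    return f"mysql+mysqldb://anonymous:@{ENSEMBL_HOST}:{port}/{core_db}"


print(ensembl_db_url("arabidopsis_thaliana_core_60_113_11", division="plants"))
```

Keeping the ports in one mapping means a further division only needs one new entry, and an unknown division fails loudly instead of silently falling back to the wrong port.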
What about using mysql-eg-publicsql.ebi.ac.uk instead of ensembldb.ensembl.org? It's the first option here: https://plants.ensembl.org/info/data/mysql.html
I think this is still incomplete, actually: loading the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column, which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. I'm unsure whether I should create a new Gene model, maybe PlantGene, with a different _ontology_id_field, or whether you think there is a better way to resolve this?
@mossjacob I'll look into this. I'll report back
Ah, this is similar to the yeast case, could you take a look here? https://bionty-assets-gczz.netlify.app/ingest/gene-ensembl-release-112#saccharomyces-cerevisiae
Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:
gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)
it looks for the field ensembl_gene_id, which doesn't exist for these tables, in line 241 of _from_values.py in lamindb:
result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)
Dear @mossjacob,
sorry, I'm still catching up.
After lunch, I will look into your last issue. I'll report back!
@sunnyosun made me aware that this is a current limitation of from_source, which does not support stable_id. I'll make an issue for this. What works for saccharomyces cerevisiae (which is similar to your case, as you noted above) is the following:
!lamin init --storage run-tests --schema bionty
import lamindb as ln
import bionty as bt
# The instance is empty. Therefore, we add saccharomyces cerevisiae
bt.Organism.from_source(ontology_id="NCBITaxon:559292").save()
# Save all gene records to the instance
genes = bt.Gene.from_values(
    bt.Gene.public(organism="saccharomyces cerevisiae").df()["stable_id"],
    field="stable_id",
    organism="saccharomyces cerevisiae",
)
ln.save(genes)
# Look at our new genes
bt.Gene.df()
Does this help you? Edit: Sorry for closing - I fat fingered the wrong button.
Hi! Thank you for this! I will try this out at some point today or tomorrow. For now I was using:
prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id
which I know is not ideal!
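One way to make this kind of temporary override safer is a small context manager that restores the old value even if the import raises. This is only a sketch: temporary_attr is a hypothetical helper, and the Gene class below is a stand-in for bt.Gene so the example runs without bionty.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attr(obj, name, value):
    """Temporarily set obj.name to value, restoring the old value on exit."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs even if the body raises, so the override never leaks.
        setattr(obj, name, previous)


# Stand-in for bt.Gene so the sketch runs without bionty installed.
class Gene:
    _ontology_id_field = "ensembl_gene_id"


with temporary_attr(Gene, "_ontology_id_field", "stable_id"):
    # bt.Gene.import_from_source(source=gene_source) would go here.
    print(Gene._ontology_id_field)

print(Gene._ontology_id_field)
```

It is still a workaround on a non-public attribute, but the try/finally guarantees the original field name comes back.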
It's super cool that you figured this out even though _ontology_id_field is not user-facing at all!
Then no need to try from_values, we'll make a proper fix for import_from_source!
I enjoy a good debug :) Thanks!
Okay, so apparently Pandas 2.2 is not compatible with sqlalchemy 1.4, which I still had on my PC. I have now reverted the changes I made earlier to the SQL statements to work around that.
I'll make the CI run on this PR soon and then we can consider merging this.
Would you like us to also add genes for some plant organisms, such as arabidopsis thaliana, to Bionty so that it works out of the box for you?
sqlalchemy < 2 is no good to use anymore! 😇 😆
Impressively low-level contributions! @mossjacob 😄
Thanks everyone. Re adding to bionty-assets, while that would be nice, I envisage using quite a few different species so adding all to bionty-assets may be overkill at this point?
In this PR there's a for loop for adding multiple organisms, and I'm also using the code below to download on the fly:
def verify_organism_exists(organism, version="release-57"):
    if bt.Source().filter(organism=organism).count() == 0:
        # Try syncing
        bt.core.sync_all_sources_to_latest()
    if bt.Source().filter(organism=organism).count() == 0:
        # If the source still does not exist, then download it.
        print("Organism does not exist in bionty.")
        print(f"Attempting to download {organism}...")
        ensembl_gene = EnsemblGene(organism=organism, version=version, kingdom="plants")
        print("URL:", ensembl_gene._url)
        df = ensembl_gene.download_df()
        df["description"] = df["description"].str.replace(r"\[.*?\]", "", regex=True)
        filename = f"df_{organism}__ensembl__{version}__Gene.parquet"
        df.to_parquet(filename)
        print(f"Downloaded {organism} to {filename}")
        raise ValueError(f"Add '{filename}' to sources_local.yaml and run bt.core.sync_all_sources_to_latest()")
With the change I made in this other PR, the URL to the local parquet file created can be added to sources_local.yaml and synced.
[edit] the code has been updated
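For reference, a small sketch of what the description clean-up pattern in the function above does, using the standard re module instead of pandas. The description string is made up for illustration.

```python
import re

# Same pattern as in the snippet above: a non-greedy match of "[...]" spans.
BRACKETED = re.compile(r"\[.*?\]")

# Made-up description in the style of an Ensembl annotation.
description = "protein kinase [Source:example;Acc:0000]"

cleaned = BRACKETED.sub("", description)
print(repr(cleaned))
```

Note that the space before the bracket is left behind, so a trailing strip may be worth adding if exact matches on the description matter.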
Slightly altered the way gene tables are onboarded: the check that the ensembl_gene_id column consists only of ENS-prefixed IDs is quite strict. For example, for rice (Oryza sativa), some IDs are prefixed by ENS (these seem to be mostly RNA), while protein-coding genes are prefixed by "Os". Without this change, all genes are removed from the df in the else clause.
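To illustrate why an ENS-only check is too strict for such tables, here is a minimal sketch in plain Python. It is not the actual bionty onboarding code, and the IDs are illustrative values that only mirror the ENS/Os prefix mix described above.

```python
# Illustrative IDs only: the mix of "ENS"- and "Os"-prefixed identifiers follows
# the rice example described above; the exact values are made up for the sketch.
gene_ids = ["ENSRNA049442722", "Os01g0100100", "Os01g0100200"]

# Strict check: every ID must start with "ENS".
all_ens = all(gene_id.startswith("ENS") for gene_id in gene_ids)

# A filter on the ENS prefix alone would keep only the RNA entry
# and drop the "Os"-prefixed protein-coding genes.
ens_only = [gene_id for gene_id in gene_ids if gene_id.startswith("ENS")]

print(all_ens)
print(ens_only)
```

With a mixed table like this, the strict check fails and a prefix-based filter discards exactly the protein-coding genes one wants to keep.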
Great @mossjacob! Thank you very much for your enthusiasm and contributions.
[...] kingdom to taxa. Is that fine with you or would you prefer kingdom?
[...] Ensembl focused because the class is even named like that. We can generalize this better. This includes https://github.com/laminlabs/bionty/issues/160
2.2 Currently the code is weirdly mixing organism and taxa. We were overloading the organism parameter to handle both, but this doesn't really make sense. I would like to decouple that more clearly to get rid of the tiny hack that I introduced in this PR.
I am ready to merge the PR now unless you want to keep building here? We'll also merge your sister PR for local parquet files then.
Hi @Zethson, I am also ready for this to be merged in now! I still have to write a test for the local parquet file PR though. Many thanks
Some use cases, for example adding plants, require a keyword argument specifying the kingdom in EnsemblGene.