What's happening here is confusing at best. There's surely an issue with the order of my commands, and I probably wasn't using the API as intended, but I'm probably not the last one to use the API like this.
mcfarland_1000_test.h5ad.zip
To reproduce:
!lamin init --storage run-tests --name run-tests --schema bionty
import anndata as ad
# file is attached
adata = ad.read_h5ad("mcfarland_1000_test.h5ad")
import lamindb as ln
import bionty as bt
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
)
curate.validate()
which gives us:
• mapping var_index on Gene.ensembl_gene_id
! found 2367 validated terms: ['BLA']
→ save terms via .add_validated_from_var_index()
! 30318 terms are not validated: 'MIR1302-10', 'FAM138A', 'OR4F5', 'RP11-34P13.7', 'RP11-34P13.8', 'AL627309.1', 'RP11-34P13.14', 'RP11-34P13.9', 'AP006222.2', 'RP4-669L17.10', 'OR4F29', 'RP4-669L17.2', 'RP5-857K21.15', 'RP5-857K21.1', 'RP5-857K21.2', 'RP5-857K21.3', 'RP5-857K21.4', 'RP5-857K21.5', 'OR4F16', 'RP11-206L10.3', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from('var_index')
So far so good. Since I saw several gene symbols among the non-validated terms, I thought I'd standardize them, so next I ran:
# Map mix of ensembl IDs and gene symbols in the var_index to ensembl IDs
gene_mapper = bt.Gene.standardize(
    curate.non_validated["var_index"],
    field="symbol",
    return_field="ensembl_gene_id",
    return_mapper=True,
    organism="human",
)
adata.var.index = adata.var.index.map(lambda x: gene_mapper.get(x, x))
adata = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
which gives us
! found 18677 symbols in Bionty: ['C1ORF220', ] blabla
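To make the map-then-filter step above concrete, here is a minimal pandas-only sketch on a toy index. The index values, the `non_validated` list, and the `gene_mapper` dict are illustrative stand-ins for `adata.var.index`, `curate.non_validated["var_index"]`, and the mapper returned by `bt.Gene.standardize`; they are not taken from the attached file.

```python
import pandas as pd

# Toy stand-in for adata.var.index: a mix of Ensembl IDs and gene symbols.
var_index = pd.Index(["ENSG00000000003", "TP53", "RP11-34P13.7", "BRCA1"])

# Hypothetical stand-ins for curate.non_validated["var_index"] and for the
# symbol -> Ensembl ID mapper (only symbols that could be mapped appear here).
non_validated = ["TP53", "RP11-34P13.7", "BRCA1"]
gene_mapper = {"TP53": "ENSG00000141510", "BRCA1": "ENSG00000012048"}

# Step 1: replace every symbol that has a mapping; leave everything else as-is.
mapped = var_index.map(lambda x: gene_mapper.get(x, x))

# Step 2: drop entries that are still in the non-validated list,
# i.e. the symbols that could not be mapped.
keep = ~mapped.isin(non_validated)
filtered = mapped[keep]

print(list(filtered))
# Sanity check: mapping can introduce duplicates if a symbol maps to an ID
# that was already present in the index.
print(filtered.is_unique)
```

In this toy case the two mappable symbols are kept as Ensembl IDs and only the unmappable `RP11-34P13.7` is dropped, which is the behaviour I expected from the two lines above.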
So for now I just added the validated ones that were caught earlier:
curate.add_validated_from_var_index()
which gives us
✓ added 20508 records from public with Gene.ensembl_gene_id for var_index: 'bla'
! 11641 non-validated values are not saved in Gene.ensembl_gene_id: ['']!
→ to lookup values, use lookup().var_index
→ to save, run add_new_from_var_index
So, for some reason, we now want to set up a Curator again, because we think we removed the unvalidated genes:
import lamindb as ln
import bionty as bt
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
)
curate.validate()
which gives us
✓ var_index is validated against Gene.ensembl_gene_id
TADA AND WE'RE DONE. So let's save the Artifact:
but what is this?
So it wasn't properly validated (for good reason)
Version information
No response