laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0

Validated var index by Curator but when saving some genes are not validated #1966

Open Zethson opened 1 week ago

Zethson commented 1 week ago

Report

What's happening here is confusing at best. There's surely an issue with the order of my commands, and I probably wasn't using the API as intended, but I'm probably not the last one to use the API like this.

Attached file: mcfarland_1000_test.h5ad.zip

To reproduce:

!lamin init --storage run-tests --name run-tests --schema bionty

import anndata as ad
# file is attached
adata = ad.read_h5ad("mcfarland_1000_test.h5ad")

import lamindb as ln
import bionty as bt

curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
)

curate.validate()

which gives us:

• mapping var_index on Gene.ensembl_gene_id
!    found 2367 validated terms: ['BLA']
      → save terms via .add_validated_from_var_index()
!    30318 terms are not validated: 'MIR1302-10', 'FAM138A', 'OR4F5', 'RP11-34P13.7', 'RP11-34P13.8', 'AL627309.1', 'RP11-34P13.14', 'RP11-34P13.9', 'AP006222.2', 'RP4-669L17.10', 'OR4F29', 'RP4-669L17.2', 'RP5-857K21.15', 'RP5-857K21.1', 'RP5-857K21.2', 'RP5-857K21.3', 'RP5-857K21.4', 'RP5-857K21.5', 'OR4F16', 'RP11-206L10.3', ...
      → fix typos, remove non-existent values, or save terms via .add_new_from('var_index')

So far so good. Since I saw several gene symbols among the non-validated terms, I wanted to standardize them to Ensembl IDs, so next I ran:

# Map mix of ensembl IDs and gene symbols in the var_index to ensembl IDs
gene_mapper = bt.Gene.standardize(
    curate.non_validated["var_index"],
    field="symbol",
    return_field="ensembl_gene_id",
    return_mapper=True,
    organism="human",
)
adata.var.index = adata.var.index.map(lambda x: gene_mapper.get(x, x))

adata = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()

which gives us

! found 18677 symbols in Bionty: ['C1ORF220', ] blabla
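To make the effect of the remap-and-filter step above easier to follow, here is a minimal, dependency-free sketch. Plain Python lists stand in for adata.var.index, and every value (ENSG_A, SYM_B, and so on) is made up for illustration; nothing here calls lamindb or bionty. It only mirrors the logic of the two lines that use gene_mapper and curate.non_validated["var_index"].

```python
# Toy stand-ins; all names and values are hypothetical.
var_index = ["ENSG_A", "SYM_B", "SYM_C"]  # mix of IDs and symbols
non_validated = ["SYM_B", "SYM_C"]        # what the first validate() flagged
gene_mapper = {"SYM_B": "ENSG_B"}         # SYM_C has no known mapping

# Step 1: same logic as adata.var.index.map(lambda x: gene_mapper.get(x, x))
remapped = [gene_mapper.get(name, name) for name in var_index]

# Step 2: same logic as ~adata.var.index.isin(non_validated),
# applied AFTER the index has already been remapped
stale = set(non_validated)
kept = [name for name in remapped if name not in stale]

print(remapped)  # ['ENSG_A', 'ENSG_B', 'SYM_C']
print(kept)      # ['ENSG_A', 'ENSG_B']
```

In this toy case the filter only removes names that could not be mapped. ENSG_B passes through because it is no longer spelled like anything in the pre-remap list, even though the first validation never checked it. Whether this accounts for the numbers reported later in this issue is not established here.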

So for now I just added the validated ones that were caught earlier:

curate.add_validated_from_var_index()

which gives us

✓ added 20508 records from public with Gene.ensembl_gene_id for var_index: 'bla'
! 11641 non-validated values are not saved in Gene.ensembl_gene_id: ['']!
      → to lookup values, use lookup().var_index
      → to save, run add_new_from_var_index

So, thinking we had removed the unvalidated genes, we set up a Curator again:

import lamindb as ln
import bionty as bt

curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
)

curate.validate()

which gives us

✓ var_index is validated against Gene.ensembl_gene_id

TADA AND WE'RE DONE. So let's save the Artifact:

artifact = curate.save_artifact(description="bla")

but what is this?

    18677 unique terms (88.50%) are validated for ensembl_gene_id
!    2420 unique terms (11.50%) are not validated for ensembl_gene_id: BLA

So the var index wasn't properly validated after all (for good reason), even though curate.validate() reported success.

Version information

No response

Zethson commented 6 days ago

adata = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy() is surely an issue.
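The comment does not spell out what is wrong with that line. One possible reading is that it filters an already remapped index against names collected before the remapping. Under that assumption, a more explicit variant could build the keep mask from what is positively known (previously validated IDs plus the mapper's outputs) and drop duplicates created by the remapping. The helper remap_with_mask and all toy values below are hypothetical and are not part of lamindb or bionty.

```python
def remap_with_mask(names, validated, mapper):
    """Remap names via mapper and return (new_names, mask).

    mask[i] is True only if the remapped name is explicitly accounted for
    (already validated or produced by the mapper) and not seen before.
    """
    allowed = set(validated) | set(mapper.values())
    seen = set()
    new_names, mask = [], []
    for name in names:
        new = mapper.get(name, name)
        keep = new in allowed and new not in seen
        mask.append(keep)
        if keep:
            seen.add(new)
            new_names.append(new)
    return new_names, mask


# Toy data: SYM_C is unmappable, SYM_D maps onto an ID that is already present.
names = ["ENSG_A", "SYM_B", "SYM_C", "SYM_D"]
validated = ["ENSG_A"]
mapper = {"SYM_B": "ENSG_B", "SYM_D": "ENSG_A"}

new_names, mask = remap_with_mask(names, validated, mapper)
print(new_names)  # ['ENSG_A', 'ENSG_B']
print(mask)       # [True, True, False, False]
```

With an AnnData object, a boolean mask of this shape could be used to subset the columns before assigning the new names, so the filter never depends on how the old names were spelled. The remapped IDs would still need to be validated, since the mapper's output is not itself a validation result.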