laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0

🚸 Curator logging #2211

Open falexwolf opened 1 day ago

falexwolf commented 1 day ago

I'm working with this example:

import anndata as ad
import bionty as bt
import lamindb as ln
import pandas as pd

df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"], "cell_type_by_expert": ["B cell", "T cell", "T cell"], "cell_type_by_model": ["B cell", "T cell", "T cell"]},
    index=["sample1", "sample2", "sample3"],
)
adata1 = ad.AnnData(
    df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation", "cell_type_by_expert", "cell_type_by_model"]]
)

curator = ln.Curator.from_anndata(
    adata1,
    var_index=bt.Gene.symbol,
    categoricals={"perturbation": ln.ULabel.name, "cell_type_by_expert": bt.CellType.name, "cell_type_by_model": bt.CellType.name},
    organism="human",
)
curator.validate()
curator.save_artifact(key="datasets/dataset1.h5ad")

It logs the following:

! Curating gene symbols is discouraged. See FAQ for more details.
• saving validated records of 'var_index'
• saving validated records of 'cell_type_by_expert'
✓ 'var_index' is validated against Gene.symbol
• mapping perturbation on ULabel.name
!    2 terms are not validated: 'DMSO', 'IFNG'
→ fix typos, remove non-existent values, or save terms via .add_new_from('perturbation')
✓ 'cell_type_by_expert' is validated against CellType.name
✓ 'cell_type_by_model' is validated against CellType.name

I made the first line of the logging consistent with our convention of lower-case logging messages, @Zethson; also added a link:

Ok, now, upon re-running, I get the following. Because the bionty-validated records were already saved on the first run, the logging is much less verbose:

! indexing datasets with gene symbols can be problematic: https://docs.lamin.ai/faq/symbol-mapping
✓ 'var_index' is validated against Gene.symbol
• mapping perturbation on ULabel.name
!    2 terms are not validated: 'DMSO', 'IFNG'
→ fix typos, remove non-existent values, or save terms via .add_new_from('perturbation')
✓ 'cell_type_by_expert' is validated against CellType.name
✓ 'cell_type_by_model' is validated against CellType.name

However, the three lines just for perturbation throw me off:

• mapping perturbation on ULabel.name
!    2 terms are not validated: 'DMSO', 'IFNG'
→ fix typos, remove non-existent values, or save terms via .add_new_from('perturbation')

Can we get this onto one line? And what does "remove non-existent values" mean? That seems pretty confusing. I'd simply remove it. (I get that you mean "remove values from your AnnData that aren't in the registry", but I don't see this as a practical case; it's mostly a confusing one.)

Here is a suggestion for compressing 3 lines onto 1 line:

!  'perturbation' has 2 invalid values: 'DMSO', 'IFNG' → fix or .add_new_from('perturbation')
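
To make the proposal concrete, here is a minimal sketch of how such a one-line message could be assembled. The helper name `format_invalid_message` and its parameters are hypothetical and not part of lamindb; the sketch assumes the logger adds the leading `!` icon itself, so only the message body is built here.

```python
def format_invalid_message(key: str, invalid: list[str]) -> str:
    """Build a one-line message for values that failed validation.

    Hypothetical helper for illustration only; not a lamindb API.
    """
    n = len(invalid)
    noun = "value" if n == 1 else "values"
    # Quote each value the same way the existing log output does.
    listed = ", ".join(f"'{v}'" for v in invalid)
    return f"'{key}' has {n} invalid {noun}: {listed} → fix or .add_new_from('{key}')"


print(format_invalid_message("perturbation", ["DMSO", "IFNG"]))
# 'perturbation' has 2 invalid values: 'DMSO', 'IFNG' → fix or .add_new_from('perturbation')
```

Keeping the column name at the start of the line means each failing column maps to exactly one log line, mirroring the "is validated against" lines for the passing columns.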

Can you implement this, @sunnyosun?

falexwolf commented 1 day ago

Now I've fixed everything and am calling it another time:

! indexing datasets with gene symbols can be problematic: https://docs.lamin.ai/faq/symbol-mapping
✓ 'var_index' is validated against Gene.symbol
✓ 'perturbation' is validated against ULabel.name
✓ 'cell_type_by_expert' is validated against CellType.name
✓ 'cell_type_by_model' is validated against CellType.name
! no run & transform got linked, call `ln.track()` & re-run
... storing 'perturbation' as categorical
... storing 'cell_type_by_expert' as categorical
... storing 'cell_type_by_model' as categorical
→ returning existing artifact with same hash: Artifact(uid='7pgG6hxGTyNUbcOW0000', is_latest=True, key='datasets/dataset1.h5ad', suffix='.h5ad', type='dataset', size=23352, hash='NYni1vTRM7pqfle8ufPZwQ', n_observations=3, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, created_at=2024-11-25 06:54:11 UTC)
! run input wasn't tracked, call `ln.track()` and re-run

The Curator seems to do something strange because it logs this warning twice:

! run input wasn't tracked, call `ln.track()` and re-run

@sunnyosun, can you look?