laminlabs / cellxgene-lamin

Access the cellxgene data using LaminDB.
https://docs.lamin.ai/cellxgene
Apache License 2.0

Is there a tutorial for customizing `laminlabs/cellxgene` collection? #96

Open wehos opened 2 days ago

wehos commented 2 days ago

Dear developers,

Thanks for publishing such amazing work. I ran into an issue when trying to train an ML model on a subset of the CellxGene atlas. The MappedCollection workflow in LaminDB only works on a collection, which means I have to create a new collection from artifacts if I want to train on a custom subset of the CellxGene atlas. I consider /laminlabs/cellxgene a good starting point, but once I load /laminlabs/cellxgene, I don't have permission to create a new dataset.

If I create a new local instance and pull those artifacts from /laminlabs/cellxgene or the S3 buckets, it also does not work because the schema is not set up automatically.

Therefore, I'm wondering:

(1) Is there a solution for cloning everything (not only h5ad files) from a public database? (2) Alternatively, could you please provide some guidance for parsing the cellxgene atlas from scratch so that I can reproduce a local instance?

Thanks!

sunnyosun commented 1 day ago

Hi @wehos,

Thank you for the questions, we are glad that you find Lamin useful :D!

Before answering your question, I'd like to first point you to our transfer guide, which shows how to transfer artifacts across instances.

If I understand correctly, you would like to train your model using MappedCollection on a subset of cxg artifacts. Here is how you can create a local instance, and then transfer artifacts from laminlabs/cellxgene.

On the CLI, initialize an instance called mydata (make sure you have installed the latest lamindb):

lamin init --storage <path-to-a-local-dir> --name mydata --schema bionty

Now make sure your local instance is loaded (see also the transfer guide):

import lamindb as ln

# query your subset from the cellxgene instance with `using`
artifacts = ln.Artifact.using("laminlabs/cellxgene").filter(...).all()

# save the artifacts to your local instance
for artifact in artifacts:
    artifact.save()

# you can create collections in your local instance and continue

Feel free to let us know if you have more questions and if we can clarify more here!
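
The last two steps above (saving the queried artifacts, then creating a collection locally) can be sketched as one helper. This is only a sketch under stated assumptions: `transfer_and_collect` is a hypothetical name, not a lamindb function, and the `ln` argument is assumed to be the imported `lamindb` module, passed in so that the helper itself does not connect to any instance when it is defined. It relies only on calls that appear in this thread (`artifact.save()`, `ln.Collection(artifacts_list, name=...)`) plus `collection.save()`.

```python
def transfer_and_collect(ln, artifacts, name):
    """Save queried artifacts to the loaded instance and group them.

    Hypothetical helper, not part of lamindb.
    `ln` is assumed to be the imported lamindb module.
    `artifacts` is assumed to be an iterable of records queried
    from another instance with `.using(...)`.
    """
    transferred = []
    for artifact in artifacts:
        # Saving a record queried from another instance writes it
        # to the instance that is currently loaded.
        artifact.save()
        transferred.append(artifact)

    # Create the collection in the local instance from the saved artifacts.
    collection = ln.Collection(transferred, name=name)
    collection.save()
    return collection
```

With a real setup, `artifacts` would be the result of `ln.Artifact.using("laminlabs/cellxgene").filter(...).all()`, and `.mapped()` could then be called on the returned collection.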

Koncopd commented 1 day ago

Hello @wehos. Also, in the next release it will be possible to create a collection and call .mapped without saving the collection first: https://github.com/laminlabs/lamindb/pull/1942.

collection = ln.Collection(artifacts_list, name="use mapped")
mapped = collection.mapped()

wehos commented 1 day ago

Thanks for your suggestion! However, I am afraid this does not work directly. For example, suppose I want to select all the human data from cellxgene. According to the tutorial, I should do this:

import bionty as bt
import lamindb as ln
organisms = bt.Organism.lookup()
ln.Artifact.filter(organisms=organisms.human)

However, this does not work because the bt.Organism.lookup() returns an empty result. For this specific situation, is there any quick solution?

=========Update

Well, I did find a quick fix: use pickle to cache organisms.human from your official cellxgene instance, and then connect to my local instance.

wehos commented 1 day ago

Awesome! This should solve the issue perfectly. Perhaps I should try building lamindb from the latest source. Thanks for letting me know!

Koncopd commented 1 day ago

@wehos, about the empty organism lookup: does your local instance have the bionty schema? It has to be initialized with the schema as @sunnyosun showed (--schema bionty):

lamin init --storage <path-to-a-local-dir> --name mydata --schema bionty

You can check the schema on your local instance by looking at the output of ln.setup.settings.instance.schema

wehos commented 1 day ago

It was initialized, and ln.setup.settings.instance.schema returns {'bionty'}. bt.Organism.lookup() does not throw an error; it just returns an empty lookup, Lookup(dict=<bound method Lookup.dict of <lamin_utils._lookup.Lookup object at 0x7fd4b2973950>>)), when no records have been added.

As a result, I have to pull some records from the remote instance to populate the lookup table before I can select.

falexwolf commented 1 day ago

You'll need to get some records into your bionty registries. Easiest way is through

bt.Organism.import_from_source()

https://docs.lamin.ai/bionty.organism#bionty.Organism.import_from_source

sunnyosun commented 1 day ago

Could you load your local instance, and then filter artifacts like below?

import bionty as bt
import lamindb as ln
organisms = bt.Organism.using("laminlabs/cellxgene").lookup()
artifacts = ln.Artifact.using("laminlabs/cellxgene").filter(organisms=organisms.human)

Both the filter and the organism lookup need to run against the cellxgene instance via .using("laminlabs/cellxgene"). Otherwise, all operations use your local instance (the instance that is loaded; you can check which instance is loaded with the CLI command lamin info).

When you transfer artifacts, their linked metadata records are also transferred. You can then do filters and lookups on your local instance.
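
To make the point about `.using(...)` concrete, here is a minimal sketch that routes both the lookup and the filter through the same source instance. `query_human_artifacts` is a hypothetical name, not a lamindb function, and `ln` and `bt` are assumed to be the imported `lamindb` and `bionty` modules, passed in as arguments so that defining the helper does not touch any instance. The calls themselves are the ones from the snippet above.

```python
def query_human_artifacts(ln, bt, source="laminlabs/cellxgene"):
    """Query human artifacts from `source` instead of the loaded instance.

    Hypothetical helper, not part of lamindb.
    `ln` and `bt` are assumed to be the imported lamindb and bionty modules.
    """
    # The lookup must come from the source instance: a freshly
    # initialized local instance has no organism records yet.
    organisms = bt.Organism.using(source).lookup()

    # The filter must target the same source instance; without
    # `.using(source)` it would run against the loaded local instance.
    return ln.Artifact.using(source).filter(organisms=organisms.human)
```

The returned query can then be passed through the save loop shown earlier in the thread (`for artifact in artifacts: artifact.save()`), after which the same lookup and filter should work without `.using(...)`.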

falexwolf commented 1 day ago

New release is out, @wehos

https://docs.lamin.ai/changelog/2024#db-0-76-7-bionty-0-50-2