High-level structure of docs

falexwolf commented 14 hours ago

Where to put "modeling perturbations"?

Currently it's here:

In my opinion, "perturbation" is not a "data type" and doesn't fit the other documents under this heading.

Given we agreed to keep perturbation modeling as part of the wetlab schema, we could make a "Manage wetlab registries" guide in analogy to "Manage biological registries". This seems cumbersome though.

I think we have a pretty good way of showing things under "Atlases" for "cellxgene". It first shows what you can do, and then second shows how you curate things for it.

How about we make a "Perturbations atlas" sub-heading and add the current doc under it in complete analogy to:

@Zethson?

Data types

We could re-debate whether we want to use "data type" at all or use the more precise "data modality".

In other places, we distanced ourselves from the term "data type" for lack of precision. Also, HuggingFace doesn't use it.

See:

Discussions:

Zethson commented 14 hours ago

> In my opinion, "perturbation" is not a "data type"

Pretty sure that colloquially people use the term as such. But yeah, it's not a standalone data type for sure.

> I think we have a pretty good way of showing things under "Atlases" for "cellxgene". It first shows what you can do, and then second shows how you curate things for it.

The potential new one could go there but certainly not what we currently have. I know that we're preparing this for a specific client but I worry that people that want to curate perturbation datasets would not look under this header. I think that it's very unlikely that they'll find it to be honest.

I'd vote more in favor of making this less strict and go with "data modality" or something else.

falexwolf commented 13 hours ago

> The potential new one could go there but certainly not what we currently have. I know that we're preparing this for a specific client but I worry that people that want to curate perturbation datasets would not look under this header. I think that it's very unlikely that they'll find it to be honest.

How about putting it under "Curate datasets" then as an example? And then in the scrna guide under "data types" one could add a cross-link and say "if you'd like to curate a dataset that has perturbation information, see ...".

> I'd vote more in favor of making this less strict and go with "data modality" or something else.

"Data modality" is more precise than "data type" and excludes "perturbation". You can have perturbational data and read out with an imaging modality etc.

Zethson commented 13 hours ago

> How about putting it under "Curate datasets" then as an example? And then in the scrna guide under "data types" one could add a cross-link and say "if you'd like to curate a dataset that has perturbation information, see ...".

I like that even less. To be honest, I like the current guide where it is and would much rather adapt the header to cover "perturbation" as well.

My proposal is:

- Add a new perturbation guide to CxG that uses the new upcoming Curator.
- Generalize the existing perturbation guide (rewrite) and keep it under "data type" (or a new more general header). This could link to the CxG version.

> "Data modality" is more precise than "data type" and excludes "perturbation".

I get it, although at least theislab abuses this term regularly and states that "perturbation" is a modality. What about "data domain", "experimental modality", or really just keeping "data type"?

I really like the organized content under the current header because people ask "I have spatial data - where's the lamin guide for it" or "I have perturbation data - where's the lamin guide for it"? I think that it's great that all of these are under a single header and not more dispersed in the docs.

falexwolf commented 12 hours ago

> I really like the organized content under the current header because people ask "I have spatial data - where's the lamin guide for it" or "I have perturbation data - where's the lamin guide for it"? I think that it's great that all of these are under a single header and not more dispersed in the docs.

I see and get it.

If this is what we want, I'd call the header "Data journeys" or something like this. It's where you go to find complete guides to handle "scRNA-seq" or "perturbational data" or "spatial data" etc. and you don't expect these to be mutually exclusive.

We could even call "Data types" just "Use cases" and ditch or hide all this nebulous material that needs to be completely re-worked:

I mean the "Atlas" section with "CELLxGENE" is great, but it's not a use case, it's more an example of what you can do when you follow the scRNA guides. Sure, one could say "Query CELLxGENE" is a use case. 🤔

I have an idea: All of what we currently label as "How to" are the reference guides on the atomic operations within the workflow:

Everything under what's currently labeled "Data types" gives users the full workflow for close-to-real-world examples.

How about we bring both together under "How to" and call one:

HOW TO

Basic usage
- Install & setup
- Query & search
- Manage notebooks & scripts
- Transfer data
- Curate datasets
- Manage biological registries (and under it, public ontologies)
- Manage schema modules 

Data journeys
- scRNA-seq
- ...
- Perturbational data

Integrations
- Pipeline managers
- Visualization
- MLOps

... ahh, but of course there are problems with this as well.

I don't have the time to properly think through this right now; let's come back here another time and keep things as they are for now.

laminlabs / lamin-docs

High-level structure of docs #193
