theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
149 stars 27 forks source link

Initial comments #1

Closed LTLA closed 4 years ago

LTLA commented 4 years ago

Can't say I get the name, but whatever. Anyway, here's my comments:

basilisk.R

You should consider exporting a character vector that defines the set of Python packages you're using. This will allow downstream R packages to easily specify the same dependencies in their calls to BasiliskEnvironment, thus using your package to pin the versions.

Note that, for configure to work, this variable needs to be defined in basilisk.R itself.

Also, while not strictly necessary, it is probably good practice to pin the versions of as many dependencies as you can. Probably a quick conda env list will tell you what to add.

konverter.R

This is what I wanted but you should note that this is not a function that is intended to be run by end-users. In general, it is not valid to pass reticulate objects into basiliskRun calls as there is no guarantee that basilisk will use the same Python process as the one containing those objects. This limitation is necessary for basilisk to guarantee that the code will work, even if you have an incompatible Python process loaded into your current R session.

Rather, the intended use case is for other developers to call zellkonverter:: adata2SCE inside their basiliskRun function bodies, on the adata objects they created. This would save everyone from having to re-define the conversion function. It is also the motivation behind the version control suggested for basilisk.R as it ensures we are all talking about the same version of AnnData.

For this purpose, adata2SCE does not need to protect itself with basiliskStart, as this should be handled by the calling context. Might also want to slap a dot in front to indicate it's for devs.

read.R

This looks fine but note that adata2SCE may not be available if a separate process is loaded. The simple solution is to call zellkonverter:: adata2SCE but this will trigger CHECK warnings about not needing to namespace functions from the same package. The better solution is to move fun into an internal function and then call that internal function from basiliskRun, in which case the package namespace is automatically loaded in the child process.

A good smoke test is to set setBasiliskShared(FALSE) and setBasiliskFork(FALSE) and see whether any of your functions still work. This will trigger the creation of entirely independent processes to execute the function passed to basiliskRun, so if your code relied on shared memory or contexts, it will fail hard.

lazappi commented 4 years ago

Yeah I'm not entirely sold on the name either. It's German for "cell converter" which is the best I could come up with. Couldn't think of any good Python/cell puns.

Thanks for the comments, that's really helpful! I think I understand most of them. I will try and make those changes and see how that goes.

lazappi commented 4 years ago

I think I addressed the first two points but haven't quite fixed the part in read.R. I'm getting an error from parallel::makePSOCKcluster() which doesn't seem related to anything else so I guess I'll have to try and sort that out first.

LTLA commented 4 years ago

Just to be clear, this was what I was proposing for read.R:

readH5AD <- function(file) {
    proc <- basilisk::basiliskStart(anndata_env)
    on.exit(basilisk::basiliskStop(proc))

    file <- path.expand(file)

    basilisk::basiliskRun(proc, fun=.converter, file=file)
}

.converter <- function(file) {
    anndata <- reticulate::import("anndata")
    adata <- anndata$read_h5ad(file)
    AnnData2SCE(adata)
}
LTLA commented 4 years ago

I would help in testing but I'd need a SCE2AnnData utility to go the other way. Probably slap some examples together with some of the smaller datasets in scRNAseq and I can play around with it.

lazappi commented 4 years ago

I have written basic SCE2AnnData() and WriteH5AD() functions but no examples. Should probably start a vignette at some point and need some test datasets...

I have written the reader/writer functions to call an internal function but not sure that is enough for things to work with environment sharing turned off. Need to play around with that a bit more.

LTLA commented 4 years ago

It seems to work well but we should add some tests that run with those flags on.