Closed smoe closed 2 months ago
nf-core lint
Overall result: Passed :white_check_mark: :warning:
Posted for pipeline commit bf46ede
+| ✅ 197 tests passed |+
!| ❗ 14 tests had warnings |!
Hey,
the rough structure of the pipeline looks like this:
Various steps in the pipeline produce outputs that need to be added to .obs or .obsm etc. If the pipeline inputs are really large (10s of GB), always storing the whole AnnData object with the new info, and only at the end extracting the relevant info from all objects, would use a lot of disk space. To prevent this, the pipeline stores only the delta in each step and finally aggregates all deltas onto a single AnnData object.
So most of the files you pointed out are deltas and also included in the final output. I am not sure how useful it is for people to use them in isolation.
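To make the delta idea concrete, here is a minimal sketch in plain Python. It does not use the pipeline's actual code or file formats: nested dictionaries stand in for an AnnData object, and every name (`base`, `deltas`, `apply_deltas`, the column names) is hypothetical.

```python
# Stand-in for an AnnData-like object: per-cell annotations (like .obs)
# and per-cell embeddings (like .obsm). All names here are hypothetical.
base = {
    "obs": {"cell_1": {}, "cell_2": {}},
    "obsm": {},
}

# Each step writes only what it adds, not a full copy of the object.
deltas = [
    {"obs": {"cell_1": {"cluster": "0"}, "cell_2": {"cluster": "1"}}},
    {"obsm": {"X_umap": {"cell_1": [0.1, 0.2], "cell_2": [1.3, -0.4]}}},
    {"obs": {"cell_1": {"doublet": False}, "cell_2": {"doublet": True}}},
]


def apply_deltas(base, deltas):
    """Merge every delta onto a copy of the base object."""
    merged = {
        "obs": {cell: dict(cols) for cell, cols in base["obs"].items()},
        "obsm": dict(base["obsm"]),
    }
    for delta in deltas:
        # New per-cell columns are added to the matching cell.
        for cell, columns in delta.get("obs", {}).items():
            merged["obs"][cell].update(columns)
        # New embeddings are added under their own key.
        merged["obsm"].update(delta.get("obsm", {}))
    return merged


final = apply_deltas(base, deltas)
print(final["obs"]["cell_2"])
print(sorted(final["obsm"]))
```

The point of the sketch is only that each delta is small compared with the full object, and that the complete picture exists once, in the aggregated result.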
Also, we should not only document the output but also make its structure more intuitive. Currently it is still the nf-core default: each output is put into a directory named after the prefix (up to the first underscore) of the process creating it. This can and should be changed, but so far it has not been relevant for functionality development. An example of doing this properly can be found here.
These were magic words of yours. Would you mind me adding them to this documentation-idea-gathering branch? I would then split the table I came up with into "use this for CELLxGENE et al." and "secondary intermediary files that contribute to the final 'AnnData' object".
There are several apparent top-level files (./adata/merged_inner.h5ad, ./adata/merged.h5ad, ./adata/merged_outer.h5ad), and it was not obvious to me how these should be interpreted. Would you have an easily accessible description for those? I would then incorporate that into the output.md file.
Yes, feel free to add anything you think helps people understand the pipeline to the docs.
I would also like to point people to cellxgene less strongly. It is a good way of visualizing the data and we should definitely mention it, but we also have our own reports. Their quality will improve soon, once I have more time.
For the inner and outer merged objects: basically, when concatenating the various input datasets there are two ways of handling the genes: either an inner join (only genes present in all datasets) or an outer join (all genes present in any dataset). That's what you will find here. Biologists often want to see all their favourite genes in the final object, so the final result will contain the outer join. For integration, only the inner join is used, as missing genes often lead to a clustering that is based more on the presence or absence of genes than on actual biological differences. Because of this, the integration is performed on the intersection. We could also use a limited number of highly variable genes here.
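As a small illustration of the two joins, the following sketch uses plain Python sets in place of the gene indices of real datasets. The sample names and gene lists are made up for the example.

```python
# Hypothetical gene sets per input dataset.
datasets = {
    "sample_a": {"CD3E", "CD4", "MS4A1"},
    "sample_b": {"CD3E", "CD4", "NKG7"},
    "sample_c": {"CD3E", "MS4A1", "NKG7"},
}

# Inner join: only genes measured in every dataset (used for integration).
inner_genes = set.intersection(*datasets.values())

# Outer join: every gene measured in at least one dataset (final result).
outer_genes = set.union(*datasets.values())

# Genes that the outer join has to fill in for each dataset.
missing = {name: outer_genes - genes for name, genes in datasets.items()}

print(sorted(inner_genes))
print(sorted(outer_genes))
print({name: sorted(genes) for name, genes in missing.items()})
```

Here the inner join keeps a single gene while the outer join keeps four, and each sample is missing one gene that has to be filled in. Those filled-in entries are what can drive clustering by dataset rather than by biology.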
A bit more technical background: when concatenating the anndata objects with an outer join, the missing values are filled with 0s instead of NaNs. This is because 0 values use nearly no memory in the sparse representation. Filling them with NaNs would make the memory blow up.
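A toy sketch of why the fill value matters: a sparse representation stores only entries that differ from zero, so a NaN has to be stored explicitly. The dictionary-based `to_sparse` helper below is a hypothetical stand-in for a real sparse matrix format, and the numbers are made up.

```python
import math


def to_sparse(row):
    """Keep only entries that differ from zero, as a sparse format would."""
    return {index: value for index, value in enumerate(row) if value != 0}


# One cell with three measured genes and 1000 genes missing from its dataset.
measured = [5.0, 0.0, 2.0]
n_missing = 1000

zero_filled = measured + [0.0] * n_missing
nan_filled = measured + [math.nan] * n_missing

# NaN is not equal to 0, so every NaN must be stored explicitly.
print(len(to_sparse(zero_filled)))
print(len(to_sparse(nan_filled)))
```

With zero filling, only the two non-zero measurements are stored. With NaN filling, every missing gene becomes a stored entry as well.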
Just another iteration.
@smoe, thanks for your work so far. I just merged #71, which does similar things to this PR. However, I am sure we can recycle some things from here to make the docs easier to understand for less experienced users.
I will now close this issue since a basic output documentation has now been established. If you want to add something from here, feel free to reopen this PR or open a new one. Thanks again for your effort, it gave me a lot of inspiration for #71!
Sorry for being barely responsive these days, it is a bit much from too many sides :) So, thank you for proceeding. I will look at my run and am confident that new issues will arise :)
Adding a pointer to https://github.com/nf-core/scdownstream/issues/2 and adding insights by @nictru on Slack to the output.md. I also added a table with the created files that seemed interesting and for which a short description would help. It is still a skeleton; I would wait for some initial comments prior to (doing my best to) filling them.