maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

Given a very large AnnData object, what would be the minimal AnnData fields that are required by CellBrowser? #193

Closed pcm32 closed 2 years ago

pcm32 commented 3 years ago

Hi there,

I was given a very large AnnData file (~20 GB), which of course includes tSNE and UMAP coordinates. I was wondering which parts I could safely remove so that I can just feed it to CellBrowser through its AnnData converter (cbBuild or so, being passed an AnnData file). The idea is to make that AnnData more portable but still useful for viewing with CellBrowser.

Thanks!

matthewspeir commented 3 years ago

Sorry for just getting around to this, @pcm32.

Hmm, I'm not sure I fully understand. You want to remove parts of the AnnData to make it smaller but still compatible with the Cell Browser?

pcm32 commented 3 years ago

Yes, exactly.

matthewspeir commented 3 years ago

Interesting! I think for the row attributes (which are the genes?), the only attributes which I think are necessary are the gene symbol and maybe the accessions. For the columns, the cell IDs are obviously important, but other than that just the cluster/celltype annotation columns? You've already mentioned the embeddings.

That should be it for "essential" things: gene symbols/accessions, cell IDs, cluster/celltype, embeddings.

Keep a back-up of the original AnnData just in case. Let me know how it goes!
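The "essential" fields listed above could be kept, and everything else dropped, with something like the following sketch. It is not part of the Cell Browser code base: the function names and the `keep_*` parameters are hypothetical, and it only relies on the `obs`, `var` and `obsm` attributes of an AnnData object. Cell IDs and gene IDs live in the indexes (`obs_names`, `var_names`), so they are left alone.

```python
def columns_to_drop(existing, keep):
    """Return the names in `existing` that are not listed in `keep`, in order."""
    keep = set(keep)
    return [name for name in existing if name not in keep]


def trim_annotations(adata, keep_obs, keep_var, keep_obsm):
    """Drop cell/gene annotation columns and embeddings not in the keep lists.

    `adata` is expected to behave like an AnnData object (hypothetical helper,
    modifies `adata` in place). The obs/var indexes, i.e. the cell IDs and
    gene IDs, are not columns and are therefore never removed.
    """
    # Cell-level metadata: keep only e.g. the cluster/celltype columns.
    for col in columns_to_drop(list(adata.obs.columns), keep_obs):
        del adata.obs[col]
    # Gene-level metadata: keep only e.g. gene symbols / accessions.
    for col in columns_to_drop(list(adata.var.columns), keep_var):
        del adata.var[col]
    # Embeddings: keep only the coordinates that should be displayed.
    for key in columns_to_drop(list(adata.obsm.keys()), keep_obsm):
        del adata.obsm[key]
    return adata
```

A call might look like `trim_annotations(adata, keep_obs=["louvain"], keep_var=["gene_symbols"], keep_obsm=["X_tsne", "X_umap"])` after loading the file with `anndata.read_h5ad(...)`. The column and key names here are assumptions based on common Scanpy conventions; check what your own object actually contains before deleting anything.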

maximilianh commented 3 years ago

Hi Pablo, can you explain a bit why you think your anndata object would be too large? I can't think of a reason why there would be a problem exporting it, did you run into problems? If you do get errors, we can address them, but we've exported 2M cell atlases without any problem so far.

It may be slow upon the first load. This is due to me preloading all the meta info, which I really shouldn't do anymore. So you may want to reduce your meta table to just the essentials, or let me know; I should really remove the "preloadMeta()" call on the initial dataset load.

maximilianh commented 3 years ago

There used to be a problem in the Seurat exporter and matrices that are too big for R, but we've worked around this now using .mtx.gz files (like all other packages). I don't think Python ever had a problem with too large matrices.

pcm32 commented 3 years ago

It is mostly for the sake of portability and to avoid using so much unnecessary space on Galaxy instances. The original AnnData I was given was 20 GB; I reduced it to 10 GB by removing the raw, and further to 4 GB by saving it with compression. I just didn't want to be moving around and replicating a 20 GB file for visualisation purposes only.

maximilianh commented 3 years ago

Ah, I didn't know you could save without compression, I thought it was the default.

Yes, you can set "raw" to None if you don't want to show it, cbImportScanpy checks for None.
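The two size reductions discussed here (dropping `raw` and writing with compression) can be sketched as below. The helper name `write_slim_copy` and the output path are hypothetical; `write_h5ad` and its `compression` argument are part of the anndata API, where, as far as I know, compression is off unless it is requested explicitly.

```python
def write_slim_copy(adata, out_path):
    """Drop the raw matrix and write a gzip-compressed .h5ad file.

    Hypothetical helper; `adata` is expected to be an AnnData object.
    Modifies `adata` in place, so work on a copy or keep the original file.
    """
    # Setting raw to None removes the raw matrix from the object.
    adata.raw = None
    # compression="gzip" trades some write/read time for a smaller file.
    adata.write_h5ad(out_path, compression="gzip")
    return out_path
```

The resulting file can then be passed to the Cell Browser import tools in place of the original. How much space this saves depends on the data; the 20 GB to 4 GB figure above is specific to that dataset.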


matthewspeir commented 2 years ago

I think we can close this for now? Feel free to reopen if there's more to discuss.