maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

Exporting the gene expression matrix as sparse matrix (cbImportScanpy, cbBuild)? #230

Closed redst4r closed 2 years ago

redst4r commented 2 years ago

Hi,

I've been visualizing quite large datasets (>100k cells) with cellbrowser (from scanpy .h5ad files), and building the cellbrowser for datasets this large takes a significant amount of time. In particular, writing the full gene expression matrix to disk as plain tsv (see anndataMatrixToTsv()) takes forever.

Going through the code that's called by cbImportScanpy and cbBuild, I actually don't see the need to save the expression matrix to disk in "dense" format (i.e. plaintext with all the zeros in there).

We could just export it in "matrix-market" format (e.g. via scipy.io.mmwrite), which would save a lot of time for sparsely populated matrices (like the raw counts). Even for dense (processed/normalized/filtered) matrices, it probably wouldn't be any worse than the plain dense format.

The code for the "import" step looks pretty easy to change, just writing the matrix via scipy.io.mmwrite, and the cell-ids and genenames in separate files. Not so sure about the "build" part, that looks convoluted.
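To make the idea concrete, here is a rough sketch of such an export, not the actual cellBrowser code. The function name `write_sparse_export` and the output file names (`matrix.mtx`, `barcodes.tsv`, `features.tsv`) are assumptions chosen for illustration; whatever cbBuild actually expects may differ. It assumes a cells-by-genes matrix, as in `adata.X`, and transposes it to genes-by-cells before writing.

```python
import os

import scipy.io
import scipy.sparse


def write_sparse_export(matrix, cell_ids, gene_names, out_dir):
    """Write a cells-x-genes matrix as Matrix Market plus two ID files.

    Hypothetical helper: the file names below are assumptions, not the
    names cellBrowser necessarily uses.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Convert to COO and transpose to genes-x-cells; only non-zero
    # entries are written, so the zeros never hit the disk.
    mat = scipy.sparse.coo_matrix(matrix).T
    mtx_path = os.path.join(out_dir, "matrix.mtx")
    scipy.io.mmwrite(mtx_path, mat)

    # Cell IDs and gene names go into separate one-per-line files,
    # in the same order as the matrix columns and rows.
    barcodes_path = os.path.join(out_dir, "barcodes.tsv")
    with open(barcodes_path, "w") as fh:
        fh.write("\n".join(cell_ids) + "\n")

    features_path = os.path.join(out_dir, "features.tsv")
    with open(features_path, "w") as fh:
        fh.write("\n".join(gene_names) + "\n")

    return mtx_path, barcodes_path, features_path
```

With an AnnData object this could be called as `write_sparse_export(adata.X, list(adata.obs_names), list(adata.var_names), "out")`; the resulting `matrix.mtx` can be gzipped afterwards if a compressed file is preferred.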

Let me know what you think!

maximilianh commented 2 years ago

Hi, thanks for opening this issue and for the feedback. I must admit that I've never thought about that. We do use .mtx.gz for Seurat but not for Scanpy. You're right, it would make a lot of sense to use .mtx.gz here, too. Do you want to give it a try? I won't have the time before January, I'm afraid...

redst4r commented 2 years ago

Yeah, I'll give it a shot. I also noticed that the Seurat export writes a sparse matrix anyway, so the cbBuild part of cellbrowser must already be working with matrix-market (mtx) files. That makes it a lot easier. I'll try to modify the cbImportScanpy part!

maximilianh commented 2 years ago

OK, let me know if you run into any problems. I'll ping this ticket again next week and check back with you. Otherwise I can also have a go at this; as you say, it shouldn't be hard.

maximilianh commented 2 years ago

Many thanks for the pull request. I think we can close this ticket and continue the conversation in the pull request.