maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
https://github.com/ucscGenomeBrowser/cellBrowser/
GNU General Public License v3.0
102 stars 40 forks source link

Sparse matrices for cbImportScanpy #231

Closed redst4r closed 2 years ago

redst4r commented 2 years ago

Fixing #230 : Export the gene expression matrix in sparse format, saving time and disk space

Using 'matrix.mtx.gz' as the new expression matrix exposed two other bugs:

maximilianh commented 2 years ago

Wow, thanks! Looks good, I may change a few small things later with a followup commit.

maximilianh commented 2 years ago

Note that I moved this into the develop branch. We keep the released version in master, and development in "develop". I'm not sure if this is a good convention, often PRs forget to set the branch, but it means that people who clone the repo first see the stable version. Idk.

maximilianh commented 2 years ago

Now after your changes, there is no way to get the .tsv output anymore. I wonder if this is a good idea. Old pipelines may be used to have the .tsv.gz ?

I just looked at an mtx file, this is what it looks like:

9 80 52 1 12 5.122949e+00 1 14 5.871189e+00 1 15 5.310305e+00 1 16 5.264228e+00 1 19 5.385112e+00 2 11 5.893417e+00 2 13 5.397913e+00 2 15 5.310305e+00 2 16 5.264228e+00 2 51 4.631501e+00 2 55 4.586091e+00 2 61 4.406388e+00 2 65 4.307617e+00

Doesn't look very compact to me. Strange that mmwrite doesn't allow us to set the format or get rid of these pointless +00 strings... we'll have a lot of these. Did you compare the file size compared to .tsv.gz ?

maximilianh commented 2 years ago

Hi @redst4r I've made a few changes to this:

All of this is in the develop branch for now.

maximilianh commented 2 years ago

Also changed the default branch now to "develop", so future PRs will go to that by default and I don't have to revert commits anymore. I should have done that years ago.

maximilianh commented 2 years ago

Hey @redst4r, this part of the code is problematic:

genes_file = join(path, 'features.tsv.gz')
with gzip.open(genes_file, 'wt') as f:
    f.write("\n".join(genes))

You're not saving the symbols or geneIDs, just one of them. Some users, like @pcm32 really need to keep both the gene IDs and the gene symbols. I think in features.tsv.gz there is a convention to have geneIds first, then symbols, as tab-sep columns (I hope that's correct). Can we make it so such that this information is not lost in the conversion? The issue linked from here has some additional information.

I see that you run geneSeriesToStrings, but that's only useful for .tsv format. For .mtx format, I think the two columns must be tab-separated. Ideally, we would look at an example file from cellranger and imitate their format...?

maximilianh commented 2 years ago

OK, nevermind, I changed this now, getting rid of some code duplication. adding a parameter "sep" to geneSeriesToStrings should have addressed this. I haven't tested this yet, @matthewspeir may have ideas on better example datasets, probably I should test a lot more with .h5ad files.

maximilianh commented 2 years ago

Hi @redst4r, are you happy with the changes I made to your pull request? If not, please don't hesitate to reply here or let us know via cells@ucsc.edu. The changes are made to make sure that existing functionality is not broken. Some users, like @pcm32 use cbBuild in their pipelines and if the default output format is changed, that may break their pipelines. But in principle, I guess we all agree that .mtx.gz is probably the best default format in the long run...

redst4r commented 2 years ago

Hi, sorry was busy last week with other stuff. Making the output format a command line option seems like a good compromise for compatibility. Thanks for checking compatibility in general, I use cellbrowser only in one particular workflow, so it's hard for me to see what those changes break for other people!

About the weird floating point formatting of mmwrite: I don't think it's too big of an issue, compression should take care of most of that overhead. But I guess you could specify mmwrite(field='integer') for integer matrices, or mmwrite(field='real', precision=xxx) otherwise to make it more compact.

All your changes look good to me, please go ahead with it!

maximilianh commented 2 years ago

No, field='integer' is not needed, mmwrite does this automatically, I just tried. as for precision, do you know why you changed the default to precision=7? Looks good to me either way, was just wondering why you changed the default.

redst4r commented 2 years ago

actually, no idea about the precision=7 argument, I might have just copy/pasted that from somewhere. Feel free to reset it to default