Sparse matrices for cbImportScanpy - Githubissues

maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org

https://github.com/ucscGenomeBrowser/cellBrowser/

GNU General Public License v3.0

102 stars 40 forks source link

Sparse matrices for cbImportScanpy #231

Closed redst4r closed 2 years ago

redst4r commented 2 years ago

Fixing #230 : Export the gene expression matrix in sparse format, saving time and disk space

Using 'matrix.mtx.gz' as the new expression matrix exposed two other bugs:

running cbBuild on an existing build will fail with a missing key exception (outConf["fileVersions"] does not exist in matrixOrSamplesHaveChanged())
The logic in MatrixMtxReader.iterrows() to resolve geneIds works differently then in MatrixTsvReader.iterrows(), should be consistent now.

maximilianh commented 2 years ago

Wow, thanks! Looks good, I may change a few small things later with a followup commit.

maximilianh commented 2 years ago

Note that I moved this into the develop branch. We keep the released version in master, and development in "develop". I'm not sure if this is a good convention, often PRs forget to set the branch, but it means that people who clone the repo first see the stable version. Idk.

maximilianh commented 2 years ago

Now after your changes, there is no way to get the .tsv output anymore. I wonder if this is a good idea. Old pipelines may be used to have the .tsv.gz ?

I just looked at an mtx file, this is what it looks like:

9 80 52 1 12 5.122949e+00 1 14 5.871189e+00 1 15 5.310305e+00 1 16 5.264228e+00 1 19 5.385112e+00 2 11 5.893417e+00 2 13 5.397913e+00 2 15 5.310305e+00 2 16 5.264228e+00 2 51 4.631501e+00 2 55 4.586091e+00 2 61 4.406388e+00 2 65 4.307617e+00

Doesn't look very compact to me. Strange that mmwrite doesn't allow us to set the format or get rid of these pointless +00 strings... we'll have a lot of these. Did you compare the file size compared to .tsv.gz ?

maximilianh commented 2 years ago

Hi @redst4r I've made a few changes to this:

cbScanpyImport now has an option -f to specify the format
the default is "tsv" but you can add the line "matrixFormat='mtx'" to your ~/.cellbrowser.conf to change this
cbScanpy got the same option -f, same behavior
I changed the default output filename to features.tsv.gz instead of genes.tsv.gz because that's the new name for Cellranger 3 and also moving forward that will work better for the various other dataset types that we'll have and so we don't ever have to change it again

All of this is in the develop branch for now.

maximilianh commented 2 years ago

Also changed the default branch now to "develop", so future PRs will go to that by default and I don't have to revert commits anymore. I should have done that years ago.

maximilianh commented 2 years ago

Hey @redst4r, this part of the code is problematic:

genes_file = join(path, 'features.tsv.gz')
with gzip.open(genes_file, 'wt') as f:
    f.write("\n".join(genes))

You're not saving the symbols or geneIDs, just one of them. Some users, like @pcm32 really need to keep both the gene IDs and the gene symbols. I think in features.tsv.gz there is a convention to have geneIds first, then symbols, as tab-sep columns (I hope that's correct). Can we make it so such that this information is not lost in the conversion? The issue linked from here has some additional information.

I see that you run geneSeriesToStrings, but that's only useful for .tsv format. For .mtx format, I think the two columns must be tab-separated. Ideally, we would look at an example file from cellranger and imitate their format...?

maximilianh commented 2 years ago

OK, nevermind, I changed this now, getting rid of some code duplication. adding a parameter "sep" to geneSeriesToStrings should have addressed this. I haven't tested this yet, @matthewspeir may have ideas on better example datasets, probably I should test a lot more with .h5ad files.

maximilianh commented 2 years ago

Hi @redst4r, are you happy with the changes I made to your pull request? If not, please don't hesitate to reply here or let us know via cells@ucsc.edu. The changes are made to make sure that existing functionality is not broken. Some users, like @pcm32 use cbBuild in their pipelines and if the default output format is changed, that may break their pipelines. But in principle, I guess we all agree that .mtx.gz is probably the best default format in the long run...

redst4r commented 2 years ago

Hi, sorry was busy last week with other stuff. Making the output format a command line option seems like a good compromise for compatibility. Thanks for checking compatibility in general, I use cellbrowser only in one particular workflow, so it's hard for me to see what those changes break for other people!

About the weird floating point formatting of mmwrite: I don't think it's too big of an issue, compression should take care of most of that overhead. But I guess you could specify mmwrite(field='integer') for integer matrices, or mmwrite(field='real', precision=xxx) otherwise to make it more compact.

All your changes look good to me, please go ahead with it!

maximilianh commented 2 years ago

No, field='integer' is not needed, mmwrite does this automatically, I just tried. as for precision, do you know why you changed the default to precision=7? Looks good to me either way, was just wondering why you changed the default.

redst4r commented 2 years ago

actually, no idea about the precision=7 argument, I might have just copy/pasted that from somewhere. Feel free to reset it to default