
read_10x_mtx v3 non-gzipped #1731

Closed. Hrovatin closed this issue 3 years ago.

Hrovatin commented 3 years ago

In the function read_10x_mtx there could be an option to look for non-gzipped files when reading v3 10x output. Currently I have the files barcodes.tsv, features.tsv, and matrix.mtx, but the function will not read them because they are not gzipped. ...

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-8-72e92bd46023> in <module>
----> 1 adata=sc.read_10x_mtx(path,
      2                       var_names='gene_symbols',
      3                       make_unique=True,
      4                       cache=False,
      5                       cache_compression=None,

~/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/readwrite.py in read_10x_mtx(path, var_names, make_unique, cache, cache_compression, gex_only, prefix)
    468     genefile_exists = (path / f'{prefix}genes.tsv').is_file()
    469     read = _read_legacy_10x_mtx if genefile_exists else _read_v3_10x_mtx
--> 470     adata = read(
    471         str(path),
    472         var_names=var_names,

~/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/readwrite.py in _read_v3_10x_mtx(path, var_names, make_unique, cache, cache_compression, prefix)
    530     """
    531     path = Path(path)
--> 532     adata = read(
    533         path / f'{prefix}matrix.mtx.gz',
    534         cache=cache,

~/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, cache_compression, **kwargs)
    110     filename = Path(filename)  # allow passing strings
    111     if is_valid_filename(filename):
--> 112         return _read(
    113             filename,
    114             backed=backed,

~/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, cache_compression, suppress_cache_warning, **kwargs)
    713 
    714     if not is_present:
--> 715         raise FileNotFoundError(f'Did not find file {filename}.')
    716     logg.debug(f'reading {filename}')
    717     if not cache and not suppress_cache_warning:

FileNotFoundError: Did not find file /storage/groups/ml01/projects/2020_pancreas_karin.hrovatin/data/pancreas/scRNA/islets_aged_fltp_iCre/rev6/cellranger/MUC13974/count_matrices/filtered_feature_bc_matrix/matrix.mtx.gz.

But I have 10x files there:


ls /storage/groups/ml01/projects/2020_pancreas_karin.hrovatin/data/pancreas/scRNA/islets_aged_fltp_iCre/rev6/cellranger/MUC13974/count_matrices/filtered_feature_bc_matrix/
barcodes.tsv  features.tsv  matrix.mtx
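
A minimal workaround sketch for this situation, assuming the three plain-text files follow the usual Cell Ranger layout (matrix.mtx stored as genes x cells, features.tsv with id and symbol columns); the directory path is a placeholder, and this is not scanpy's own v3 reader:

# Assemble an AnnData by hand from the non-gzipped files.
from pathlib import Path

import pandas as pd
import scanpy as sc

path = Path("filtered_feature_bc_matrix")  # placeholder directory

adata = sc.read_mtx(path / "matrix.mtx").T  # 10x stores genes x cells, so transpose
features = pd.read_csv(path / "features.tsv", sep="\t", header=None)
barcodes = pd.read_csv(path / "barcodes.tsv", sep="\t", header=None)

adata.var["gene_ids"] = features[0].values
adata.var_names = features[1].values  # like var_names='gene_symbols'
adata.var_names_make_unique()         # like make_unique=True
adata.obs_names = barcodes[0].values
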
wflynny commented 3 years ago

Is there a reason not to gzip these files? While they aren't big, you get roughly 4-fold storage savings, and most text inspection tools (e.g. less) can read gzipped files.
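
If gzipping is acceptable, here is a small sketch of doing that from Python so that read_10x_mtx's v3 code path finds the *.gz names it expects (the directory path is again a placeholder; running gzip on the three files from the shell does the same thing):

import gzip
import shutil
from pathlib import Path

path = Path("filtered_feature_bc_matrix")  # placeholder directory

for name in ("barcodes.tsv", "features.tsv", "matrix.mtx"):
    # Write name.gz alongside the original; delete the original afterwards if desired.
    with open(path / name, "rb") as fin, gzip.open(path / f"{name}.gz", "wb") as fout:
        shutil.copyfileobj(fin, fout)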

Hrovatin commented 3 years ago

We get non-gzipped files from the core facility, and they also seem to have worked with non-gzipped files in their workflow before. But maybe there was another reason for them not to use gzipped files (perhaps older reading functions or something).

LuckyMD commented 3 years ago

At some point scanpy switched to non-gzipped files by default because file I/O is faster that way. Reading files quickly was regarded as more important than minimizing storage. I guess it's always a speed vs. storage question.

ivirshup commented 3 years ago

I'm not super into the idea of supporting much beyond exactly what cellranger outputs for these functions. We expect very specific things from these files, and I think it's hard to say what counts as a reasonable amount of modification once we start supporting any.

I'd be open to exposing some of the internally used functions so it's easier to write a custom reading function here, if that's a reasonable alternative for you? I think all we'd really expose here is a faster scipy.io.mmread, since the other files are read with pd.read_csv.


@LuckyMD

At some point scanpy switched to non-gzipped files by default because file I/O is faster that way. Reading files quickly was regarded as more important than minimizing storage.

I believe this only applies to writing h5ad. But really, lzf is probably ideal here. It's much faster than gzip, has similar compression, and is barely slower than no compression. But lzf is vendored with h5py, not hdf5 (last I checked), so you might not be able to read a file compressed that way from R or something else.
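
For reference, a quick sketch of the write-side options being compared, using anndata's write_h5ad compression argument on a small dataset bundled with scanpy (file names are placeholders; the speed/size comments restate the claims above):

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # small example dataset shipped with scanpy

adata.write_h5ad("uncompressed.h5ad")                 # fastest I/O, largest file
adata.write_h5ad("gzipped.h5ad", compression="gzip")  # smallest file, slowest to write
adata.write_h5ad("lzf.h5ad", compression="lzf")       # h5py-only filter: similar size to
                                                      # gzip at near-uncompressed speed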

Hrovatin commented 3 years ago

That is a fair point. Maybe we can just close this issue then?

ivirshup commented 3 years ago

That sounds good to me. Let me know if you'd like me to point out our mtx reading function. It's basically just calling pandas to read the COO array, since pandas' CSV parser is much faster than scipy's.
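
Not scanpy's internal function, but a sketch in the spirit of what is described here: parse the MatrixMarket coordinate file with pd.read_csv (skipping the '%' header and comment lines) and build a scipy COO matrix. It assumes single-space-separated entries, as Cell Ranger writes them:

import pandas as pd
from scipy.sparse import coo_matrix


def fast_mmread(path):
    # '%' lines are the MatrixMarket header/comments; the first data row is
    # "n_rows n_cols n_nonzero", the rest are 1-based (row, col, value) triples.
    df = pd.read_csv(path, sep=" ", header=None, comment="%")
    n_rows, n_cols, _ = df.iloc[0].astype(int)
    entries = df.iloc[1:]
    return coo_matrix(
        (entries[2].values,
         (entries[0].values.astype(int) - 1, entries[1].values.astype(int) - 1)),
        shape=(int(n_rows), int(n_cols)),
    )

For 10x output the resulting matrix is genes x cells, so it would still need transposing to cells x genes before wrapping it in an AnnData, as read_10x_mtx does.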

LuckyMD commented 3 years ago

I believe this only applies to writing h5ad.

Hmm... I think I must have misunderstood at the time then... I thought it was faster for both.