Reading 10x scRNA-seq with read_10x_mtx()

areejalsaafin commented 3 years ago

I get an error when I use sc.read_10x_mtx() to read scRNA-seq files (matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz) generated by "Illumina HiSeq 4000". The function works fine when I read files generated by other 10x platforms such as "Illumina NovaSeq 6000".

Example:

sc.read_10x_mtx("GSE145328_RAW")

Error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 2

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-20-26443e0aed95> in <module>
----> 1 rnaseq1 = sc.read_10x_mtx("GSE145328_RAW")

~/anaconda3/lib/python3.8/site-packages/scanpy/readwrite.py in read_10x_mtx(path, var_names, make_unique, cache, cache_compression, gex_only, prefix)
    479     genefile_exists = (path / f'{prefix}genes.tsv').is_file()
    480     read = _read_legacy_10x_mtx if genefile_exists else _read_v3_10x_mtx
--> 481     adata = read(
    482         str(path),
    483         var_names=var_names,

~/anaconda3/lib/python3.8/site-packages/scanpy/readwrite.py in _read_v3_10x_mtx(path, var_names, make_unique, cache, cache_compression, prefix)
    560     else:
    561         raise ValueError("`var_names` needs to be 'gene_symbols' or 'gene_ids'")
--> 562     adata.var['feature_types'] = genes[2].values
    563     adata.obs_names = pd.read_csv(path / f'{prefix}barcodes.tsv.gz', header=None)[
    564         0

~/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

~/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 2

Versions

scanpy==1.8.0 anndata==0.7.6 umap==0.5.1 numpy==1.19.2 scipy==1.6.3 pandas==1.2.4 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.1 pynndescent==0.5.2

[Paste the output of scanpy.logging.print_versions() leaving a blank line after the details tag]

ivirshup commented 3 years ago

Thanks for the report!

First question: Is this directly out of cell ranger? Or is that the supplementary file from this GEO entry (GSE145328)? If this is the entry from GEO, I wouldn't expect sc.read_10x_mtx, since that doesn't look like the output of cellranger.

areejalsaafin commented 3 years ago

GSE145328 is from NCBI GEO. GSE139555_RAW.tar is another example that gives the same error. These are the output files generated by Cell Ranger and uploaded to NCBI GEO, based on the authors' notes in their paper.

Some examples from NCBI GEO that can be read without errors: GSE173231 and GSE158803.

hurleyLi commented 3 years ago

So the problem is actually from GEO. When people submitted the files processed by Cellranger version 2, they gzip-ed the files. However when Scanpy sees .gz file it recognized the version as Cellranger version 3 by default, which is a little bit different from the version 2 format.

All you need to do is just to gunzip the matrix.mtx.gz, barcodes.tsv.gz, and genes.tsv.gz files. You can tell whether the data was processed using version 2 or 3 from the description in GEO, or by the name of the uploaded gene files: genes.tsv (version 2, could be genes.tsv.gz) or features.tsv.gz (version 3)

dn-ra commented 2 years ago

Hello,

I am having a similar issue with read_10x_mtx, except my error is showing Key Error: 1 and it is from 10X data we produced ourselves, do doesn't involve a GEO incompabitibility.

Error below:


>>> juno = sc.read_10x_mtx(path = "../../Data/juno_influenza_pilot/hash_t_cells/umi_count")
Traceback (most recent call last):
  File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2131, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2140, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/scanpy/readwrite.py", line 481, in read_10x_mtx
    adata = read(
  File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/scanpy/readwrite.py", line 552, in _read_v3_10x_mtx
    var_names = genes[1].values
  File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/pandas/core/frame.py", line 3455, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 1

ghost commented 2 years ago

thanks a lot @hurleyLi, dzip -d the zipped files fix the problem for me.

ghost commented 2 years ago

i would add a pull request but i see there are 44 already to go...

flying-sheep commented 2 years ago

pandas transparently handles gzipping. We could also simply add support for gzipped files:

def maybe_zipped_path(path: Path) -> Path:
    zipped_path = Path(f'{path}.gz')
    if zipped_path.is_file(): return zipped_path
    return path

And then we use it:

https://github.com/scverse/scanpy/blob/a38a22ab85074f17788b8d1effa89c1373e0c978/scanpy/readwrite.py#L488

to

genefile_exists = maybe_zipped_path(path / f'{prefix}genes.tsv').is_file()

and

https://github.com/scverse/scanpy/blob/a38a22ab85074f17788b8d1effa89c1373e0c978/scanpy/readwrite.py#L525

to

genes = pd.read_csv(maybe_zipped_path(path / f'{prefix}genes.tsv'), header=None, sep='\t')

and so on.

dn-ra commented 1 year ago

I've figured out what was causing my error. The scanpy function to read in features.tsv.gz expects three columns: ['gene_symbols, 'gene_ids', 'feature_types'] Where 'feature types' is a text string like 'Gene Expression' and usually repeated along the whole length of the file. The file I was reading in was from HTO data and only had one column:

Hashtag1-GTCAACTCTTTAGCG Hashtag2-TTCCGCCTCTCTTTG Hashtag3-AAGTATCGTTTCGCA unmapped

So if others run into this same error, just add in some extra columns to the features.tsv file so it doesn't error out when looking for the extra columns. Something like this (different features file):

RP11-34P13.7 RP11-34P13.7 Gene Expression FO538757.3 FO538757.3 Gene Expression FO538757.2 FO538757.2 Gene Expression AP006222.2 AP006222.2 Gene Expression RP4-669L17.10 RP4-669L17.10 Gene Expression

It would also be helpful if scanpy would validate the number of columns at the start. At the moment it looks like it reads in the whole .mtx file before trying to map the feature names and producing this error, so it takes a while to fail.

VladimirShitov commented 1 year ago

Faced exactly the same problem with file "GSE185477_GSM3178784_C41_SC_raw_counts.zip" from GSE185477

The advice from @hurleyLi helped, thanks a lot! But the error is quite confusing. It would be great to read files without extra actions for the user, but is it possible to at least change the error message? E.g.

try:
    ....
except KeyError:
    raise KeyError("Unexpected error, probably due to Cellranger version. Make sure to unarchive gzipped file coming from Cellranger v2")

echoo07 commented 1 year ago

I've figured out what was causing my error. The scanpy function to read in features.tsv.gz expects three columns: ['gene_symbols, 'gene_ids', 'feature_types'] Where 'feature types' is a text string like 'Gene Expression' and usually repeated along the whole length of the file. The file I was reading in was from HTO data and only had one column:

Hashtag1-GTCAACTCTTTAGCG Hashtag2-TTCCGCCTCTCTTTG Hashtag3-AAGTATCGTTTCGCA unmapped

So if others run into this same error, just add in some extra columns to the features.tsv file so it doesn't error out when looking for the extra columns. Something like this (different features file):

RP11-34P13.7 RP11-34P13.7 Gene Expression FO538757.3 FO538757.3 Gene Expression FO538757.2 FO538757.2 Gene Expression AP006222.2 AP006222.2 Gene Expression RP4-669L17.10 RP4-669L17.10 Gene Expression

It would also be helpful if scanpy would validate the number of columns at the start. At the moment it looks like it reads in the whole .mtx file before trying to map the feature names and producing this error, so it takes a while to fail.

when you do add those columns it makes the number of genes to 0 while reading it!

Laura-Waechter commented 1 year ago

Hello,

my issue is similar to the ones above, so I hope you can help me with that.

I tried to read in some sample data from a liver cell database (Liver Cell Atlas) with the function sc.read_10x_mtx. When I pass in the data just like that, I get a KeyError: 1. I adjusted the file to have two more columns (with numbers 1:x as gene IDs and a feature types column with "Gene Expression" in all rows) , but then I get the following Error: ValueError: Length of passed value for var_names is 31054, but this AnnData has shape: (389056, 31053). I think this is because the function treats the column names as values, but I don’t know why. When I look at the features file with pandas, it is displayed correctly.

Any suggestions? Thanks a lot in advance!

YuchenXiangEMBL commented 1 year ago

Not sure whether it is resolved, just put here another solution to read_mtx and add to anndata one by one

import pandas as pd
import scanpy as sc

adata = sc.read_mtx('./matrix.mtx')
adata_bc=pd.read_csv('./barcodes.tsv',header=None)
adata_features=pd.read_csv('./features.tsv',header=None)
adata= adata.T
adata.obs['cell_id']= adata_bc
adata.var['gene_name']= adata_features[0].tolist()
adata.var.index= adata.var['gene_name']

Laura-Waechter commented 1 year ago

All right, I will try that. Thanks

randymi01 commented 1 month ago

Not sure whether it is resolved, just put here another solution to read_mtx and add to anndata one by one

import pandas as pd
import scanpy as sc

adata = sc.read_mtx('./matrix.mtx')
adata_bc=pd.read_csv('./barcodes.tsv',header=None)
adata_features=pd.read_csv('./features.tsv',header=None)
adata= adata.T
adata.obs['cell_id']= adata_bc
adata.var['gene_name']= adata_features[0].tolist()
adata.var.index= adata.var['gene_name']

set the delimiter for the features to tab, then use the second column which contains the gene names and not the gene ensembl id. Using the gene names is better for downstream qc since the scanpy recommended pipeline uses the gene name prefixes to identify mitochondrial genes.

adata_features = pd.read_csv('./barcodes.tsv', header = None, delimiter = '\t')
...

# technically don't need to use .values or tolist() since the mtx and features file should 
# have same number of rows resulting in same index in the adata.var and adata_features dataframes

adata.var['gene_name'] = adata_features[1].values

scverse / scanpy

Reading 10x scRNA-seq with read_10x_mtx() #1916

Versions