scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

KeyError: 1 in read_10x_mtx if genes.tsv has only one column #2053

Open brianpenghe opened 2 years ago

brianpenghe commented 2 years ago

I have a similar issue to this comment.

Carraro=sc.read_10x_mtx('/mnt/Carraro',var_names='gene_ids')

Switching to var_names='gene_symbols' didn't work either.

Error messages:

--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/flng/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/miniconda3/envs/flng/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/miniconda3/envs/flng/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_29519/245170133.py in <module>
----> 1 Carraro=sc.read_10x_mtx('/mnt/Carraro',var_names='gene_ids')

~/miniconda3/envs/flng/lib/python3.8/site-packages/scanpy/readwrite.py in read_10x_mtx(path, var_names, make_unique, cache, cache_compression, gex_only)
    452     genefile_exists = (path / 'genes.tsv').is_file()
    453     read = _read_legacy_10x_mtx if genefile_exists else _read_v3_10x_mtx
--> 454     adata = read(
    455         str(path),
    456         var_names=var_names,

~/miniconda3/envs/flng/lib/python3.8/site-packages/scanpy/readwrite.py in _read_legacy_10x_mtx(path, var_names, make_unique, cache, cache_compression)
    491     elif var_names == 'gene_ids':
    492         adata.var_names = genes[0].values
--> 493         adata.var['gene_symbols'] = genes[1].values
    494     else:
    495         raise ValueError("`var_names` needs to be 'gene_symbols' or 'gene_ids'")

~/miniconda3/envs/flng/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/miniconda3/envs/flng/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 1

Any ideas?

brianpenghe commented 2 years ago

It seems to be something about the genes.tsv. I replaced it with another genes.tsv and it didn't produce errors.

ivirshup commented 2 years ago

@brianpenghe, do you have a copy of your original file? Any idea what could have been different?

@dn-ra, would you be able to share the first couple lines of your file, and let me know how it was generated?

brianpenghe commented 2 years ago

> @brianpenghe, do you have a copy of your original file? Any idea what could have been different?
>
> @dn-ra, would you be able to share the first couple lines of your file, and let me know how it was generated?

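I think I found the cause: when genes.tsv has only one column, the genes[1] lookup in _read_legacy_10x_mtx (line 493 of readwrite.py in the traceback above) has no second column to read and throws this KeyError.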

Thanks!
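The one-column cause described above can be checked before calling sc.read_10x_mtx. The following is a minimal sketch, not part of scanpy: the helper name count_tsv_columns and its max_rows parameter are made up for illustration, and it assumes an uncompressed, tab-separated genes.tsv.

```python
from itertools import islice


def count_tsv_columns(tsv_path, max_rows=100):
    """Return the set of tab-separated column counts found in the
    first `max_rows` lines of the file (blank lines are ignored)."""
    counts = set()
    with open(tsv_path) as handle:
        for line in islice(handle, max_rows):
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            counts.add(len(line.split("\t")))
    return counts


# Hypothetical usage; the path is the one from the original report:
# if count_tsv_columns("/mnt/Carraro/genes.tsv") == {1}:
#     print("genes.tsv has a single column; genes[1] will raise KeyError: 1")
```

Because the check splits on tabs only, a file whose two columns are separated by spaces is also reported as having one column.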

mboisvert1 commented 2 years ago

@brianpenghe What column did you add to the genes.tsv so that it worked? I currently have a genes.tsv file with one column for the gene names and am getting the same error as you did. Thanks!

flying-sheep commented 2 years ago

If that’s a case that can happen, we should deal with it. @brianpenghe please share a few lines of the file in a code block.

brianpenghe commented 2 years ago

In my case, there were three files: barcodes.tsv, genes.tsv, and matrix.mtx. What didn't work was a genes.tsv that looks like this:

AL627309.1
AL669831.5
LINC00115
FAM41C
AL645608.3
SAMD11
NOC2L
KLHL17
PLEKHN1
PERM1

What worked was a genes.tsv that looks like this:

ENSG00000243485 MIR1302-2HG
ENSG00000237613 FAM138A
ENSG00000186092 OR4F5
ENSG00000238009 AL627309.1
ENSG00000239945 AL627309.3
ENSG00000239906 AL627309.2
ENSG00000241599 AL627309.4
ENSG00000236601 AL732372.1
ENSG00000284733 OR4F29
ENSG00000235146 AC114498.1

So I had to import the data with the latter genes.tsv and then replace var_names with the correct gene names.
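An alternative to swapping in a different genes.tsv would be to pad the one-column file with a second column. This is only a sketch under assumptions: pad_genes_tsv is a hypothetical helper, not a scanpy function, and it simply repeats each gene name so that the legacy reader finds both genes[0] and genes[1].

```python
def pad_genes_tsv(src, dst):
    """Copy `src` to `dst`, repeating the name on every one-column line
    so that each output line has two tab-separated fields.
    Returns the number of lines written."""
    written = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fields = line.rstrip("\r\n").split("\t")
            if fields == [""]:
                continue  # skip blank lines
            if len(fields) == 1:
                fields = [fields[0], fields[0]]  # same name as "ID" and "symbol"
            fout.write("\t".join(fields) + "\n")
            written += 1
    return written


# Hypothetical usage: write the padded copy elsewhere, then move it into
# place as genes.tsv once it looks right.
# pad_genes_tsv("genes.tsv", "genes.padded.tsv")
```

The padded file should then load with either var_names setting, but both columns hold the same names, so no real Ensembl IDs are recovered this way.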

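I noticed that sc.read_10x_mtx can read both .gz and plain-text files and works out the format on its own. Whether the gene file is named genes.tsv or features.tsv also matters: as the traceback above shows, the presence of genes.tsv selects the legacy reader (_read_legacy_10x_mtx) instead of the v3 one (_read_v3_10x_mtx).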

Any ideas?

dn-ra commented 2 years ago

I've fixed the error I was getting, which was posted on another issue and referenced here. Here's the solution that worked for me: https://github.com/scverse/scanpy/issues/1916#issuecomment-1286404697

chloesavignac commented 5 months ago

I encountered the same error (KeyError: 1) when trying to load the .mtx file with scanpy.read_10x_mtx(). After several unsuccessful attempts at renaming the columns and indices in the 'genes.tsv' file in different ways, I found a workaround that worked for me:

  1. Import the .mtx file separately using scanpy.read_mtx().
  2. Convert the imported data to a pandas DataFrame using .to_df().
  3. Manually name the columns and indices using the 'barcodes.tsv' and 'features.tsv' files, respectively.

This approach allowed me to bypass the KeyError and successfully load the data.
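The workaround above could look roughly like the sketch below. It is a variant rather than the exact steps: it assigns the names directly on the AnnData object instead of going through .to_df(), which avoids building a dense DataFrame. The helper names read_first_column and load_10x_manually and the genes_file parameter are assumptions for illustration; scanpy.read_mtx is the only scanpy call used. The files are assumed to be uncompressed.

```python
from pathlib import Path


def read_first_column(tsv_path):
    """Return the first tab-separated field of every non-empty line."""
    with open(tsv_path) as handle:
        return [line.rstrip("\r\n").split("\t")[0] for line in handle if line.strip()]


def load_10x_manually(data_dir, genes_file="genes.tsv"):
    """Read matrix.mtx and attach barcode and gene names by hand."""
    import scanpy as sc  # imported here so read_first_column works without scanpy

    data_dir = Path(data_dir)
    # 10x matrix.mtx is stored as genes x cells; transpose to cells x genes.
    adata = sc.read_mtx(data_dir / "matrix.mtx").T
    adata.obs_names = read_first_column(data_dir / "barcodes.tsv")
    adata.var_names = read_first_column(data_dir / genes_file)
    return adata


# Hypothetical usage, with the directory from the original report:
# adata = load_10x_manually("/mnt/Carraro")
# adata = load_10x_manually("/mnt/Carraro", genes_file="features.tsv")
```

Assigning obs_names or var_names raises an error if the number of names does not match the matrix dimensions, which doubles as a sanity check on the three files.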