scverse / pytometry

Flow & mass cytometry analytics.
https://pytometry.readthedocs.io/en/latest/index.html
Apache License 2.0

pm.io.read_fcs error: KeyError: 'marker' #28

Closed alefrol638 closed 2 years ago

alefrol638 commented 2 years ago

Problem Description:

When trying to import FCS files from a directory using pm.io.read_fcs, I get the error "KeyError: 'marker'". After looking at the FCS files, I found that the marker keywords do not start with $ ($P[0-9]) but only with P (P[0-9]). As a workaround, I used the readfcs package directly to extract the marker names manually. This works; however, when I do the import manually and then call pm.pp.split_signal(), the AnnData object is deleted.

Error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/pandas/core/indexes/base.py:3629, in Index.get_loc(self, key, method, tolerance)
   3628 try:
-> 3629     return self._engine.get_loc(casted_key)
   3630 except KeyError as err:

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'marker'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [14], in <cell line: 5>()
      5 for fcs_file in fcs_files:
      6     if (fcs_file.endswith('.fcs') or fcs_file.endswith('.FCS')):
----> 7         adata_tmp = pm.io.read_fcs(os.path.join(path_data, fcs_file)
      8                             )
     10         #run compensation
     11        # adata_tmp = pm.pp.compute_bleedthr(adata_tmp,)
     12 
   (...)
     31 
     32         #add metadata
     33         sample_id = str(fcs_file.split( '.')[0])

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/pytometry/read_write/_readfcs.py:15, in read_fcs(path)
      5 def read_fcs(path: str) -> AnnData:
      6     """Read FCS file and convert into AnnData format.
      7 
      8     Args:
   (...)
     13         an AnnData object of the fcs file
     14     """
---> 15     return readfcs.read(path)

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/readfcs/_core.py:180, in read(filepath, reindex)
    168 """Read in fcs file as AnnData.
    169 
    170 Args:
   (...)
    177     an AnnData object
    178 """
    179 fcsfile = ReadFCS(filepath)
--> 180 return fcsfile.to_anndata(reindex=reindex)

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/readfcs/_core.py:150, in ReadFCS.to_anndata(self, reindex)
    147 if reindex:
    148     adata.var = adata.var.reset_index()
    149     adata.var.index = np.where(
--> 150         adata.var["marker"].isin(["", " "]),
    151         adata.var["channel"],
    152         adata.var["marker"],
    153     )
    154     mapper = pd.Series(adata.var.index, index=adata.var["channel"])
    155     if self.meta.get("spill") is not None:

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/pandas/core/frame.py:3505, in DataFrame.__getitem__(self, key)
   3503 if self.columns.nlevels > 1:
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):
   3507     indexer = [indexer]

File ~/opt/anaconda3/envs/pytometry_0.1.2/lib/python3.8/site-packages/pandas/core/indexes/base.py:3631, in Index.get_loc(self, key, method, tolerance)
   3629     return self._engine.get_loc(casted_key)
   3630 except KeyError as err:
-> 3631     raise KeyError(key) from err
   3632 except TypeError:
   3633     # If we have a listlike key, _check_indexing_error will raise
   3634     #  InvalidIndexError. Otherwise we fall through and re-raise
   3635     #  the TypeError.
   3636     self._check_indexing_error(key)

KeyError: 'marker'

Workaround:

Manually extracting the marker names by matching the regex "^P[0-9]+S$" works; however, pm.pp.split_signal still does not work (see the code below).

Furthermore, if I remove FSC, SSC, and Time from the FCS files in FlowJo and then export them, the pytometry import function (pm.io.read_fcs) works without any modifications. pm.pp.split_signal is not necessary in this setting.

import re

import readfcs

# Read the file without reindexing, which avoids the KeyError on 'marker'
fcs = readfcs.ReadFCS(path_data + "/file1.fcs")
adata_tmp = fcs.to_anndata(reindex=False)

# Get surface marker names from the metadata: match any meta header key that
# starts with `P` (not `$P`, as in these files), followed by one or more
# digits, and ends with `S`. Each entry is [marker name, channel number].
markers = [
    [adata_tmp.uns["meta"][key], int(re.sub("S$", "", re.sub("^P", "", key)))]
    for key in adata_tmp.uns["meta"].keys()
    if re.match("^P[0-9]+S$", key)
]

# Create a dictionary for index renaming
var_names = adata_tmp.var_names
# adata_tmp.var['dyes'] = adata_tmp.var
marker_dict = {}
for marker, idx in markers:
    marker_dict[var_names[idx - 1]] = marker  # idx - 1 because Python is 0-indexed

# Rename the fluorescent dyes with the marker names
adata_tmp.var.rename(index=marker_dict, inplace=True)

# Move Time, FSC, SSC to obs

### Does not work, adata is deleted
# adata_tmp = pm.pp.split_signal(adata_tmp)
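
A possible explanation for the "deleted" object, offered as an assumption rather than something confirmed in this thread: if pm.pp.split_signal modifies the AnnData object in place and returns None, then assigning its result back to adata_tmp would rebind the name to None. The sketch below illustrates that pattern with a hypothetical stand-in function (`split_in_place` is not part of pytometry) operating on a plain dict.

```python
from typing import Optional


def split_in_place(record: dict, inplace: bool = True) -> Optional[dict]:
    """Hypothetical stand-in for a function that filters its input in place."""
    target = record if inplace else dict(record)
    # Drop the scatter and time entries, keeping only the marker channels
    for key in ("FSC-A", "SSC-A", "Time"):
        target.pop(key, None)
    # In-place calls return None; otherwise the filtered copy is returned
    return None if inplace else target


# Made-up channel values, for illustration only
channels = {"FSC-A": 1.0, "SSC-A": 2.0, "Time": 3.0, "CD3": 4.0}

# Assigning the result of an in-place call rebinds the name to None
result = split_in_place(channels)

# The original object was still modified and remains usable
remaining = sorted(channels)

# Requesting a copy returns the filtered object instead
copy_result = split_in_place({"Time": 0.0, "CD4": 1.0}, inplace=False)
```

If the same behavior applies to split_signal, calling it without assigning the return value would leave adata_tmp intact. Checking the function's signature in the installed pytometry version would confirm or rule this out.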

mbuttner commented 2 years ago

@alefrol638, thank you for posting the issue. @sunnyosun, can you have a look at the KeyError problem, please?

mbuttner commented 2 years ago

Hi @alefrol638, can you report the package versions that you have been using? I am particularly interested in which readfcs version you used. Thanks!

alefrol638 commented 2 years ago

Ah yes, sorry, I forgot to attach the .yml. The version of readfcs is 1.0.3, and the version of pytometry is 0.1.2.

name: pytometry_0.1.2
channels:
  - anaconda
  - defaults
dependencies:
  - backcall=0.2.0=pyhd3eb1b0_0
  - ca-certificates=2022.4.26=hecd8cb5_0
  - certifi=2022.6.15=py38hecd8cb5_0
  - decorator=5.1.1=pyhd3eb1b0_0
  - entrypoints=0.4=py38hecd8cb5_0
  - jedi=0.18.1=py38hecd8cb5_1
  - jupyter_client=7.2.2=py38hecd8cb5_0
  - jupyter_core=4.10.0=py38hecd8cb5_0
  - libcxx=14.0.6=h9765a3e_0
  - libffi=3.3=hb1e8313_2
  - libsodium=1.0.18=h1de35cc_0
  - ncurses=6.3=hca72f7f_3
  - nest-asyncio=1.5.5=py38hecd8cb5_0
  - openssl=1.1.1o=hca72f7f_0
  - parso=0.8.3=pyhd3eb1b0_0
  - pexpect=4.8.0=pyhd3eb1b0_3
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pip=22.1.2=py38hecd8cb5_0
  - ptyprocess=0.7.0=pyhd3eb1b0_2
  - pure_eval=0.2.2=pyhd3eb1b0_0
  - python=3.8.13=hdfd78df_0
  - python-dateutil=2.8.2=pyhd3eb1b0_0
  - readline=8.1.2=hca72f7f_1
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.39.2=h707629a_0
  - stack_data=0.2.0=pyhd3eb1b0_0
  - tk=8.6.12=h5d9f67b_0
  - tornado=6.1=py38h9ed2024_0
  - wcwidth=0.2.5=pyhd3eb1b0_0
  - wheel=0.37.1=pyhd3eb1b0_0
  - xz=5.2.5=hca72f7f_1
  - zeromq=4.3.4=h23ab428_0
  - zlib=1.2.12=h4dc903c_3
  - pip:
    - anndata==0.8.0
    - appnope==0.1.3
    - asttokens==2.0.8
    - attrs==22.1.0
    - bokeh==2.4.3
    - charset-normalizer==2.1.1
    - click==8.1.3
    - cloudpickle==2.2.0
    - colorcet==3.0.0
    - cycler==0.11.0
    - dask==2022.9.0
    - datashader==0.14.2
    - datashape==0.5.2
    - debugpy==1.6.3
    - distributed==2022.9.0
    - executing==1.0.0
    - fastjsonschema==2.16.1
    - fcsparser==0.2.4
    - fonttools==4.37.1
    - fsspec==2022.8.2
    - h5py==3.7.0
    - heapdict==1.0.1
    - idna==3.3
    - importlib-metadata==4.12.0
    - importlib-resources==5.9.0
    - ipykernel==6.15.2
    - ipylab==0.6.0
    - ipython==8.5.0
    - ipywidgets==8.0.2
    - jinja2==3.1.2
    - joblib==1.1.0
    - jsonschema==4.16.0
    - jupyter-client==7.3.4
    - jupyter-core==4.11.1
    - jupyterlab-widgets==3.0.3
    - kiwisolver==1.4.4
    - lamin-logger==0.1.3
    - llvmlite==0.39.1
    - locket==1.0.0
    - loguru==0.6.0
    - markupsafe==2.1.1
    - matplotlib==3.5.3
    - matplotlib-inline==0.1.6
    - msgpack==1.0.4
    - multipledispatch==0.6.0
    - natsort==8.2.0
    - nbclient==0.6.8
    - nbformat==5.4.0
    - nbproject==0.5.2
    - nbproject-test==0.2.2
    - networkx==2.8.6
    - numba==0.56.2
    - numpy==1.23.2
    - orjson==3.8.0
    - packaging==21.3
    - pandas==1.4.4
    - param==1.12.2
    - partd==1.3.0
    - patsy==0.5.2
    - pillow==9.2.0
    - pkgutil-resolve-name==1.3.10
    - prompt-toolkit==3.0.31
    - psutil==5.9.2
    - pyct==0.4.8
    - pydantic==1.10.2
    - pygments==2.13.0
    - pynndescent==0.5.7
    - pyparsing==3.0.9
    - pyrsistent==0.18.1
    - pytometry==0.1.2
    - pytz==2022.2.1
    - pyyaml==6.0
    - pyzmq==23.2.1
    - readfcs==1.0.3
    - requests==2.28.1
    - scanpy==1.9.1
    - scikit-learn==1.1.2
    - scipy==1.9.1
    - seaborn==0.12.0
    - session-info==1.0.0
    - setuptools==59.8.0
    - sortedcontainers==2.4.0
    - stack-data==0.5.0
    - statsmodels==0.13.2
    - stdlib-list==0.8.0
    - tblib==1.7.0
    - threadpoolctl==3.1.0
    - toolz==0.12.0
    - tqdm==4.64.1
    - traitlets==5.3.0
    - typing-extensions==4.3.0
    - umap-learn==0.5.3
    - urllib3==1.26.12
    - widgetsnbextension==4.0.3
    - xarray==2022.6.0
    - zict==2.2.0
    - zipp==3.8.1

mbuttner commented 2 years ago

For now, you can work around this issue by setting reindex=False and manually adjusting the marker names.

adata = pm.io.read_fcs(file, reindex=False)

Or by using the readfcs package (v. 1.0.3) directly:

adata = readfcs.read(file, reindex=False)

I hope that helps you.
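
The manual adjustment of marker names could look roughly like the sketch below. It is a minimal illustration, not the readfcs implementation: `build_marker_map`, the sample metadata, and the channel names are all hypothetical. It assumes the marker names sit under keys of the form PnS or $PnS with a 1-based channel number, as described in the report above.

```python
import re
from typing import Dict, List

# Matches marker-name keys such as "$P3S" or "P3S"; the "$" is optional
MARKER_KEY = re.compile(r"^\$?P([0-9]+)S$")


def build_marker_map(meta: Dict[str, str], channels: List[str]) -> Dict[str, str]:
    """Map channel names to marker names using 1-based PnS keys."""
    mapping = {}
    for key, marker in meta.items():
        match = MARKER_KEY.match(key)
        if match is None:
            continue
        idx = int(match.group(1)) - 1  # PnS keys count from 1
        # Skip empty marker names and out-of-range indices
        if 0 <= idx < len(channels) and marker.strip():
            mapping[channels[idx]] = marker.strip()
    return mapping


# Hypothetical metadata and channel names, for illustration only
meta = {"P1S": "", "P2S": "CD3", "$P3S": "CD4", "P2N": "FITC-A"}
channels = ["FSC-A", "FITC-A", "PE-A"]
marker_map = build_marker_map(meta, channels)
```

The resulting dictionary can then be passed to adata.var.rename(index=marker_map, inplace=True), as in the code posted earlier in this thread. Channels without a marker name keep their original channel name.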

sunnyosun commented 1 year ago

This issue has been fully resolved in readfcs==1.1.0, with this PR.

Thank you for reporting and providing a test file!
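
To check whether an environment already has the fixed release, a small version comparison along these lines may help. This is a sketch only: the helper names are made up for illustration, and the parser handles plain numeric releases such as 1.0.3 rather than the full range of version strings.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(text: str) -> Tuple[int, ...]:
    """Turn a plain release string such as '1.0.3' into (1, 0, 3)."""
    numbers = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'rc1'
        numbers.append(int(piece))
    return tuple(numbers)


def has_fix(installed: str, fixed: str = "1.1.0") -> bool:
    """Return True if the installed release is at least the fixed release."""
    return parse_release(installed) >= parse_release(fixed)


def installed_readfcs() -> Optional[str]:
    """Return the installed readfcs version, or None if it is not installed."""
    try:
        return version("readfcs")
    except PackageNotFoundError:
        return None


current = installed_readfcs()
if current is None:
    print("readfcs is not installed")
elif has_fix(current):
    print(f"readfcs {current} includes the fix")
else:
    print(f"readfcs {current} predates 1.1.0; consider upgrading")
```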

mbuttner commented 1 year ago

Terrific! Thank you!