greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. #155

Closed: yagmuronay closed this issue 3 years ago

yagmuronay commented 3 years ago

Dear Dr. Greg,

I was trying to run your process_data.py script from the nbconverted scripts. This resulted in an error, which I believe was caused by a missing 'hugoSymbol' column in tsg_df, and also by the fact that indexing with a list that contains missing labels is deprecated. Please find the exact output below. The pandas documentation says:

Changed in version 1.0.0.

Using .loc or [] with a list with one or more missing labels will no longer reindex, in favor of .reindex.

Could I simply use .reindex() instead of .loc in this case? I would be very grateful for your help with this issue. Thank you.

Kind regards, Yagmur


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  status_gain = copy_gain_df.loc[:, oncogenes_df['hugoSymbol']]
/mnt/lsf-nas-1/os-shared/anaconda3/envs/tybalt/lib/python3.5/site-packages/pandas/core/indexing.py:1367: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
tybalt/scripts/nbconverted/process_data.py:255: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  status_loss = copy_loss_df.loc[:, tsg_df['hugoSymbol']]
tybalt/scripts/nbconverted/process_data.py:270: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  copy_status = copy_status.loc[:, mutation_status.columns].fillna(0).astype(int)
gwaybio commented 3 years ago

nice catch @yagmuronay - although I am not sure that your solution will work. It is possible that the underlying data the notebook loads has changed in some way.

response = requests.get('http://oncokb.org/api/v1/genes')
oncokb_df = pd.read_json(response.content)
oncokb_df.to_csv(oncokb_out_file, sep='\t')

# Integrate copy number, oncokb gene-type, and mutation status to define status matrix
oncogenes_df = oncokb_df[oncokb_df['oncogene']]
tsg_df = oncokb_df[oncokb_df['tsg']]

# Subset copy gains by oncogenes and copy losses by tumor suppressors (tsg)
status_gain = copy_gain_df.loc[:, oncogenes_df['hugoSymbol']]
status_loss = copy_loss_df.loc[:, tsg_df['hugoSymbol']]
copy_status = pd.concat([status_gain, status_loss], axis=1)

what does tsg_df look like?
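
One way to answer that question is to check that the column exists and to list which gene symbols have no matching copy-number column. The sketch below is only illustrative: the toy tables are hypothetical stand-ins for oncokb_df and copy_loss_df, and the real OncoKB response may look different.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables; the actual OncoKB response may differ.
oncokb_df = pd.DataFrame(
    {
        "hugoSymbol": ["TP53", "KRAS", "RB1"],
        "oncogene": [False, True, False],
        "tsg": [True, False, True],
    }
)
copy_loss_df = pd.DataFrame({"TP53": [1, 0]}, index=["sample_a", "sample_b"])

tsg_df = oncokb_df[oncokb_df["tsg"]]

# Confirm the column the script relies on is actually present.
has_symbol_column = "hugoSymbol" in tsg_df.columns

# Symbols with no matching copy-loss column are the "missing labels"
# that the FutureWarning refers to.
missing = sorted(set(tsg_df["hugoSymbol"]) - set(copy_loss_df.columns))

print(tsg_df.head())
print(missing)
```

If `missing` is non-empty but `hugoSymbol` is present, the warning comes from genes absent in the copy-number data rather than from a change in the OncoKB schema.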

yagmuronay commented 3 years ago

Dear Dr. Greg (@gwaygenomics),

thank you so much for your quick reply. Both tsg_df and oncogenes_df do indeed have a column named "hugoSymbol". I found out later that the output was written as expected. Previously, I had looked in the wrong path, the one for the raw data. All in all, these were only warnings. Thank you so much for your time!
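
For anyone who hits the same FutureWarning on a newer pandas (where a missing label in .loc raises KeyError instead of warning), here is a minimal sketch of the two usual alternatives. The toy data and the `genes` list are hypothetical, not taken from the repository. Which option matches the original script's intent is an assumption that should be checked against the expected output.

```python
import pandas as pd

# Hypothetical toy data standing in for copy_gain_df and the OncoKB gene list.
copy_gain_df = pd.DataFrame(
    {"TP53": [0, 1], "KRAS": [1, 0]}, index=["sample_a", "sample_b"]
)
genes = ["KRAS", "MYC"]  # "MYC" has no column in copy_gain_df

# Option 1: .reindex keeps every requested label; absent columns become NaN.
status_gain = copy_gain_df.reindex(columns=genes)

# Option 2: select only the labels that exist, so no NaN columns appear.
present = [gene for gene in genes if gene in copy_gain_df.columns]
status_gain_present = copy_gain_df.loc[:, present]

print(status_gain)
print(status_gain_present)
```

Option 1 reproduces the old .loc behaviour (NaN-filled columns that a later .fillna(0) can zero out), while Option 2 silently drops genes without data.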