kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

[Request] Improve Documentation for tabulate #63

Open gcohenJH opened 2 years ago

gcohenJH commented 2 years ago

Just a tiny issue. I was going through your meta-clonotype discovery and tabulation script, modifying it to work with my mouse data, but I kept having issues with the tabulation step. I got an error saying that tabulate requires a column named 'productive_frequency' in the bulk data, clone_df2, yet this is not specified anywhere in the documentation. The other required columns (cdr3, v, j, and count) are already in the dataframe after using TCRrep. After adding a frequency column, it worked as expected!
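For anyone hitting the same error, here is a minimal sketch of the workaround described above. It assumes that the frequency is simply each clone's count divided by the total count in the sample, which may not match how productive_frequency is defined in other pipelines. The toy dataframe and the cdr3_b_aa column name are illustrative only; real data would come from tr_bulk.clone_df.

```python
import pandas as pd

# Toy stand-in for the bulk clone table (illustrative values only).
bulk_df = pd.DataFrame({
    "cdr3_b_aa": ["CASSLGQAYEQYF", "CASSPGTGGYEQYF", "CASSQDRGNTEVFF"],
    "count": [6, 3, 1],
})

# Assumption: frequency = clone count / total count across the sample.
bulk_df["productive_frequency"] = bulk_df["count"] / bulk_df["count"].sum()

print(bulk_df)
```

If the column is added to the input dataframe before it is passed to TCRrep, it is worth checking that it is still present in clone_df afterwards.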

Thank you so much for the very useful package!

kmayerb commented 2 years ago

Hi gcohenJH,

Thanks for sharing your experience and pointing out an area for improvement in the docs.

Out of curiosity, did you happen to see the TCRjoin feature for tabulation? I think it will allow tabulation without a frequency column:

https://tcrdist3.readthedocs.io/en/latest/join.html

Could you add a snippet of where you got the error?

Thanks, k

gcohenJH commented 2 years ago

Sure. Here's the chunk where I'm getting the error. It's exactly the same as https://tcrdist3.readthedocs.io/en/latest/metaclonotypes.html. Preceding this is just the code where I load my bulk data and rename columns.

tr_search.cpus = 4
tic = time.perf_counter()
tr_search.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, chunk_size = 50, radius = 50) 
results = tabulate(clone_df1 = tr_search.clone_df, clone_df2 = tr_bulk.clone_df, pwmat = tr_search.rw_beta)
toc = time.perf_counter()
print(f"TABULATED IN {toc - tic:0.4f} seconds")

Here's the error I was getting when I didn't include the productive_frequency column.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'productive_frequency'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13016/817900842.py in <module>
      2 tic = time.perf_counter()
      3 tr_search.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, chunk_size = 50, radius = 50)
----> 4 results = tabulate(clone_df1 = tr_search.clone_df, clone_df2 = tr_bulk.clone_df, pwmat = tr_search.rw_beta)
      5 toc = time.perf_counter()
      6 print(f"TABULATED IN {toc - tic:0.4f} seconds")

~\Miniconda3\envs\tcrdist3\lib\site-packages\tcrdist\tabulate.py in tabulate(clone_df1, clone_df2, pwmat, cdr3_name, v_gene_name, j_gene_name)
     85         # Retrieve abundances from the bulk clone df
     86         icounts    = [clone_df2['count'].iloc[x].to_list()                 for x in icol]
---> 87         ifreqs     = [clone_df2['productive_frequency'].iloc[x].to_list()  for x in icol]
     88 
     89         isumcounts    = [np.sum(x) for x in icounts]

~\Miniconda3\envs\tcrdist3\lib\site-packages\tcrdist\tabulate.py in <listcomp>(.0)
     85         # Retrieve abundances from the bulk clone df
     86         icounts    = [clone_df2['count'].iloc[x].to_list()                 for x in icol]
---> 87         ifreqs     = [clone_df2['productive_frequency'].iloc[x].to_list()  for x in icol]
     88 
     89         isumcounts    = [np.sum(x) for x in icounts]

~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'productive_frequency'

As far as I can tell, tabulate looks up a column named 'productive_frequency' in clone_df2 (line 87 of tabulate.py in the traceback), and pandas can't find that column in the dataframe, so it raises a KeyError.
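A small check before calling tabulate would surface the problem with a clearer message. The sketch below is only an illustration: missing_columns is a hypothetical helper, not part of tcrdist3, and the required list contains only the two columns that the traceback shows being read from clone_df2 ('count' and 'productive_frequency').

```python
import pandas as pd

def missing_columns(df, required=("count", "productive_frequency")):
    """Hypothetical helper: return the required column names absent from df."""
    return [col for col in required if col not in df.columns]

# Toy bulk table that lacks the frequency column (illustrative values only).
bulk_df = pd.DataFrame({"count": [6, 3, 1]})

missing = missing_columns(bulk_df)
if missing:
    print(f"clone_df2 is missing columns: {missing}")
```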

join_by_dist seems more like what I would want for tabulation though. Thank you for the help.