mortazavilab / swan_vis

A Python library to visualize and analyze long-read transcriptomes
https://freese.gitbook.io/swan/
MIT License

MemoryError: Unable to allocate 113. GiB for an array with shape (9992, 3033596) and data type float32 #23

Closed catsargent closed 1 year ago

catsargent commented 1 year ago

Hi,

I am keen to use swanvis to explore the results of running TALON on our single cell dataset. Unfortunately, when reading in the filtered abundance information into the swan graph, I get a memory error.

[Screenshot: MemoryError traceback, 2023-02-13 at 14:28:52]

We have 9992 cells and 17,808 transcripts. The error comes from trying to allocate an array with shape (9992, 3033596); I am not sure what the 3033596 refers to. I tried increasing the memory allocation to 200 GB on the HPC but it still fails, and I am not granted more resources than that for my job. Do you have any suggestions for how to get around this?

Many thanks, Catherine
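(For reference, the ~113 GiB requested is exactly what a dense float32 array of the reported shape would need, at 4 bytes per element:)

```python
# Sanity check of the reported allocation: a dense float32 array of
# shape (9992, 3033596) at 4 bytes per element.
cells, features = 9992, 3033596
gib = cells * features * 4 / 2**30
print(f"{gib:.1f} GiB")  # 112.9 GiB, matching the ~113 GiB in the error
```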

fairliereese commented 1 year ago

Hi, could you please paste the whole error message here? Thanks!


catsargent commented 1 year ago

Sure. First of all, there were lots of these warnings:

df[total_col] = df[c].sum()
/projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/swan_vis/utils.py:456: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df[cond_col] = (df[c]*1000000)/df[total_col]
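(For reference, the pattern this PerformanceWarning recommends is collecting the new columns first and joining them in a single `pd.concat` call, rather than inserting columns one at a time. A minimal sketch with hypothetical column names:)

```python
import numpy as np
import pandas as pd

# De-fragmented pattern suggested by the warning: build all new
# per-condition columns first, then join them once with pd.concat
# instead of inserting each column individually.
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  columns=["c1", "c2", "c3"])
tpm = {f"{c}_tpm": df[c] * 1_000_000 / df[c].sum() for c in df.columns}
df = pd.concat([df, pd.DataFrame(tpm)], axis=1)  # single concat, no fragmentation
print(df.columns.tolist())
```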

And then this error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[16], line 2
      1 # add each dataset's abundance information to the SwanGraph
----> 2 sg.add_abundance(ab_file)

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/swan_vis/swangraph.py:346, in SwanGraph.add_abundance(self, counts_file)
    343         self.adata.layers['pi'] = calc_pi(self.adata, self.t_df)[0].to_numpy()
    345 # add abundance for edges, TSS per gene, and TES per gene
--> 346 self.create_edge_adata()
    347 self.create_end_adata(kind='tss')
    348 self.create_end_adata(kind='tes')

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/swan_vis/swangraph.py:512, in SwanGraph.create_edge_adata(self)
    509 t_exp_df = pd.DataFrame(columns=obs, data=data, index=tid)
    511 # merge counts per transcript with edges
--> 512 edge_exp_df = edge_exp_df.merge(t_exp_df, how='left',
    513     left_index=True, right_index=True)
    515 # sum the counts per transcript / edge / dataset
    516 edge_exp_df = edge_exp_df.groupby('edge_id').sum()

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/frame.py:10090, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10071 @Substitution("")
  10072 @Appender(_merge_doc, indents=2)
  10073 def merge(
   (...)
  10086     validate: str | None = None,
  10087 ) -> DataFrame:
  10088     from pandas.core.reshape.merge import merge
> 10090     return merge(
  10091         self,
  10092         right,
  10093         how=how,
  10094         on=on,
  10095         left_on=left_on,
  10096         right_on=right_on,
  10097         left_index=left_index,
  10098         right_index=right_index,
  10099         sort=sort,
  10100         suffixes=suffixes,
  10101         copy=copy,
  10102         indicator=indicator,
  10103         validate=validate,
  10104     )

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/reshape/merge.py:124, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @Substitution("\nleft : DataFrame or named Series")
     94 @Appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
    110     op = _MergeOperation(
    111         left,
    112         right,
   (...)
    122         validate=validate,
    123     )
--> 124     return op.get_result(copy=copy)

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/reshape/merge.py:775, in _MergeOperation.get_result(self, copy)
    771     self.left, self.right = self._indicator_pre_merge(self.left, self.right)
    773 join_index, left_indexer, right_indexer = self._get_join_info()
--> 775 result = self._reindex_and_concat(
    776     join_index, left_indexer, right_indexer, copy=copy
    777 )
    778 result = result.__finalize__(self, method=self._merge_type)
    780 if self.indicator:

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/reshape/merge.py:766, in _MergeOperation._reindex_and_concat(self, join_index, left_indexer, right_indexer, copy)
    764 left.columns = llabels
    765 right.columns = rlabels
--> 766 result = concat([left, right], axis=1, copy=copy)
    767 return result

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/reshape/concat.py:381, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    159 """
    160 Concatenate pandas objects along a particular axis.
    161 
   (...)
    366 1   3   4
    367 """
    368 op = _Concatenator(
    369     objs,
    370     axis=axis,
   (...)
    378     sort=sort,
    379 )
--> 381 return op.get_result()

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/reshape/concat.py:616, in _Concatenator.get_result(self)
    612             indexers[ax] = obj_labels.get_indexer(new_labels)
    614     mgrs_indexers.append((obj._mgr, indexers))
--> 616 new_data = concatenate_managers(
    617     mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    618 )
    619 if not self.copy:
    620     new_data._consolidate_inplace()

File /projects/b1177/pythonenvs/scanpy-env/lib/python3.9/site-packages/pandas/core/internals/concat.py:212, in concatenate_managers(mgrs_indexers, axes, concat_axis, copy)
    210 values = blk.values
    211 if copy:
--> 212     values = values.copy()
    213 else:
    214     values = values.view()

MemoryError: Unable to allocate 113. GiB for an array with shape (9992, 3033596) and data type float32

Thanks for the quick response!
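(For context on where the huge shape comes from: per the traceback, `create_edge_adata` merges the dense per-dataset transcript counts onto a table with one row per transcript-edge pair, so each transcript's count columns are duplicated across all of its edges before `groupby('edge_id').sum()` collapses them. A toy illustration with made-up transcript and edge IDs:)

```python
import pandas as pd

# Toy version of the failing step: counts per transcript get one row per
# edge the transcript uses, multiplying the dense matrix's row count,
# then groupby('edge_id').sum() aggregates back down to edges.
t_exp = pd.DataFrame({"cellA": [10, 4], "cellB": [0, 7]},
                     index=["tx1", "tx2"])                 # counts per transcript
edges = pd.DataFrame({"edge_id": ["e1", "e2", "e2", "e3"]},
                     index=["tx1", "tx1", "tx2", "tx2"])   # edges per transcript
merged = edges.merge(t_exp, how="left", left_index=True, right_index=True)
print(len(merged))                        # 4 rows: one per (transcript, edge) pair
edge_exp = merged.groupby("edge_id").sum()
print(edge_exp)                           # e2 sums tx1 and tx2 counts
```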

fairliereese commented 1 year ago

I'll see if there's anything I can do to decrease the memory usage at this step, but in the meantime, have you filtered your cells and transcripts down to the final set that you're planning to use for the analysis? I definitely recommend doing this before using Swan!

catsargent commented 1 year ago

I had filtered down to the set of cells passing QC in the short-read dataset, and then filtered transcripts as described in your paper, i.e. for unknown transcripts, at least 1 count in a minimum of 4 cells and not flagged as internal priming.
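(The transcript filter described above can be sketched like this on a toy counts matrix; the real TALON abundance table has annotation columns ahead of the per-cell counts, so you would select the count columns first:)

```python
import pandas as pd

# Keep transcripts with >=1 count in at least 4 cells. Transcript and
# cell names are made up for illustration.
counts = pd.DataFrame({
    "cell1": [0, 5, 1, 0],
    "cell2": [0, 3, 1, 0],
    "cell3": [1, 2, 1, 0],
    "cell4": [0, 4, 1, 2],
    "cell5": [0, 1, 1, 0],
}, index=["tx1", "tx2", "tx3", "tx4"])

keep = (counts >= 1).sum(axis=1) >= 4   # expressed in >= 4 cells
filtered = counts.loc[keep]
print(filtered.index.tolist())  # ['tx2', 'tx3']
```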

To get around the memory issue, I have now filtered to just two genes and respective transcripts that we're particularly interested in.

fairliereese commented 1 year ago

Hi there,

Sorry it has taken me so long to respond. I've added a few new initialization options that might help with your problem. By default, Swan generates expression matrices for your transcripts as well as for TSSs, TESs, and individual exons; the last of these, as you might imagine, ends up having a lot of features.

The options I have added turn off the creation of these expression matrices. If you don't need them, run your SwanGraph initialization as `sg = swan.SwanGraph(sc=True, edge_adata=False, end_adata=False)`.

You'll have to install from the latest commits. Let me know if you are able to give this a try!