katzlabbrandeis / blech_clust

GNU General Public License v3.0

blech_run_QA error #176

Open emmalala123 opened 5 months ago

emmalala123 commented 5 months ago

Not sure if this is normal or a me thing:

Running Drift test
Processing : /media/cmazzio/storage/eb_ephys/EB18_behandephys_5_21_cue_align/
Traceback (most recent call last):
  File "utils/qa_utils/drift_check.py", line 92, in <module>
    spike_trains = get_spike_trains(metadata_handler.hdf5_name)
  File "utils/qa_utils/drift_check.py", line 42, in get_spike_trains
    dig_ins = hf5.list_nodes('/spike_trains')
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/tables/file.py", line 1962, in list_nodes
    group = self.get_node(where)  # Does the parent exist?
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/tables/file.py", line 1607, in get_node
    node = self._get_node(nodepath)
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/tables/file.py", line 1556, in _get_node
    node = self._node_manager.get_node(nodepath)
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/tables/file.py", line 417, in get_node
    node = self.node_factory(key)
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/tables/group.py", line 1137, in _g_load_child
    node_type = self._g_check_has_child(childname)
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/tables/group.py", line 375, in _g_check_has_child
    raise NoSuchNodeError(
tables.exceptions.NoSuchNodeError: group / does not have a child named /spike_trains

abuzarmahmood commented 5 months ago

For reference, the QA step is optional and doesn't impact downstream processing.

But it is weird that it can't find spike trains. Could you check whether your HDF5 file actually has spike trains? You could use either vitables or hdfview. Or, if you have h5dump, you could run h5dump -n <path to file> | egrep spike_train and check whether you get any output.

If you don't have spike trains, try rerunning blech_make_arrays.py and check for errors.
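The same check can also be done from Python with PyTables directly. This is just a sketch: the file path and group here are stand-ins created for the demo (in practice you would open your dataset's .h5 file and skip the creation step).

```python
import os
import tempfile

import tables

# Stand-in file created just for this demo -- in practice, point
# hdf5_path at your dataset's .h5 file and skip this creation step
hdf5_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with tables.open_file(hdf5_path, "w") as hf5:
    hf5.create_group("/", "spike_trains")

# The actual check: does the file contain a /spike_trains group?
with tables.open_file(hdf5_path, "r") as hf5:
    has_spike_trains = "/spike_trains" in hf5
    print("Has /spike_trains:", has_spike_trains)
```

If this prints False on a real dataset, blech_make_arrays.py has not (successfully) written the spike trains yet.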

emmalala123 commented 5 months ago

I ran it after blech_make_arrays.py and got the following error:

==============================
Running QA tests on Blech data
Directory: /media/cmazzio/storage/eb_ephys/EB18_behandephys_5_21_cue_align

Running Similarity test
Processing : /media/cmazzio/storage/eb_ephys/EB18_behandephys_5_21_cue_align/
==================
Similarity calculation starting
Similarity cutoff ::: 50
32it [00:06,  4.96it/s]
Similarity calculation complete, results being saved to file
==================

Running Drift test
Processing : /media/cmazzio/storage/eb_ephys/EB18_behandephys_5_21_cue_align/
/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
  File "utils/qa_utils/drift_check.py", line 122, in <module>
    zscore_binned_spike_trains = [zscore(x, axis=-1) for x in plot_spike_trains]
  File "utils/qa_utils/drift_check.py", line 122, in <listcomp>
    zscore_binned_spike_trains = [zscore(x, axis=-1) for x in plot_spike_trains]
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/scipy/stats/_stats_py.py", line 2730, in zscore
    return zmap(a, a, axis=axis, ddof=ddof, nan_policy=nan_policy)
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/scipy/stats/_stats_py.py", line 2876, in zmap
    contains_nan, nan_policy = _contains_nan(a, nan_policy)
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/scipy/stats/_stats_py.py", line 97, in _contains_nan
    contains_nan = np.isnan(np.sum(a))
  File "<__array_function__ internals>", line 5, in sum
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2241, in sum
    return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
  File "/home/cmazzio/miniconda3/envs/blech_clust/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: operands could not be broadcast together with shapes (413,) (427,) 

Finished QA tests
==============================
abuzarmahmood commented 5 months ago

Thanks for pointing out these bugs. The QA tests need to be moved to after make_arrays in the flow-chart. And it seems like there's another issue with drift_check from your latest commit. I'll have a look.

Mraymon5 commented 3 months ago

Is there a definitive sketch of the flowchart, or an otherwise detailed walk-through? I might be able to figure out the nomnoml.com code to make a new one, but I'm not sure I'd have the right outline.

abuzarmahmood commented 3 months ago

The flowchart was supposed to be definitive 😅 I've updated it to have QA after make arrays in the above branch. I'll go ahead and merge it. Please reopen the issue if the problem persists or if I missed something. Thank you.

Mraymon5 commented 2 months ago

I've just gotten back around to this part of the pipeline, and I'm having the same problem as the second error @emmalala123 describes:

Running Drift test
Processing : /home/ramartin/Documents/MAR_Data/MR03/MR03_BAT_Tastes_Day6_240526_131433/
/home/ramartin/anaconda3/envs/blech_test/lib/python3.8/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
  File "utils/qa_utils/drift_check.py", line 122, in <module>
    zscore_binned_spike_trains = [zscore(x, axis=-1) for x in plot_spike_trains]
  File "utils/qa_utils/drift_check.py", line 122, in <listcomp>
    zscore_binned_spike_trains = [zscore(x, axis=-1) for x in plot_spike_trains]
  File "/home/ramartin/anaconda3/envs/blech_test/lib/python3.8/site-packages/scipy/stats/stats.py", line 2410, in zscore
    contains_nan, nan_policy = _contains_nan(a, nan_policy)
  File "/home/ramartin/anaconda3/envs/blech_test/lib/python3.8/site-packages/scipy/stats/stats.py", line 257, in _contains_nan
    contains_nan = np.isnan(np.sum(a))
  File "<__array_function__ internals>", line 5, in sum
  File "/home/ramartin/anaconda3/envs/blech_test/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2241, in sum
    return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
  File "/home/ramartin/anaconda3/envs/blech_test/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: operands could not be broadcast together with shapes (210,) (630,)

So I'm reopening this, and I'll start trying to dig into that.

Mraymon5 commented 2 months ago

I think I'm starting to get the shape of the problem.

plot_spike_trains is a list, each element of which seems to be a tuple, and each tuple contains two (always? sometimes?) ndarrays. When Python runs zscore(), it tries to combine the two arrays in order to zscore them together, but they're (always? sometimes?) different sizes, so it fails.

For me, at least in the data set I'm handling, the length of plot_spike_trains is 18, which is the same as the # of saved units I have, so I'm guessing that each position of plot_spike_trains contains a tuple that corresponds to 1 saved unit. Furthermore, at least in my data, all 18 of those tuples contain exactly 2 ndarrays. While I feel somewhat comfortable guessing that each tuple corresponds to a saved unit, I have no idea what the two ndarrays correspond to, or how I should be processing them.

My instinct is to match the form of the source: plot_spike_trains is a list of tuples of arrays, so I'm inclined to break each tuple down into its component arrays, zscore each array individually, and then pack them back up into that 2x18 sort of structure, so that zscore_binned_spike_trains is a list of 18 tuples, each of which contains 2 ndarrays, each of which has been individually zscored.

But I'm really not confident that is the right answer. I think there are two essential questions I have about the data, though: 1) Is each tuple supposed to have 2 arrays in it, or is one of them an accident? 2) If there are supposed to be 2 arrays, do we want to zscore and store both of them back in zscore_binned_spike_trains, or only one of them? Like, I could imagine that the first array in each tuple is data of type A, and the second array is data of type B, and we only want data of type A to go into zscore_binned_spike_trains. But I have no idea.
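A minimal sketch of that per-array idea, with toy data standing in for plot_spike_trains (the shapes here are made up; the point is just that zscoring each stimulus's array separately sidesteps the ragged-broadcast failure):

```python
import numpy as np
from scipy.stats import zscore

# Toy stand-in for plot_spike_trains: a list (one entry per unit) of
# tuples (one array per stimulus), where stimuli can have unequal
# trial counts -- here 3 vs 5 trials, 7 bins each, flattened
rng = np.random.default_rng(0)
plot_spike_trains = [
    (rng.random(3 * 7), rng.random(5 * 7)),  # unit 0
    (rng.random(3 * 7), rng.random(5 * 7)),  # unit 1
]

# zscore each stimulus's array on its own, instead of letting scipy
# try to broadcast the ragged tuple into a single ndarray
zscore_binned_spike_trains = [
    tuple(zscore(arr, axis=-1) for arr in unit)
    for unit in plot_spike_trains
]

print(len(zscore_binned_spike_trains))         # 2 (one entry per unit)
print(zscore_binned_spike_trains[0][0].shape)  # (21,)
```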

Mraymon5 commented 2 months ago

Okay, I looked a little closer, and I think I've actually figured it out! The structure of plot_spike_trains is: A list of length [# of saved units], where each element of the list is a tuple of length [# of stimuli], and each element of the tuple is an array that contains spike train data for one stimulus for one saved unit.

Now that I think I understand the data, at least somewhat, I think I can move forward on fixing the issue.

Mraymon5 commented 2 months ago

Even further characterization of the problem: at least in my case, part of the problem is that I don't have an equal number of stimulus presentations; one of the arrays is 210 (30 trials x 7 samples per trial post binning), while the other is 630 (90 trials x 7 samples per trial post binning). I'm guessing the error @emmalala123 got had exactly the same root: 59 trials of 1 stimulus vs 61 trials of the other, giving 413 and 427 bins.
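For what it's worth, the numbers check out for both reports, assuming the same 7 bins per trial in each dataset:

```python
# Sanity check: uneven trial counts times a fixed number of bins per
# trial reproduce the mismatched lengths from both tracebacks
bins_per_trial = 7  # assumed identical across both datasets

print(30 * bins_per_trial, 90 * bins_per_trial)  # 210 630 (my data)
print(59 * bins_per_trial, 61 * bins_per_trial)  # 413 427 (emmalala123's)
```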

This explains why we're the only ones who've run into it; for IOC data, it makes sense that every stimulus would (usually) have the same # of trials, though I can imagine scenarios where that might not be the case. In behavioral data, animals can skip trials, so there will inevitably be uneven #s of stimulus presentations.

So now I'm 100% sure I know what the problem is, but I'm not sure what the most appropriate solution is, in terms of reconciling the nature of the data with the intended behavior of the analysis.

abuzarmahmood commented 2 months ago

Since QA testing is not "necessary" for pipeline function, I'm going to first focus on #82 so it's easier to troubleshoot uneven trial related issues in the future. Right now, I'm going to fish for a dataset with uneven trials (I think I have one, but if I can't find it, I'll ask you for yours). Once that is in place, I can work through this error. Thanks for figuring out where the problem is!

Mraymon5 commented 2 months ago

I did also figure out a fix for this particular problem. There are probably a lot of different solutions, but I just had it pad out uneven trial numbers with NA values to square the arrays up for plotting, etc. See: #207
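A minimal sketch of that padding idea (toy arrays, not the actual #207 implementation): pad each stimulus's trial axis with NaNs up to the maximum trial count, then stack. NaN-aware stats (np.nanmean etc.) will ignore the padding.

```python
import numpy as np

# Toy per-stimulus arrays with uneven trial counts: 3 trials vs 5
# trials of the same stimulus-agnostic shape, 7 time bins each
trains = [np.ones((3, 7)), np.ones((5, 7))]

# Pad every array's trial axis with NaN rows up to the maximum trial
# count, so the stimuli can be stacked into one rectangular array
max_trials = max(t.shape[0] for t in trains)
padded = [
    np.pad(t, ((0, max_trials - t.shape[0]), (0, 0)),
           constant_values=np.nan)
    for t in trains
]
stacked = np.stack(padded)
print(stacked.shape)  # (2, 5, 7): stimuli x max_trials x bins
```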