scverse / muon

muon is a multimodal omics Python framework
https://muon.scverse.org/
BSD 3-Clause "New" or "Revised" License
214 stars 30 forks source link

error in function ac.pl.fragment_histogram(atac, region=''") #51

Closed Raselel closed 1 year ago

Raselel commented 2 years ago

Hello, thanks for muon!

I was trying to reproduce the steps described in the ATAC tutorial (https://muon-tutorials.readthedocs.io/en/latest/single-cell-rna-atac/pbmc10k/2-Chromatin-Accessibility-Processing.html) on my independent dataset, but I get an error from the ac.pl.fragment_histogram function.

The code:

import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import scipy
import anndata2ri

## Plotting utils
import matplotlib.pyplot as plt
import matplotlib
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=150, color_map='viridis', transparent=True,
                              ipython_format="png2x")
sc.logging.print_header()

import muon as mu
# Import a module with ATAC-seq-related functions
from muon import atac as ac

atac = mu.read(path + "output/file_postrnatutorial.h5mu")

atac.var['mt'] = atac.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'

sc.pp.calculate_qc_metrics(atac, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
mu.pp.filter_var(atac, 'n_cells_by_counts', lambda x: x >= 10)
mu.pp.filter_obs(atac, 'n_genes_by_counts', lambda x: (x >= 2000) & (x <= 40000))
mu.pp.filter_obs(atac, 'total_counts', lambda x: (x >= 4000) & (x <= 100000))
atac.obs['NS']=1
ac.pl.fragment_histogram(atac, region='chr1:9808-10702')

The error is:

ValueError                                Traceback (most recent call last)
<ipython-input-16-0ef9424cd717> in <module>
----> 1 ac.pl.fragment_histogram(K29TIL_UD_atac, region='chr1:9808-10702')

/usr/local/lib/python3.6/dist-packages/muon/_atac/plot.py in fragment_histogram(data, region, groupby, barcodes)
    372     else:
    373         # Handle sns.distplot deprecation and sns.histplot addition
--> 374         g = hist(fragments.length, **kwargs)
    375         g.set_xlabel("Fragment length (bp)")
    376     g.set(xlim=(0, 1000))

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)
   1434             estimate_kws=estimate_kws,
   1435             line_kws=line_kws,
-> 1436             **kwargs,
   1437         )
   1438 

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)
    435 
    436             # Do the histogram computation
--> 437             heights, edges = estimator(observations, weights=weights)
    438 
    439             # Rescale the smoothed curve to match the histogram

/usr/local/lib/python3.6/dist-packages/seaborn/_statistics.py in __call__(self, x1, x2, weights)
    369         """Count the occurrances in each bin, maybe normalize."""
    370         if x2 is None:
--> 371             return self._eval_univariate(x1, weights)
    372         else:
    373             return self._eval_bivariate(x1, x2, weights)

/usr/local/lib/python3.6/dist-packages/seaborn/_statistics.py in _eval_univariate(self, x, weights)
    346         bin_edges = self.bin_edges
    347         if bin_edges is None:
--> 348             bin_edges = self.define_bin_edges(x, weights=weights, cache=False)
    349 
    350         density = self.stat == "density"

/usr/local/lib/python3.6/dist-packages/seaborn/_statistics.py in define_bin_edges(self, x1, x2, weights, cache)
    264 
    265             bin_edges = self._define_bin_edges(
--> 266                 x1, weights, self.bins, self.binwidth, self.binrange, self.discrete,
    267             )
    268 

/usr/local/lib/python3.6/dist-packages/seaborn/_statistics.py in _define_bin_edges(self, x, weights, bins, binwidth, binrange, discrete)
    252         elif binwidth is not None:
    253             step = binwidth
--> 254             bin_edges = np.arange(start, stop + step, step)
    255         else:
    256             bin_edges = np.histogram_bin_edges(

ValueError: arange: cannot compute length

The correct fragment file is stored in atac.uns['files'] and I've installed the relevant dependencies with pip install 'muon[atac]'. Any idea what the issue might be?

System

Thanks! Rasa

gtca commented 2 years ago

Hey @Raselel, thanks for reporting that.

Raselel commented 2 years ago

Thanks for a quick reply.

gtca commented 2 years ago

Thanks! Seaborn 0.11.1 seems to work for the PMBC10k and a few other datasets. I also tested the versions mentioned above in a clean environment (and pysam=0.18.0), and the following still seems to work:

import muon as mu
from muon import atac as ac

atac = mu.read("brain3k_processed.h5mu/atac")
ac.pl.fragment_histogram(atac, region='chr1:9808-10702')
  1. Is it a problem with a particular region? I.e. would it work for a larger region or other regions in general?
  2. Are other fragments-related parts functional, e.g. nucleosome signal or TSS enrichment in that notebook?

Also maybe clarifying a few things about the code above will help to understand the issue better.

Is atac above actually a MuData or an AnnData object? While the processing makes sense for an AnnData object, the reading function probably spits out a MuData?

mu.read(path + "output/file_postrnatutorial.h5mu")

And then, if it's an AnnData with e.g. peak counts and chr1:9808-10702 is in .var, there are also probably no var_names in it starting with MT-?

atac.var['mt'] = atac.var_names.str.startswith('MT-')
Raselel commented 2 years ago
  1. I have tried a few regions, but none plot.
  2. The nucleosome part plotting looks odd too. Screen Shot 2022-02-23 at 7 24 50 PM

I think you are right about the object: the atac.var does not have genes in var_names

Screen Shot 2022-02-23 at 7 19 21 PM

This is the structure of the object, if helps.

Screen Shot 2022-02-23 at 7 19 07 PM

Should I read the file differently?

Thanks!

gtca commented 2 years ago

Hey @Raselel, to get back on this, I wasn't able to reproduce it on several scATAC-seq / multiome datasets. Was this issue resolved in any way? If not, should we figure out a way to share a portion of this dataset with us so that we can debug it?

connersk commented 1 year ago

I had this same bug when I had an ATAC anndata object where I removed the "-1" from the end of the barcodes within adata.obs.index (i.e. barcode name AAACAGCCAACTAACT-1 changed to AAACAGCCAACTAACT). Adding back in the "-1" to the names in adata.obs.index fixed the problem.

gtca commented 1 year ago

Thanks, @connersk, we'll try to reproduce that and fix it!

gtca commented 1 year ago

Indeed, the reason seems to have been that the cell identifiers were renamed. The fragments file contains cell identifiers as well, and together with the ones in the in-memory object, those are used to link fragment -> cell information.

The solution for this is already implemented however: functions such as ac.pl.fragment_histogram and ac.tl.nucleosome_signal have the barcodes parameter. Generally, it makes sense to keep the original barcode in the metadata, and with barcodes="orig_barcode" it can be used for fragments-related functionality.

This will be available for all the functions operating on fragments in the next versions.