scverse / muon

muon is a multimodal omics Python framework
https://muon.scverse.org/
BSD 3-Clause "New" or "Revised" License
218 stars 31 forks source link

muon.atac.tl.count_fragments_features atac bug #68

Open Chengwei94 opened 2 years ago

Chengwei94 commented 2 years ago

Hi there,

I think that there is some bug in the count_fragments_features atac features. The gene activity I created from those are very bad looking. This is issue is separate from the other issue where the strand issue is considered. I think the issue might have been with the use of the lil matrix.

The first is with the muon function ,while the second picture is the one which i construct it with a coo matrix(with the strand fix).

image

image

gtca commented 2 years ago

Hey @Chengwei94,

Do you think there's more information you could provide so that we can figure out what's up?

Is this about the latent space that you plot (UMAP) not looking good enough compared to the expectation? What data type is that? (I am not sure about strand-specific chromatin accessibility for instance. Is it a fragments file with strand information?) What features are used to aggregate fragments here (e.g. full gene bodies, gene starts, etc.)?

More importantly, this might be a good start for a discussion of counting strategies. Would you want to count cut sites or fragments? If the latter, what constitutes a fragment belonging to a region (full overlap, partial overlap)?

Might be that the way you count fragments for the second plot is different from what's been currently implemented. I'll try to shed light on the latter (I'll use barcodes and a fragments file from here):

❯ zcat < atac_fragments.tsv.gz | head -4
chr1    10066   10478   TCAAGAACAGTAATAG-1      1
chr1    10072   10191   AACCCGCAGGTAGCTT-1      1
chr1    10073   10340   CGCATATAGGTTACGT-1      2
chr1    10079   10192   GTAAGGTCACCCACAG-1      1
import pandas as pd
from muon import AnnData, atac as ac

obs = pd.read_csv("filtered_feature_bc_matrix/barcodes.tsv.gz", header=None).set_index(0)
adata = AnnData(obs=obs)
adata.uns['files'] = {'fragments': 'atac_fragments.tsv.gz'}

features = pd.DataFrame({"Chromosome": ["chr1"], "Start": [10_000], "End": [10_070]})
ac.tl.count_fragments_features(adata, features=features).X.sum()
# => 1.0

features = pd.DataFrame({"Chromosome": ["chr1"], "Start": [10_000], "End": [10_078]})
ac.tl.count_fragments_features(adata, features=features).X.sum()
# => 4.0
Chengwei94 commented 2 years ago

Hi @gtca,

yep its mainly with the gene activity matrix. I tried using + and - separately with muon atac and it didnt manage to work. My guess is with the original use of sparse lil matrix, I am not sure if it sums up the fragments of each cell correclty (I am not familiar with the use of lil matrix though). The data is the pbmc 10k which is the one you are using. I used the original settings which is gene body + 2k and the 10x gtf file to define the gene intervals

I dont have much opinion on the cutting strategy, but I dont think that any packages implemented cut-site fragment counting yet, so I would be interested to see how it performs if it is an option too.

I changed it something along these lines and it worked for me. (at least the gene activity look similar to the one of seurat gene activity which also count fragment + 2000 promoter length)

` score = [] cell_idx = [] feature_idx = []

for i in range(n_features):
    # cell_count ={k: v for k, v in zip(range(atac.shape[0]), np.repeat(0, atac.shape[0]))}
    f = df.iloc[i, :]
    # Swap downstream and upstream if strand is negative
    if f.strand == "-":
        extend_upstream, extend_downstream = extend_downstream, extend_upstream
        tss = f.end
    else:
        tss = f.start

    for fr in tbx.fetch(
        f.seqname,
        f.start - extend_upstream,
        f.end + extend_downstream,
        multiple_iterators=False,
    ):

        try:
            cell_idx.append(d[fr.name])
            score.append(fr.score)
            feature_idx.append(i)
        except:
            pass

tbx.close()

mx = coo_matrix((score, (cell_idx, feature_idx)), shape=(n, n_features))
mx = mx.tocsr()`