Open Chengwei94 opened 2 years ago
Hey @Chengwei94,
Do you think there's more information you could provide so that we can figure out what's up?
Is this about the latent space that you plot (UMAP) not looking good enough compared to the expectation? What data type is that? (I am not sure about strand-specific chromatin accessibility for instance. Is it a fragments file with strand information?) What features are used to aggregate fragments here (e.g. full gene bodies, gene starts, etc.)?
More importantly, this might be a good start for a discussion of counting strategies. Would you want to count cut sites or fragments? If the latter, what constitutes a fragment belonging to a region (full overlap, partial overlap)?
Might be that the way you count fragments for the second plot is different from what's been currently implemented. I'll try to shed light on the latter (I'll use barcodes and a fragments file from here):
❯ zcat < atac_fragments.tsv.gz | head -4
chr1 10066 10478 TCAAGAACAGTAATAG-1 1
chr1 10072 10191 AACCCGCAGGTAGCTT-1 1
chr1 10073 10340 CGCATATAGGTTACGT-1 2
chr1 10079 10192 GTAAGGTCACCCACAG-1 1
import pandas as pd
from muon import AnnData, atac as ac
obs = pd.read_csv("filtered_feature_bc_matrix/barcodes.tsv.gz", header=None).set_index(0)
adata = AnnData(obs=obs)
adata.uns['files'] = {'fragments': 'atac_fragments.tsv.gz'}
features = pd.DataFrame({"Chromosome": ["chr1"], "Start": [10_000], "End": [10_070]})
ac.tl.count_fragments_features(adata, features=features).X.sum()
# => 1.0
features = pd.DataFrame({"Chromosome": ["chr1"], "Start": [10_000], "End": [10_078]})
ac.tl.count_fragments_features(adata, features=features).X.sum()
# => 4.0
Hi @gtca,
yep its mainly with the gene activity matrix. I tried using + and - separately with muon atac and it didnt manage to work. My guess is with the original use of sparse lil matrix, I am not sure if it sums up the fragments of each cell correclty (I am not familiar with the use of lil matrix though). The data is the pbmc 10k which is the one you are using. I used the original settings which is gene body + 2k and the 10x gtf file to define the gene intervals
I dont have much opinion on the cutting strategy, but I dont think that any packages implemented cut-site fragment counting yet, so I would be interested to see how it performs if it is an option too.
I changed it something along these lines and it worked for me. (at least the gene activity look similar to the one of seurat gene activity which also count fragment + 2000 promoter length)
` score = [] cell_idx = [] feature_idx = []
for i in range(n_features):
# cell_count ={k: v for k, v in zip(range(atac.shape[0]), np.repeat(0, atac.shape[0]))}
f = df.iloc[i, :]
# Swap downstream and upstream if strand is negative
if f.strand == "-":
extend_upstream, extend_downstream = extend_downstream, extend_upstream
tss = f.end
else:
tss = f.start
for fr in tbx.fetch(
f.seqname,
f.start - extend_upstream,
f.end + extend_downstream,
multiple_iterators=False,
):
try:
cell_idx.append(d[fr.name])
score.append(fr.score)
feature_idx.append(i)
except:
pass
tbx.close()
mx = coo_matrix((score, (cell_idx, feature_idx)), shape=(n, n_features))
mx = mx.tocsr()`
Hi there,
I think that there is some bug in the count_fragments_features atac features. The gene activity I created from those are very bad looking. This is issue is separate from the other issue where the strand issue is considered. I think the issue might have been with the use of the lil matrix.
The first is with the muon function ,while the second picture is the one which i construct it with a coo matrix(with the strand fix).