Coverage Plots using 10X data and a custom genome

genomaxx commented 4 years ago

Hello there,

I am not sure if this will eventually progress into a bug report (perhaps I have just misunderstood some usage details) but I have elected to list this as an analysis question for now.

Background I have two samples of 10x scATACseq data that were aligned against a custom reference genome (two added tracks for inserted constructs) using cellranger. These were subsequently combined using the aggr command from cellranger and so the usual suspects are missing from the metadata (enhancer region fragments, promoter region fragments etc.).

Issue When I try to run this sample through the pmbc vignette workflow with all the default values unchanged (I don't however subset the seurat object for qc) I am unable to create a coverage plot at the end. If I substitute in the pmbc data, it works no problem. The specific issue with the plot is that the tracks are empty, no peaks anywhere. I have tried: A. Extending and reducing the upstream and downstream parameters B. Changing the downsampling (0.1-1) C. Not assigning the 'peaks' parameter and instead relying on the fragments file D. Searching for many different regions (some determined Differentially Accessible from a previous step in the workflow, and some not) E. Asking for tracks using the 'idents' parameter or using the 'group.by' parameter (after adding some metadata to group by)

Here is one of the many calls I have tried that works with the pbmc dataset but not with my own:

srObj can be substituted with pmbc

p <- CoveragePlot(
  object = srObj,
  region = c("chr6-157099063-157531913"),
  assay = 'peaks',
  sep = c(":", "-"),
  annotation = genes(EnsDb.Hsapiens.v75),
  downsample = 0.3,
  peaks = StringToGRanges(rownames(srObj), sep = c(":", "-")),
  extend.upstream = 5000,
  extend.downstream = 5000,
  idents = c('1', '3', '5'),
  # group.by = 'Library_ID',
  # fragment.path = '../data/atacRaw/DH21_P5_pair/fragments.tsv.gz',
  ncol = 1 
)

I know that there are peaks present in these regions I test, by simply indexing the seurat object for those regions and viewing the counts, or by using the cloupe browser provided by 10X.

Is there a simple way for me to test what piece of the coverage plots function is failing?

If there is anything else that I can provide that would help in identifying the issue I would be glad to do so. Thanks a bunch!

timoast commented 4 years ago

I think the issue could be that you need to set sep = c("-", "-") here (leave it as the default). These are the separators used to decode the region that you specify rather than the peaks in the object

genomaxx commented 4 years ago

I tried changing the sep value back and forth between "-", "-" and ":", "-". It doesn't make much difference to either the pbmc dataset (works with both) or my own dataset (same issue with either). I have attached an image of what the plot looks like.

pik3ca_coverage_error

timoast commented 4 years ago

There's probably a mismatch between the cell names in the Seurat object and the cell names in the fragment file. Are you using Read10X to read the count matrix, rather than Read10X_h5? Read10X will remove the trailing -1 if it's present in all cells which can cause problems for scATAC-seq (https://github.com/timoast/signac/issues/128).

Can you verify that the cell names are the same in the object and in the fragment file by looking at part of the fragment file (gzip -d fragments.tsv.gz | head) and comparing with the cell names in the Seurat object?

genomaxx commented 4 years ago

In my case I have a mix of cells ending in -1 and -2 because two samples were combined using the aggr function from 10X. Therefore I don't believe those suffixes are ever dropped. I double checked though, and I can consistently find cells from my seurat object in the fragments file. cmdline_output.txt

timoast commented 4 years ago

Is your file block gzip compressed and indexed?

genomaxx commented 4 years ago

Yes, I still have the fragments.tsv.gz.tbi file. I gunzip'd the file to check the contents. But I think I may have zipped it previously using gzip rather than bgzip. I have now compressed the file using bgzip but a new warning now appears while running the CoveragePlot function among others: [W::hts_idx_load2] The index file is older than the data file: Do you know if this is likely to be a problem? Also, same result after block gzip, I get an empty plot.

timoast commented 4 years ago

Yeah if you unzip it you'll probably need to re-index it after re-bgzipping it. You can run:

bgzip -@ 4 fragments.tsv
tabix -p bed fragments.tsv.gz

genomaxx commented 4 years ago

After re-indexing, the CoveragePlot function now works. Thanks for your help!

TLDR for all in future: It requires that the fragments file be block gzip compressed and indexed.

@timoast Do you think it would make sense for this behaviour to generate an error message relating to fragment files rather than an empty plot?

timoast commented 4 years ago

Yes it would be good to have an informative error message here, I'll add something in the future release

stuart-lab / signac

Coverage Plots using 10X data and a custom genome #138