Preparing inputs for bulk WGS + scATACseq

Hi there, Thanks so much for creating this package. I'm in the processing for trying it for one of my samples. I have bulk WGS and 10x scATACseq (also paired scRNAseq; outputs generated by cellranger-ARC count) datasets for this.

I used Vartrix to generate the SNP by cell matrices $VARTRIX -v $GERMLINE_VCF -b $SCATAC_BAM -f $ARC_REF -c $BARCODE --out-matrix $OUT_MTX --ref-matrix $REF_MTX -s "coverage" --threads 12

I then loaded them in R:

barcodes=read.table("Data/ArchR/scRNA/raw_feature_bc_matrix/E34/barcodes.tsv.gz", sep='\t', stringsAsFactors = F, header=F)
alt_all=readMM("Data/Alleloscope/SNPs/E34.alt_matrix.mtx")
ref_all=readMM("Data/Alleloscope/SNPs/E34.ref_matrix.mtx")
var_all=read.table("Data/Alleloscope/SNPs/E34.germline.vcf", header = F, sep='\t', stringsAsFactors = F)

>head(barcodes)
                  V1
1 AAACAGCCAAACAACA-1
2 AAACAGCCAAACATAG-1
3 AAACAGCCAAACCCTA-1
4 AAACAGCCAAACCTAT-1
5 AAACAGCCAAACCTTG-1
6 AAACAGCCAAACGCGA-1

> head(alt_all)
6 x 731466 sparse Matrix of class "dgTMatrix"

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[2,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[5,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 .....suppressing 731421 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................

> head(ref_all)
6 x 731466 sparse Matrix of class "dgTMatrix"

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[2,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[5,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 .....suppressing 731421 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................

> head(var_all)
    V1    V2           V3 V4 V5     V6   V7
1 chr1 13813 rs1213979446  T  G  85.60 PASS
2 chr1 13838   rs28428499  C  T  70.60 PASS
3 chr1 14464  rs546169444  A  T 405.60 PASS
4 chr1 14653   rs62635297  C  T  51.60 PASS
5 chr1 14671  rs201055865  G  C 414.60 PASS
6 chr1 15484 rs1236891946  G  A  73.60 PASS
                                                                                                                                                                                                                                                V8
1     AC=1;AF=0.5;AN=2;BaseQRankSum=-2.92;ClippingRankSum=0.19;DP=29;ExcessHet=3.0103;FS=22.753;MLEAC=1;MLEAF=0.5;MMQ=23,22;MQ=28.37;MQ0=0;MQRankSum=-1.901;NEGATIVE_TRAIN_SITE;QD=2.95;ReadPosRankSum=1.71;SOR=3.222;VQSLOD=-4.590e+00;culprit=QD
2 AC=1;AF=0.5;AN=2;BaseQRankSum=-0.199;ClippingRankSum=-0.083;DP=33;ExcessHet=3.0103;FS=27.668;MLEAC=1;MLEAF=0.5;MMQ=25,22;MQ=29.02;MQ0=0;MQRankSum=-2.147;NEGATIVE_TRAIN_SITE;QD=2.14;ReadPosRankSum=-0.276;SOR=3.22;VQSLOD=-4.602e+00;culprit=QD
3                                 AC=1;AF=0.5;AN=2;BaseQRankSum=-1.082;ClippingRankSum=-1.489;DP=75;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MMQ=24,40;MQ=31.63;MQ0=0;MQRankSum=0.981;QD=5.41;ReadPosRankSum=1.841;SOR=0.72;VQSLOD=0.949;culprit=QD
4   AC=1;AF=0.5;AN=2;BaseQRankSum=3.868;ClippingRankSum=-0.76;DP=73;ExcessHet=3.0103;FS=5.444;MLEAC=1;MLEAF=0.5;MMQ=24,24;MQ=27.81;MQ0=0;MQRankSum=-0.967;NEGATIVE_TRAIN_SITE;QD=0.71;ReadPosRankSum=-0.176;SOR=1.781;VQSLOD=-2.517e+00;culprit=QD
5    AC=1;AF=0.5;AN=2;BaseQRankSum=-4.74;ClippingRankSum=0.375;DP=87;ExcessHet=3.0103;FS=6.112;MLEAC=1;MLEAF=0.5;MMQ=24,24;MQ=26.58;MQ0=0;MQRankSum=0.028;NEGATIVE_TRAIN_SITE;QD=4.77;ReadPosRankSum=-0.486;SOR=0.276;VQSLOD=-2.118e+00;culprit=QD
6         AC=1;AF=0.5;AN=2;BaseQRankSum=0.5;ClippingRankSum=-1.245;DP=18;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MMQ=25,22;MQ=23.13;MQ0=0;MQRankSum=-1.002;NEGATIVE_TRAIN_SITE;QD=4.09;ReadPosRankSum=2.022;SOR=0.169;VQSLOD=-6.426e+00;culprit=QD
              V9                        V10
1 GT:AD:DP:GQ:PL    0/1:25,4:29:93:93,0,988
2 GT:AD:DP:GQ:PL   0/1:29,4:33:78:78,0,1141
3 GT:AD:DP:GQ:PL 0/1:56,19:75:99:413,0,1631
4 GT:AD:DP:GQ:PL   0/1:64,9:73:59:59,0,1360
5 GT:AD:DP:GQ:PL 0/1:59,28:87:99:422,0,1484
6 GT:AD:DP:GQ:PL    0/1:13,5:18:81:81,0,308

I created an Alleloscope object:

dir_path <- "Data/Alleloscope/output"
dir.create(dir_path)

Obj=Createobj(alt_all =alt_all, 
              ref_all = ref_all, 
              var_all = var_all ,
              samplename='E34', 
              genome_assembly="GRCh38", 
              dir_path=dir_path, 
              barcodes=barcodes, 
              size=size, 
              assay='scATACseq')

However when I tried to filter out cells and SNPs with few read counts i got an error:

> Obj_filtered=Matrix_filter(Obj=Obj, cell_filter=5, SNP_filter=5, min_vaf = 0.1, max_vaf = 0.9) 
31845 cells after filtering.
238654 SNPs after filtering.
(Recommend more than 1,000,000 SNPs for all chromosomes)
Object successfully filterd!
Plots for statistics have been saved in the path:Data/Alleloscope/outputplots/
Warning messages:
1: In order(as.numeric(as.character(var_list[, 1])), as.numeric(as.character(var_list[,  :
  NAs introduced by coercion
2: In Matrix_filter(Obj = Obj, cell_filter = 5, SNP_filter = 5, min_vaf = 0.1,  :
  NAs introduced by coercion
3: In var_pos + size_cs_rep :
  longer object length is not a multiple of shorter object length

Not sure what's causing the error, could you please have a look? I could send across example files if needed. Thanks! A

Hello, thanks for you interest in Alleloscope!

I think it is the problem for plotting. One possibility is that your alt_all/ref_all/var_all files have the SNPs not on chr1-22. Could you check if your var_all file using this code? table(var_all[,1])

If it did include chromosomes other than chr1~22, could you filter the matrix using the following codes and try if you still get the error? ind <- which(var_all[,1] %in% paste0('chr', 1:22)) alt_all=alt_all[ind,] ref_all=ref_all[ind,] var_all=var_all[ind,]

If this does not solve the issue, could you send the example files? I can help try this from my end.

Thanks! That seems to work! I'm running this now and i'm guessing that it will take a while to finish.

> Obj_filtered=Est_regions(Obj_filtered = Obj_filtered, max_nSNP = 30000, plot_stat = T,cont = FALSE)
Estimation starts.
1:0 1:145600000

If you don't mind if I ask you another question while we are here. I'm trying to create the bin x cell matrix using https://github.com/seasoncloud/Basic_CNV_SNV/blob/main/scripts/Gen_bin_cell_atac.R

I've generated the fixed 200kb hg38 bin:

> query
GRanges object with 14385 ranges and 0 metadata columns:
          seqnames            ranges strand
             <Rle>         <IRanges>  <Rle>
      [1]     chr1          1-200000      *
      [2]     chr1     200001-400000      *
      [3]     chr1     400001-600000      *
      [4]     chr1     600001-800000      *
      [5]     chr1    800001-1000000      *
      ...      ...               ...    ...
  [14381]    chr22 50000001-50200000      *
  [14382]    chr22 50200001-50400000      *
  [14383]    chr22 50400001-50600000      *
  [14384]    chr22 50600001-50800000      *
  [14385]    chr22 50800001-50818468      *

Here's some details about my files:

> head(barcodes)
[1] "AAACAGCCAAACAACA-1" "AAACAGCCAAACATAG-1" "AAACAGCCAAACCCTA-1" "AAACAGCCAAACCTAT-1"
[5] "AAACAGCCAAACCTTG-1" "AAACAGCCAAACGCGA-1"

> cat(paste0("Total ", length(fragments)," fragments.\n"))
Total 146782646 fragments.

> cat(paste0("Total ", length(fragments_incell)," fragments in cell.\n")) 
Total 146782646 fragments in cell.

I used the script you created Gen_bin_cell_atac.R but it returned an error, suggesting that the table is too large.

> mm=table(ov[,1],ov[,3])
Error in base::table(...) : attempt to make a table with >= 2^31 elements

I suspect that because my file is too large, aka there's too many fragments.

Here's links to my files: fragment: https://drive.google.com/file/d/1T9JRa5J7yORtsdv1v768ycmiFyao4THd/view?usp=sharing index: https://drive.google.com/file/d/1gY-xTr_Amt5v5j1iM9XUgFlALTzv0heX/view?usp=sharing barcodes: https://drive.google.com/file/d/1rRCEtjtYFw2mSd72j7zbpsbeWkqjYpc_/view?usp=sharing

I wonder if https://satijalab.org/signac/reference/genomebinmatrix gives the same output format as needed for Alleloscope?

Yeah possibly. I think the table function I used in this helper function cannot handle this large file.

I just checked the barcode files. There are 731466 barcodes. I typically use the filtered barcode files (filtering out barcodes that might not be real cells) to have about several thousands of barcodes per sample. There is a "filtered_peak_bc_matrix" output folder from cellranger. Perhaps you could try using the barcodes.tsv file from this?

I haven't used 'GenomeBinMatrix' before, but the output should be similar at a first glance. Alleloscope just requires a bin by cell matrix with row name in this format "chr1-1-20000" and the order of the columns (Each column is a cell.) should be the same as that in the barcodes.tsv.

Alternatively, you could use the "Gen_bin_cell_atac.R" to generate the matrix for each chromosome, and then match the columns by the barcode names. I can help with this if needed, but might take some time.

Thanks for the offer! For now I've managed to use https://satijalab.org/signac/reference/genomebinmatrix to create a bin x cell matrix.

> head(raw_counts)
6 x 571152 sparse Matrix of class "dgCMatrix"
   [[ suppressing 45 column names ‘GATCAGTTCCTGGTGA-1’, ‘TTTGACTTCATTGTCT-1’, ‘AGGATTGAGTAGGATG-1’ ... ]]

chr1-1-200000        1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1
chr1-200001-400000   . . . . . . . . . . . . . . . .  . . . . . . . . . . . . . . . . . . . . . . . .
chr1-400001-600000   . . . . . . . . . . . . . . . .  . . . . 1 . . . . . . . . . . . . . . . . . . .
chr1-600001-800000   . . . . . . . . . . . . . . . .  . 1 . . 2 . . 1 1 . . . . . . . . . . . . 1 . 1
chr1-800001-1000000  . 4 3 . 4 5 . 4 . . . 4 3 1 . 5  8 1 . 1 3 . . 1 3 1 5 . 1 1 . 1 1 4 . 1 5 1 . 7
chr1-1000001-1200000 3 5 5 2 1 4 3 3 . . 2 7 2 1 7 5 15 3 2 4 3 4 1 3 4 6 3 . . 8 1 1 3 1 3 3 6 3 . 4

chr1-1-200000        1 1 1 1  1 ......
chr1-200001-400000   . . . .  . ......
chr1-400001-600000   . . . .  . ......
chr1-600001-800000   . . . 1  1 ......
chr1-800001-1000000  . 2 . 1  8 ......
chr1-1000001-1200000 . 3 2 3 12 ......

 .....suppressing 571107 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
>

But noticed that this format is different that in your example "chr200k_fragments_sub.txt". I tried converting it to matrix but i guess again the size is too large. Next step for me is to use the filtered barcode file as you said.

On another note, is cell type annotation required? My sample mainly consists of cancer cells with no "normal cells", I am not sure how to proceed with

Obj_filtered$select_normal$barcode_normal=cell_type[which(cell_type[,2]!='tumor'),1]

and

Obj_filtered=Genotype_value(Obj_filtered = Obj_filtered, type='tumor', raw_counts=raw_counts, cov_adj=1) # for tumor

I tried using without "type" option

> Obj_filtered=Genotype_value(Obj_filtered = Obj_filtered, raw_counts=raw_counts, cov_adj=1) 
Start estimating cell specific (rho_hat, theta_hat) for each region.
Error in intI(j, n = d[2], dn[[2]], give.dn = FALSE) : 
  index larger than maximal 571152

Hello, Yes I would suggest using the filtered barcodes since many of the barcodes in the raw data are not real cells.

As far as I know, most copy number detection methods require some cells to be used as control. You could use typical scATAC-seq processing pipeline to see if you can find some normal cells. You don't need to specify all the normal cells. Only a few is enough.

Is your data a tumor cell line (so you said mainly consisting tumor cells)? If that is the case, are you able to add some normal cells from another dataset (with the same experimental process and data processing pipeline) as control?

Hi @seasoncloud, I am indeed using a cell line so I have no "contaminating" normal cells. I've found a control dataset here. These are processed with the same pipeline. How do I integrate this with the separate alleloscope steps?

I see. Yeah I think you could try using this public control dataset. An example for the scDNA-seq cell line dataset can be found here. This should be similar for scATAC-seq.

The related part in this page is

"Prepare input files"-- "Bin by cell (sparse) matrices for tumor and normal samples"--"Without paired normal sample, other normal samples aligned to the same reference genome (eg. GRCh38) also work if with matched bins.". So you will need to have a matrix with the same number of rows (bins) for the control dataset as that for the cell line dataset you generated using your method. (cell number does not need to be the same).
For "Step5. Genotype each cell in each region"-- "Estimate cell-specific (rho_hat, theta_hat) values for each region."--Obj_filtered=Genotype_value(Obj_filtered = Obj_filtered, type='cellline', raw_counts=raw_counts, ref_counts = ref_counts ) # for cell line without normal cells in the tumor sample. The bin by cell matrix for the control dataset (ref_counts) can be input here.

Hope this makes sense to you and let me know if you have question.

Hi @seasoncloud, Thank you for staying with me on this- I've gone ahead and tried a few things. Here's my workflow:

Prepare bin x cell matrices:

raw_counts <- readRDS("Alleloscope/E34_bin_cell_atac_fragments.rds")
ref_counts <- readRDS("Alleloscope/human_brain_3k_bin_cell_atac_fragments.rds")
ref_counts <- ref_counts[rownames(ref_counts) %in% rownames(raw_counts),]
raw_counts <- raw_counts[rownames(raw_counts) %in% rownames(ref_counts),] 
colnames(ref_counts) <- paste0("HB_", colnames(ref_counts))
colnames(raw_counts) <- paste0("E34_", colnames(raw_counts))

Read in SNP x cell matrices from Vartrix:

barcodes=read.table("Cellranger_arc/filtered_feature_bc_matrix/E34/barcodes.tsv.gz", sep='\t', stringsAsFactors = F, header=F)
barcodes <- as.data.frame(paste0("E34_",barcodes$V1))
alt_all=readMM("Alleloscope/SNPs/E34.alt_matrix.mtx")
ref_all=readMM("Alleloscope/SNPs/E34.ref_matrix.mtx")
var_all=read.table("Alleloscope/SNPs/E34.germline.vcf.gz", header = F, sep='\t', stringsAsFactors = F)

Create alleloscope object:

Obj=Createobj(alt_all =alt_all, 
              ref_all = ref_all, 
              var_all = var_all ,
              samplename='E34', 
              genome_assembly="GRCh38", 
              dir_path=dir_path, 
              barcodes=barcodes, 
              size=size, 
              assay='scATACseq',
              cov = FALSE)

Filter out cells/SNPs with few read counts:

Obj_filtered=Matrix_filter(Obj=Obj, cell_filter=5, SNP_filter=5, 
                           min_vaf = 0.1, max_vaf = 0.9,
                           plot_stat = T,
                           plot_vaf = T)

Segmentation using WGS tumour and blood control:

WGSt=read.table("Alleloscope/segs/E34.hg38.100Kb.windows.counts.tumor.bedg", stringsAsFactors = F, sep='\t') ## Load tumour seg
WGSn=read.table("Alleloscope/segs/E34.hg38.100Kb.windows.counts.germline.bedg", stringsAsFactors = F, sep='\t') ## Load normal seg
WGSt=WGSt[which(WGSt$V1 %in% paste0('chr',1:22)), ]
WGSn=WGSn[which(WGSn$V1 %in% paste0('chr',1:22)), ]
WGSt[,2]=as.numeric(WGSt[,2])+1; WGSt[,3]=as.numeric(WGSt[,3])+1 ## Switch to 1 based coordinate
WGSn[,2]=as.numeric(WGSn[,2])+1; WGSn[,3]=as.numeric(WGSn[,3])+1 ## Switch to 1 based coordinate
rownames(WGSn)=rownames(WGSt)=paste0(WGSt$V1,"-", WGSt$V2,'-', WGSt$V3)
WGSt=WGSt[,4, drop=F]; WGSn=WGSn[,4, drop=F] ## Only keep coverage value

seg=Createobj(samplename="E34", genome_assembly="GRCh38", dir_path="Alleloscope/segs/", size=size, assay='WGS', cov=TRUE)

seg=Segmentation(Obj_filtered=seg,
                          raw_counts=WGSt, # from matched DNA sequencing (bulk/single)
                          ref_counts=WGSn, # from matched DNA sequencing (bulk/single)
                          plot_seg = T, nmean = 10) ## nmean needs to be 10 to detect focal peak in chr12

Obj_filtered$seg_table <- seg$seg_table
Obj_filtered=Segments_filter(Obj_filtered=Obj_filtered, nSNP=500)

Estimate cell major haplotypes:

Obj_filtered=Est_regions(Obj_filtered = Obj_filtered, max_nSNP = 50000, plot_stat = T, cont = FALSE)

Assign normal region:

Obj_filtered$ref=Obj_filtered$seg_table_filtered$chrr[9] # choose one normal chromosome

Assign normal cells:

tumour_type <- cbind(colnames(raw_counts), "tumor")
normal_type <- cbind(colnames(ref_counts), "normal")
cell_type <- rbind(tumour_type,normal_type)
Obj_filtered$select_normal$barcode_normal <- cell_type[which(cell_type[,2] != "tumor"),1]

Genotype each cell in each region:

Obj_filtered=Genotype_value(Obj_filtered = Obj_filtered, 
                            type='cellline', 
                            raw_counts=raw_counts, 
                            ref_counts = ref_counts, 
                            mincell = 100, cov_adj =1, 
                            refr = TRUE) # for cell line without normal cells in the tumor sample.

Lineage plot:

linplot=Lineage_plot(Obj_filtered = Obj_filtered, nSNP = 1, all_chr = T, plot_conf = T, maxcp =12)`

The lineage plot seems to roughly mirror the segmentation copy number profile

But my concern is that it seems to only include very few number of cells.

> length(rownames(Obj_filtered$genotype_values))
[1] 17

Do you have any ideas as to why this occur and any suggestions to adjust any of the filtering used in any of the steps? I appreciate you looking into this! Btw i am particularly interested in cells with high copy number in a focal region in chr12 (you can see the spike in the bulk WGS segment profile).

Hello,

You could try removing the "mincell = 100" setting when running the Genotype_value function (to exclude regions which too fews cells have reads covering) to see if it solves the problem. So it looks like Obj_filtered=Genotype_value(Obj_filtered = Obj_filtered, type='cellline', raw_counts=raw_counts, ref_counts = ref_counts, cov_adj =1, refr = TRUE) # for cell line without normal cells in the tumor sample.

Another thing I want to mention about the lineage plot is-- As you can see in this page for scATAC-seq paired with WGS/WES, the Lineage_plot is not included. For the current version, Lineage_plot is mainly for scDNA-seq.

For looking at scATAC-seq allele-specific copy numbers, the most informative plots I would suggest to look at are the "gtype_scatter...pdf" (generated by the "Genotype" function) and "hierarchcial_clustering_theta.pdf" (generated by the "Select_normal" function) in the output/plots folder. The two pdf files look like these:

You could also perform downstream analysis (integrate with the peak signals) using the results from Alleloscope. Please see the "Potential downstream analysis" section in the [page].(https://github.com/seasoncloud/Alleloscope/tree/main/samples/SU008/scATAC)

Let me know if you have questions about them.

seasoncloud / Alleloscope

Preparing inputs for bulk WGS + scATACseq #4