Beneficial to pool .bam files from seperate 10x single-cell lanes instead of demultilplexing seperately?

koefoeden commented 1 year ago

Hi - thanks for a nice tool! I have a quick question regarding the best approach for using cellsnp/vireo in our experimental setup: We have generated data from 8 separate 10x Genomics single-cell reactions, which are all made from pools of the same 8 donors.

My idea is that by pooling the .bam files (and making sure that the barcodes are differentiated appropriately to avoid clashing) before demultiplexing, the demultiplexing will have access to 8x sequencing depth per individual. However, the sequencing depth per cell will of course stay the same, as we have 8x reads and 8x cells. My question is thus whether there is some kind of information sharing between the cells, which enables a better demultiplexing from merely having more cells from the same number of possible donors.

I hope my question makes sense!

huangyh09 commented 1 year ago

Hi,

Good idea - it may enhance the estimation of the donors' genotype and link them across samples. On the other hand, as you mentioned the coverage for each cell won't change; therefore, it is usually unnecessary to pool multiple samples if the n_cells in each pool is reasonable (e.g., >3K).

Yuanhua

koefoeden commented 1 year ago

Yes, so that was my thought - I have prepared the pooled .bam-files, so will go ahead and try it and let you know the results. Do you have a feeling which mode of cellsnp-lite would yield the most improvement when doing this, i.e. SNP-based or chromosome-pileup? And how the --forcelearnGT parameter in Vireo might play in here?

mdmanurung commented 10 months ago

@koefoeden Did it go well? Do you have an example script to merge the bam files and rewriting the barcode IDs? Thanks in advance!

koefoeden commented 10 months ago

@mdmanurung

It did not seem to change the results a lot, so we never went further with it. But in case you want to try it out, here are the two scripts I wrote.

The first uses awk, sed and samtools to modify the barcodes.tsv.gz and atac_possorted_bam.bam files (and save them as new files) in the cellranger-arc output directory, i.e. argument 1, to instead be suffixed by the given $number parameter, i.e. argument 2, to avoid barcode clashing:

#!/bin/bash
subdir=$1
subdir_base_name=$(basename $subdir)
number=$2

# **** 1) Edit the barcodes.tsv.gz file ****
echo "Processing barcodes.tsv.gz file for $subdir"
gzip -dc "$subdir/outs/filtered_feature_bc_matrix/barcodes.tsv.gz" | \
awk -v num="$number" '{split($0, a, "-"); print a[1]"-"num}' | \
gzip > "$subdir/outs/filtered_feature_bc_matrix/barcodes_modified.tsv.gz"

# **** 2) Edit the atac_possorted_bam.bam file ****
echo "Processing atac_possorted_bam.bam file for $subdir"
samtools view -h "$subdir/outs/atac_possorted_bam.bam" | \
sed -E "s/(CB:Z:[A-Z]{16}-)1/\1${number}/g" | \
samtools view -b -o "$subdir/outs/atac_possorted_bam_modified.bam" -
echo "Everything completed for $subdir"

The second script basically merges the modified barcode and bam files, assuming that each cellranger-output directoy is labelled 1, 2, 3, 4, ,5, 6, 7, 8. in $cellranger_outs_dir, i.e. the first argument.

cellranger_outs_dir=$1
combined_out_dir=$2

mkdir -p ${combined_out_dir}

# Concatenate barcodes.tsv.gz files
 zcat "$cellranger_outs_dir"/{1..8}/outs/filtered_feature_bc_matrix/barcodes_modified.tsv.gz > ${combined_out_dir}/barcodes.tsv
gzip ${combined_out_dir}/barcodes.tsv

# Concatenate atac_possorted_bam.bam files
samtools merge ${cellranger_outs_dir}/{1..8}/outs/atac_possorted_bam_modified.bam \
-o ${combined_out_dir}/atac_possorted_bam.bam \
--write-index

And then finally it should be possible to use Vireo/cellSNP as normal on these files.

ghuls commented 9 months ago

If you want something faster, you can use bam_update_cell_barcode from https://github.com/aertslab/single_cell_toolkit/:

git clone https://github.com/aertslab/single_cell_toolkit/
cd single_cell_toolkit/rust

cargo build --bin bam_update_cell_barcode

./bam_update_cell_barcode 
Usage: bam_update_cell_barcode input.bam output.bam barcode_mapping.tsv old_bc_tag new_bc_tag

Where the barcode_mapping.tsv looks like this (old CB "\t" new CB):
AAACCTGAGAAACCAT-1      AAACCTGAGAAACCAT-2
AAACCTGAGAAACCGC-1      AAACCTGAGAAACCGC-2
AAACCTGAGAAACCTA-1      AAACCTGAGAAACCTA-2
AAACCTGAGAAACGAG-1      AAACCTGAGAAACGAG-2
AAACCTGAGAAACGCC-1      AAACCTGAGAAACGCC-2
AAACCTGAGAAAGTGG-1      AAACCTGAGAAAGTGG-2
AAACCTGAGAACAACT-1      AAACCTGAGAACAACT-2
AAACCTGAGAACAATC-1      AAACCTGAGAACAATC-2
AAACCTGAGAACTCGG-1      AAACCTGAGAACTCGG-2
AAACCTGAGAACTGTA-1      AAACCTGAGAACTGTA-2

And old_bc_tag: "CB" (or whatever tag contains the current barcode.
And new_bc_tag the tag to which you want to write the new barcode: e.g.: "CB" to overwrite, or "de" to write the new barcode to a new tag.

Changing the previous script a bit:

#!/bin/bash
subdir=$1
subdir_base_name=$(basename $subdir)
number=$2

# **** 1) Edit the barcodes.tsv.gz file ****
echo "Processing barcodes.tsv.gz file for $subdir"
gzip -dc "$subdir/outs/filtered_feature_bc_matrix/barcodes.tsv.gz" | \
awk -v num="$number" '{split($0, a, "-"); print a[1]"-"num}' | \
gzip > "$subdir/outs/filtered_feature_bc_matrix/barcodes_modified.tsv.gz"

# **** 1) Create the barcode_mapping.tsv file ****
paste \
    <(gzip -dc "$subdir/outs/filtered_feature_bc_matrix/barcodes.tsv.gz") \
    <(gzip -dc "$subdir/outs/filtered_feature_bc_matrix/barcodes_modified.tsv.gz") \
 > "$subdir/outs/filtered_feature_bc_matrix/barcodes_mapping.tsv"

# **** 3) Edit the atac_possorted_bam.bam file with bam_update_cell_barcode ****
echo "Processing atac_possorted_bam.bam file for $subdir"

bam_update_cell_barcode \
    "$subdir/outs/atac_possorted_bam.bam" \
    "$subdir/outs/atac_possorted_bam_modified.bam" \
    "$subdir/outs/filtered_feature_bc_matrix/barcodes_mapping.tsv" \
    CB \
    CB

echo "Everything completed for $subdir"

Modifying cell barcodes in a 6.3G BAM file took 164 seconds

koefoeden commented 9 months ago

@ghuls

ah, great - thanks for the tip! :)

mdmanurung commented 9 months ago

Thanks all for the tip!

single-cell-genetics / vireo

Beneficial to pool .bam files from seperate 10x single-cell lanes instead of demultilplexing seperately? #89