luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

Feature request: single cell calling on a single bam? #165

Closed cnk113 closed 3 years ago

cnk113 commented 3 years ago

Hello,

I was wondering if it was possible to call somatic variants using a single bam files which would be split based on a cell barcode tag? It's a rather time consuming hurdle to split bam files into different cells especially when you have over hundreds/thousands of cells with decent coverage. Hopefully this wouldn't be too hard to implement?

Best, Chang

dancooke commented 3 years ago

You don't need to actually split the BAM - Octopus can handle multi-sample BAM files. However, samples needs to be indicated with the appropriate tag, SM in a read group. I would expect that there are other tools that can add these tags to a BAM from single-cell barcodes.

cnk113 commented 3 years ago

Ah didn't know that the multi-sample also applied to the single cell caller, and generally the read group for single cell assays use CB as the read group, but I can easily fix that. Also, for the parameter --max-phylogeny-size if I decide to increase that how much would this effect runtime in respect to the number of cells?

dancooke commented 3 years ago

Also, for the parameter --max-phylogeny-size if I decide to increase that how much would this effect runtime in respect to the number of cells?

I assume that you mean --max-clones? I'd expect this to have a larger impact the more cells that you have. The reason being that this a) increases the chance of a true clone being present. b) increases the chance of spurious clones due to amplification artefacts. Of course, if --max-clones is less than the true number of detectable clones at a locus, then you'll have false calls.

Just to confirm, this is DNA single-cell sequencing data? I'm currently working more on the cell calling model but haven't tested performance on thousands of cells.

cnk113 commented 3 years ago

Yeah, for some reason when I look at the arguments in my CLI (conda install) it says max-phylogeny-size. I'm using deeply sequenced scATAC-seq data so the genome coverage is much less than scWGS. Depth wise I'm getting ~50X per cell in a 5000 cell dataset. I'm planning on comparing the phylogeny with the mitochondrial lineage data in the same cells for validation.

dancooke commented 3 years ago

Ah, the Bioconda version (v0.6.3b) is very old. I wouldn't even bother trying the cell calling model in that version as it was just a prototype at that stage. I'm hoping to update the Bioconda version soon.

But even with the latest version, I think 5000 50x cells is going to be challenging - I haven't tested anywhere near that amount of cells and the model was primarily aimed at (MDA) scWGA data. Don't be surprised if it doesn't work out of the box :)

cnk113 commented 3 years ago

So I just finished my run, however I'm confused, it says it only detects one sample?

[2021-04-14 15:02:10] <WARN> The cell calling model is still in development. Do not use for production work!
[2021-04-14 15:02:10] <INFO> Done initialising calling components in 14m 5s
[2021-04-14 15:02:10] <INFO> Detected 1 sample: "mg_cellranger"
[2021-04-14 15:02:10] <INFO> Invoked calling model: cell

Does that mean it considered only 1 "cell" for the bam? The bam file has the SM tag with the cell IDs.

dancooke commented 3 years ago

Can you post a subset of the BAM file including the header?

cnk113 commented 3 years ago

sub2.sam.gz Hopefully this works, it's in sam, you may(?) run into an EOF error but that should easily be fixable.

Thanks, Chang

dancooke commented 3 years ago
@RG     ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:1   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:1   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:2   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:2   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:3   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:3   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:4   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:4   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:1   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:1   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:2   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:2   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:3   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:3   PL:ILLUMINA
@RG     ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:4   SM:mg_cellranger        LB:MissingLibrary.1     PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:4   PL:ILLUMINA

All of the sample IDs (SM) are the same - you need unique sample IDs for each sample

cnk113 commented 3 years ago

Ooops, I accidentally added SM tags on the actual reads. I'll rename RG IDs and add SM tag in the correct place. Will update how it goes in a few days. Thanks!