Closed cnk113 closed 3 years ago
You don't need to actually split the BAM - Octopus can handle multi-sample BAM files. However, samples needs to be indicated with the appropriate tag, SM
in a read group. I would expect that there are other tools that can add these tags to a BAM from single-cell barcodes.
Ah didn't know that the multi-sample also applied to the single cell caller, and generally the read group for single cell assays use CB as the read group, but I can easily fix that. Also, for the parameter --max-phylogeny-size if I decide to increase that how much would this effect runtime in respect to the number of cells?
Also, for the parameter --max-phylogeny-size if I decide to increase that how much would this effect runtime in respect to the number of cells?
I assume that you mean --max-clones
? I'd expect this to have a larger impact the more cells that you have. The reason being that this a) increases the chance of a true clone being present. b) increases the chance of spurious clones due to amplification artefacts. Of course, if --max-clones
is less than the true number of detectable clones at a locus, then you'll have false calls.
Just to confirm, this is DNA single-cell sequencing data? I'm currently working more on the cell
calling model but haven't tested performance on thousands of cells.
Yeah, for some reason when I look at the arguments in my CLI (conda install) it says max-phylogeny-size. I'm using deeply sequenced scATAC-seq data so the genome coverage is much less than scWGS. Depth wise I'm getting ~50X per cell in a 5000 cell dataset. I'm planning on comparing the phylogeny with the mitochondrial lineage data in the same cells for validation.
Ah, the Bioconda version (v0.6.3b) is very old. I wouldn't even bother trying the cell
calling model in that version as it was just a prototype at that stage. I'm hoping to update the Bioconda version soon.
But even with the latest version, I think 5000 50x cells is going to be challenging - I haven't tested anywhere near that amount of cells and the model was primarily aimed at (MDA) scWGA data. Don't be surprised if it doesn't work out of the box :)
So I just finished my run, however I'm confused, it says it only detects one sample?
[2021-04-14 15:02:10] <WARN> The cell calling model is still in development. Do not use for production work!
[2021-04-14 15:02:10] <INFO> Done initialising calling components in 14m 5s
[2021-04-14 15:02:10] <INFO> Detected 1 sample: "mg_cellranger"
[2021-04-14 15:02:10] <INFO> Invoked calling model: cell
Does that mean it considered only 1 "cell" for the bam? The bam file has the SM tag with the cell IDs.
Can you post a subset of the BAM file including the header?
sub2.sam.gz Hopefully this works, it's in sam, you may(?) run into an EOF error but that should easily be fixable.
Thanks, Chang
@RG ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:1 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:1 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:2 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:2 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:3 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:3 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HVKCNBGXF:4 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HVKCNBGXF:4 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:1 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:1 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:2 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:2 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:3 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:3 PL:ILLUMINA
@RG ID:mg_cellranger:MissingLibrary:1:HLYFCDSXY:4 SM:mg_cellranger LB:MissingLibrary.1 PU:mg_cellranger:MissingLibrary:1:HLYFCDSXY:4 PL:ILLUMINA
All of the sample IDs (SM
) are the same - you need unique sample IDs for each sample
Ooops, I accidentally added SM tags on the actual reads. I'll rename RG IDs and add SM tag in the correct place. Will update how it goes in a few days. Thanks!
Hello,
I was wondering if it was possible to call somatic variants using a single bam files which would be split based on a cell barcode tag? It's a rather time consuming hurdle to split bam files into different cells especially when you have over hundreds/thousands of cells with decent coverage. Hopefully this wouldn't be too hard to implement?
Best, Chang