shahcompbio / signals

Single Cell Genomes with Allele Specificity

Inputs for signals #38

Closed 899la closed 7 months ago

899la commented 1 year ago

Hi, thank you for creating this package.

I am trying to run it on our own exome data, but I am having trouble getting the haplotype counts by following steps 10 and 11 of the single-cell-pipeline install guide: https://github.com/shahcompbio/single_cell_pipeline/blob/master/docs/source/install.md. I was able to generate haplotypes in step 10, but step 11 gives empty counts for all of my cells. Do you have any suggestions?

I also tried the test data (step 11) provided in the single-cell-pipeline, and the pipeline seems to work fine there: I get counts for 3 cells (SA607_3X10XB02284-A108843A-R03-C08.bam, SA607_3X10XB02284-A108843A-R03-C09.bam, SA607_3X10XB02284-A108843A-R03-C10.bam) and an empty count for SA607_3X10XB02284-A108843A-R03-C03.bam, but I guess this is expected.

Thanks for your time! Ang

marcjwilliams1 commented 1 year ago

Hi @899la thanks for your interest in the package. First, I'd note that this was initially developed with shallow whole genome sequencing. Some aspects could be useful for whole exome but for me to point you in the right direction it would be useful to know how many cells you have, whether you have a matched bulk normal (or have sequenced normal cells), and what the typical depth of coverage per cell is?

899la commented 1 year ago

Hi @marcjwilliams1 thanks for the reply. The data is from this paper https://www.nature.com/articles/s41586-021-03357-x. More specifically, I am running on 907 cells from TN7 and a matched normal from which we generate the haplotypes. The mean depth of coverage per cell is 107×.

marcjwilliams1 commented 1 year ago

Thanks, I had thought the data from this paper was also shallow whole genome sequencing using a method called ACT?

Anyhow, if you have an average of 107X per cell, you should be able to pick up heterozygous SNPs. Probably the first thing I would do is look up a few of the het SNPs in IGV; that might help you figure out what the issue is.

One possibility is that if your matched normal is also an exome, then the haplotype phasing may not work very well. But given you have very high per-cell coverage, you likely don't need phasing to get reasonable copy number calls. If you want to try signals, I would recommend you try to just get counts for each heterozygous SNP. You could modify the pipeline here to use a VCF that includes your het SNPs: https://github.com/marcjwilliams1/hscn_pipeline/blob/main/rules/haplotyping.smk#L50

Finally, given your high depth of coverage, it might be worth trying (if you haven't already) some tools developed for copy number calling in bulk tumor samples.

leachim commented 1 year ago

Hi Marc, chiming in here, as I think Ang has confused a few things. The data is indeed ACT data, i.e. shallow whole genome sequencing with coverage comparable to DLP. However, we only have exome sequencing for the normal and the bulk, not whole genome sequencing, and no normal single cells. We would now like to run signals on these data, but Ang seemed to have an issue with many of the calls being empty. My suspicion is that this has to do with the exome sequencing for the normal. Do you have an intuition whether it's possible to use the exome normal (possibly by doing some form of imputation), or might there be other reasons?

marcjwilliams1 commented 1 year ago

Yes, I think that's right. If you only have an exome to identify SNPs, then you'll only have about 1% of the SNPs, so many cells won't have any coverage at that 1%. It might be possible to impute, but it's not something I've looked into.
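To make the coverage argument concrete, here is a back-of-the-envelope sketch. It assumes reads land on SNP positions independently (a Poisson model), so the chance that a SNP is covered by at least one read in a cell is 1 - exp(-depth). The function name, the genome-wide het SNP count, and the per-cell depth are illustrative assumptions, not values from this dataset; only the roughly 1% exome fraction comes from the comment above.

```python
import math


def expected_covered_snps(n_snps: float, mean_depth: float) -> float:
    """Expected number of SNPs with at least one read in a single cell.

    Assumes Poisson-distributed read depth with the given mean, so
    P(covered) = 1 - exp(-mean_depth).
    """
    p_covered = 1.0 - math.exp(-mean_depth)
    return n_snps * p_covered


# Hypothetical numbers, for illustration only.
genome_het_snps = 2_000_000   # assumed genome-wide het SNP count
exome_fraction = 0.01         # "about 1% of the SNPs" from the thread
per_cell_depth = 0.02         # assumed shallow per-cell coverage

wgs_covered = expected_covered_snps(genome_het_snps, per_cell_depth)
exome_covered = expected_covered_snps(genome_het_snps * exome_fraction, per_cell_depth)
print(f"covered het SNPs per cell, WGS normal:   {wgs_covered:.0f}")
print(f"covered het SNPs per cell, exome normal: {exome_covered:.0f}")
```

Under these assumptions an exome-derived SNP set leaves each cell with about a hundredth as many informative positions, which would be consistent with mostly empty counts.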

Another option is to try to call het SNPs from the tumor cells. That will be difficult or impossible in clonal LOH regions but should be feasible in other regions. You could do a pileup of all non-reference positions across a merged pseudobulk; the het SNPs should have an allelic fraction > 0 and < 1 (plus noise). If you try it, it would probably also be a good idea to overlap the calls with 1000 Genomes or some other database to reduce false positives.
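The pseudobulk filter described above could be sketched roughly as follows. This is only an illustration and not part of signals or the pipeline: `PileupSite`, `candidate_het_snps`, and the depth and allelic-fraction thresholds are hypothetical. The input is assumed to be per-position reference/alternate read counts already extracted from a pileup of the merged cells, plus a set of known SNP positions (for example from 1000 Genomes).

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class PileupSite(NamedTuple):
    """Read counts at one position of the merged pseudobulk (hypothetical record)."""
    chrom: str
    pos: int
    ref_count: int
    alt_count: int


def candidate_het_snps(
    sites: Iterable[PileupSite],
    known_snps: Set[Tuple[str, int]],
    min_depth: int = 10,    # assumed threshold, tune for your data
    min_vaf: float = 0.1,   # assumed lower bound, leaves room for noise
    max_vaf: float = 0.9,   # assumed upper bound, leaves room for noise
) -> List[PileupSite]:
    """Keep sites that look heterozygous and overlap a known-SNP database."""
    kept = []
    for site in sites:
        depth = site.ref_count + site.alt_count
        if depth == 0 or depth < min_depth:
            continue  # too little evidence to estimate an allelic fraction
        vaf = site.alt_count / depth
        if not (min_vaf <= vaf <= max_vaf):
            continue  # looks homozygous (or is in a clonal LOH region)
        if (site.chrom, site.pos) not in known_snps:
            continue  # not a known polymorphism, likely a false positive
        kept.append(site)
    return kept


# Toy input, made up for the example.
sites = [
    PileupSite("chr1", 100, 12, 10),  # balanced, known SNP
    PileupSite("chr1", 200, 30, 0),   # no alternate reads
    PileupSite("chr1", 300, 2, 2),    # too shallow
    PileupSite("chr2", 50, 8, 9),     # balanced, but not in the database
]
known = {("chr1", 100), ("chr1", 200), ("chr1", 300)}
result = candidate_het_snps(sites, known)
print([(s.chrom, s.pos) for s in result])
```

Sites in clonal LOH regions fail the allelic-fraction bounds by construction, so this kind of filter cannot recover het SNPs there, as noted above.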