shahcompbio / signals

Single Cell Genomes with Allele Specificity
Other
7 stars 1 forks source link

Generating SIGNALS input using other sequencing data #54

Open shngk opened 5 months ago

shngk commented 5 months ago

Hello, I am trying to generate the input for SIGNALS using hmmcopy and counthaps; we have single-end data and we have realized that our data doesn't work with your software, specifically for the count_haplotypes step of the pipeline. We tried modifying the bam comments and the sam flags to no avail. Do you have any recommendations on how to get this step to work or is there a good reason to not use your software on our data? Any advice is appreciated!

marcjwilliams1 commented 5 months ago

Hello, sorry for the slow reply. We have never worked with single-end data. Here are some suggestions:

  1. Try mondrian if you have been using single cell pipeline. If you get an error message you can try posting there.
  2. If you have inferred the haplotypes and just need the genotyping step you should be able to use any number of tools for this. For example I have some code to do this using cellSNP here
shngk commented 3 months ago

Thanks for these helpful suggestions. I'm trying to run SIGNALS on our input generated using mondrian on single-end data (where we converted from single-end to paired-end format), and I get an error when utilizing the callHaplotypeSpecificCN function where it fits the binomial model, I'm not sure if it's due to these format differences or some other error related to coverage, # of bins, etc.

Inferred mean: 0.42, Expected mean: 0.4, Inferred overdispersion (rho): 0
Tarones Z-score: -0.872, using binomial model for inference.
Using 7 cells for clustering...
Clustering chromosome 13
Creating CN matrix...
Calculating UMAP dimensionality reduction...
An error occurred in umap calculation: n_neighbors must be >= 2
Rerunning UMAP after adding small jitter to data points...
Error in uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,  : 
  n_neighbors must be >= 2
In addition: Warning messages:
1: In checkwz(wz, M, trace = trace, wzepsilon = control$wzepsilon) :
  5 diagonal elements of the working weights variable 'wz' have been replaced by 0.000000000001819
2: In checkwz(wz, M = M, trace = trace, wzepsilon = control$wzepsilon) :
  5 diagonal elements of the working weights variable 'wz' have been replaced by 0.000000000001819
3: In checkwz(wz, M = M, trace = trace, wzepsilon = control$wzepsilon) :
  3 diagonal elements of the working weights variable 'wz' have been replaced by 0.000000000001819
4: In checkwz(wz, M = M, trace = trace, wzepsilon = control$wzepsilon) :
  2 diagonal elements of the working weights variable 'wz' have been replaced by 0.000000000001819
5: In vchol(wz, M = M, n = n, silent = !trace) :
  weight matrix 1 not positive-definite

I also have input from another dataset using Illumina sequencing and when using the format_haplotypes_dlp function, all of the bins are removed/filtered:

Number of distinct bins in copy number data: 2288
Number of distinct bins in haplotype data: 197
Number of distinct bins in formatted haplotype data: 0 
marcjwilliams1 commented 2 months ago

For the first issue it looks like it's going wrong during the per chromosome clustering used to phase the haplotype blocks per chromosome. You can try setting cluster_per_chr=FALSE, to disable this. In general clustering per chromosome produces better results but this hopefully will get you some initial results to inspect.

For the second issue, look like most of the genome does not have SNP coverage (only 197 bins). Are you perhaps using exome sequencing for the germline SNP inference?