piyalkarum / rCNV

An R package for detecting copy number variants from SNPs data
https://piyalkarum.github.io/rCNV/
GNU Affero General Public License v3.0
5 stars 1 forks source link

Large difference between CNVs and deviants #4

Closed laneatmore closed 3 months ago

laneatmore commented 6 months ago

This is such a cool program! I am trying to use it to identify CNVs and paralogs for a salmon species and have run into a couple of strange things. The first is that it seems necessary to conduct filtering for biallelic loci prior to inputting into R. The program is a bit buggy if you try to do the maf filtering while these loci are included (NAs introduced by coercion).

Do you know what the influence of pre-filtering is? I have been running rCNV on a set of chromosomes that I first filtered for missingness, maf, and locus quality with vcftools, bcftools, and GATK. I would assume this shouldn't change the outcome drastically, as filtering is part of the rCNV pipeline.

That being said, I have run the following commands:

deviants<-dupGet(A.info,test = c("z.all","chi.all"),plot=TRUE,verbose = TRUE) CV<-cnv(A.info, test=c("z.all","chi.all"), filter = "kmeans", plot=TRUE)

(Using *.all because I have some reference bias)

The output of these two tests is wildly different:

deviant non-deviant 
      1       62392 

cnv non-cnv 
     18740   43653 

Why would these tests have such different outcomes?

The dataset is quite small, only 18 individuals, and this example is just chromosome 1 but the pattern is consistent across all chromosomes.

piyalkarum commented 6 months ago

Hi,

Thank you for using our package.

Regarding the maf function, thank you for reporting this. You're right about the fact that it produces NAs. However, this is not related to the presence of multi-allelic sites. This stems from the fact that we have not updated maf function since the format change of the hetTgen function, and it tries to get depth values from the alternative allele (ALT) column. This has been fixed now and should work properly.[tip: REINSTALL THE PACKAGE FROM GITHUB]. Further, I also added a new functionality to make sure that the users understand what exactly the maf function does. In essence, maf (if drop.multi=TRUE) drops all sites with more than two alleles OR (if drop.multi=FALSE) drops the alleles with the minimum frequencies in multi-allelic sites and retains only two alleles per site. This means that all sites are assumed to be bi-allelic by keeping only the two major alleles. This situation is ideal when you have mostly multi-allelic sites. Read the help page of maf function for more details.

Regarding filtering, we have indeed included filtering steps in the rCNV pipeline itself. These, however, are not particularly different from filtering with other software (e.g., vcftools, bcftools, GATK) for the same variables (e.g. missingness, MAF, etc.). As a matter of fact, we recommend using computationally faster software to do filtering for large datasets. Pre-filtering outside of rCNV R-environment should not introduce any differences to the analysis. However, you should make sure that the following filtering steps have been applied before you generate allele.info (using allele.info() function):

  1. missing SNPs and samples higher than 50% removed
  2. highly related samples (relatedness > 0.9) and excess heterozygous samples are removed
  3. misclassification and odd number depth values have been corrected (as in ad.correct) AND
  4. allele depth values have been normalized using an appropriate method (as in cpm.normal) Note: the last two steps are easier done with rCNV.

Looking at your deviants and cnv, I can say that it is highly likely that the input data was not optimally filtered. Therefore, I suggest you perform the filtering again carefully. However, if you still see this large difference, here are some suggestions to troubleshoot where the pipeline is not working well on your data:

  1. check for excess NAs (or "./.") in the allele depth table (the output of hetTgen)
  2. check if sequencing bias actually exists in your data. You can do this with the output plot of allele.info() by setting plot.allele.cov = TRUE . If you don't see sequencing bias, try using *.05 test option in dupGet and cnv functions.
  3. try using different test methods in the dupGet and cnv functions. For instance, you can use test=c("z.all","chi.05") or any other combinations.
  4. if kmeans produces a significantly different (higher) number of cnvs than deviants, use intersection filter in the cnv function. Although I listed this tip at the end, I think you should test this first before any of the above for a quick check of your results. This will quickly give you an idea of the quality of your data; kmeans is very sensitive to the quality of the data whereas intersection is not.

I hope these tips will help you. Please feel free to reach out if you have further issues/suggestions.

Best of luck!

P.S. DON'T FORGET TO RE-INSTALL THE PACKAGE FROM GITHUB. New updates will be added to CRAN soon

laneatmore commented 5 months ago

Hi Piyal,

Thanks so much for this.

I have reinstalled from Github and re-ran everything with a more stringent MAF filter (5%). I did a quick sanity check of the QC with rCNV.

rCNV outputs the missingness statistics like so: Screenshot 2024-05-02 at 6 32 24 PM

I've already filtered for relatedness, and a quick sanity check with rCNV shows there are no related samples: Screenshot 2024-05-02 at 6 33 08 PM

I then go to normalize my data and rCNV tells me there are outliers detected:

 ad.tab <-hetTgen(vcf,info.type = 'AD')
 gt<-hetTgen(vcf,info.type = 'GT')
 ad.tab<-ad.correct(ad.tab,gt.table=gt)

 par(mfrow=c(1,1))
 ad.nor<-cpm.normal(ad.tab,method="MedR")

 OUTLIERS DETECTED
 Consider removing the samples:
 M.Cheh_fem1 M.Dee_fem1 M.Dee_fem2

Screenshot 2024-05-02 at 6 52 45 PM

(I don't understand the x axis here)

I have checked for reference bias using allele.info(), which shows there is definitely bias in the dataset:

Screenshot 2024-05-02 at 7 02 29 PM

There is much higher coverage of ref allele in both homo and heterozygotes, but overall higher coverage for heterozygotes than homozygotes. This VCF contains ONLY chromosome 1.

Again testing for deviants/CNVS:

deviants<-dupGet(A.info,test = c("z.all","chi.all"),plot=TRUE,verbose = TRUE)
CV<-cnv(A.info, test=c("z.all","chi.all"), filter = "kmeans", plot=TRUE)
 table(CV$dup.stat)
    cnv non-cnv 
    18688   43705
 table(deviants$dup.stat)
   deviant non-deviant 
       1       62392 

When I use "intersection" instead of "kmeans," I get 0 CNVs. This is obviously much closer to the count from deviants, but this doesn't make sense. This is a species with known residual tetrasomy (up to 20% of the genome), so I should be able to find more than one deviant/CNV. Screenshot 2024-05-02 at 7 14 10 PM

Could this also be a question of sample number? Unfortunately, I am relying entirely on public data for this and do not have access to a large dataset for this species. The distribution of proportion of heterozygotes is thus less continuous than in a larger dataset.

I am running this per chromosome as I would very much like to be able to look at the sliding window with dup.validate to make sure the residual tetrasomy is appearing in the correct regions (subtelomeric) (I haven't been able to get that running properly either, but perhaps that is a problem for a later stage, like when I am successful in getting the Github version installed :) ).

I'm still exploring different parameter settings for cnv and dup.get, but I'd be grateful for any insight you have!

Again, thanks so much for your help!

piyalkarum commented 5 months ago

Hi,

Thank you for providing detailed information. After checking your data, I realized that you have only 16 samples, and as you guessed, this may be a cause for concern. Allow me to elaborate more on that. The big difference between the deviants and CNVs you see may be due to all or a combination of the following factors.

  1. Since the rCNV algorithm relies on heterozygotes, if most of your snps have a proportionately low number of heterozygotes, the deviation of allele ratio increases and may result in false positives or negatives. However, this also depends on the depth values you have in your dataset. Check the supplementary figure S1 in our publication for a reference on this.
  2. when the number of samples is low, the Bonferroni correction we apply on the p-value of Z and Chi values inflates the expected threshold, leading to false negatives. This is why you don't see any deviants in your dataset.
  3. In cases where there is a significantly low number of deviants. The CNVs detected by cnv() function using kmeans filtering is more reliable because, unlike the p-value mentioned in number 2, k-means is not affected by the low number of samples.
  4. If your data is whole genome sequencing, despite having slight reference bias as in your case, it is recommended to use .05 stat values in the deviant and cnv detection steps. Reference bias correction is for the most part, for exome capture sequencing.

Based on these, here are my recommendations: I. check how many of your snps have more than 5 heterozygotes and get the mean depth values for them. You can check where they fall in the allele ratio deviation plot (as in Fig S1A) by using the function depthVsSample() and plotting your snps on the same plot. ii. use z.05 and chi.05 with dupGet and see if you get any deviants and compare them with what you get with cnv() function. iii. use kmeans in the cnv() function directly (regardless of what you get from dupGet()) and plot the "Proportion of Heterozygotes" vs "Allele Median Ratio" plot. In this plot, you should be able to see that the cnvs are plotted correctly. To check them further, you could obtain the total and mean depth values of each snp and plot it along the chromosome, like in the figure 3 of the rCNV Molecular Ecology paper; if your cnvs are clustered together in specific regions, that is a reliable detection. iv. This might take a little bit of experimental scripting from your side, but you could also try to plot the Z, chi, and excess of heterozygotes separately like in the figure 2 of the rCNV paper. All that information is there in the allele.info() output.

My co-authors and I are discussing a solution to the low number of samples, like in your case, and we will implement a different FDR to the p-values in the future.

I hope this helps. Thanks again for using our package and reporting issues. Please feel free to reach out if you have further issues.

Piyal