statgen / ruth

Robust Unified Hardy-Weinberg Equilibrium Test
Apache License 2.0
6 stars 2 forks source link

Chromosome X #10

Open angelosarmen opened 1 year ago

angelosarmen commented 1 year ago

Hi,

Can RUTH handle chromosome X? There isn't a mention of it in the manuscript, but the program accepts parameters for sex chromosomes. When I try to load a BCF file containing X-chromosome SNPs, however, I get a segmentation fault. It works fine if I remove those SNPs.

I compiled RUTH on macOS Ventura using htslib 1.18 installed with conda.

hyunminkang commented 1 year ago

It should work for X-chromosome, but it will assume diploid representation of genotype likelihood, and may not accept haploid genotype information. Also, the sex map should be provided. Please provide specific examples of the failure mode if you would like to report a bug.

angelosarmen commented 1 year ago

Thank you for the prompt reply.

I confirmed that segmentation fault occurred because in my (PLINK-generated) BCF file, male non-PAR X genotypes were encoded by a single allele (as is the standard: "Haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondria, should be indicated by having only one allele value."). After removing the male samples, RUTH ran without problems.

I'm attaching a set of test files that recreate the problem. samples.vcf has 20 samples, the first 10 of which are females and the remaining 10 are males (as specified in sex.tsv), and two X SNPs, one in the PAR and one in the non-PAR. Non-PAR male genotypes are represented by a single allele. The following command results in segmentation fault:

ruth --vcf samples.vcf --evec pcs.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

After replacing the single alleles with two identical alleles and saving the file as samples_fixed.vcf, the following command succeeds:

ruth --vcf samples_fixed.vcf --evec pcs.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

So, for RUTH to work, I should be replacing single alleles with two identical alleles for the male samples?

Finally, how does RUTH handle the PAR and the non-PAR? Does it perform the test using all samples for the PAR and females only for the non-PAR?

angelosarmen commented 1 year ago

@hyunminkang, could you please provide an update on this.

hyunminkang commented 1 year ago

Did you provide --sex-map? I think you need to do so. Otherwise, it will consider all samples as females.

Also, you need to provide genotype likelihood (PL) as diploids. I think it may work well with --field GT as it is not very well for chrX. If you insist using GT, you can use it at your own risk, but it would require diploid representation, making each genotypes as homozygotes.

Yes, It will use ploidy information for non-PAR only.

angelosarmen commented 1 year ago

I provided --sex-map in my analysis, but forgot to provide it in the tests. This still results in segmentation fault:

ruth --vcf samples.vcf --evec pcs.tsv --sex-map sex.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

while this doesn't:

ruth --vcf samples_fixed.vcf --evec pcs.tsv --sex-map sex.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

The error doesn't matter anymore, as I'm supposed to provide male genotypes as diploids anyway (I don't have the likelihoods, so I have to use the genotypes).

Using only females for the non-PAR is not the same as using ploidy information, though. To find out what you mean by "using ploidy information", I looked at the code of frequency_estimator::estimate_isaf_em_hwd() and understand that males are used to initialize individual-specific allele frequencies with estimate_pooled_af_em() and estimate the betas, but not theta (only females are used for that). What is being tested for the non-PAR, however, when all samples are male (i.e., all samples are haploid)? Same for chromosome Y and the mitochondria. I don't understand what testing for HWE means in those cases.