jordiabante / CpelAsm.jl

Julia module to perform haplotype allele-specific DNA methylation analysis.
https://jordiabante.github.io/CpelAsm.jl/dev/
MIT License

No significant T-MML, T-NME and PDM-haps found in samples validated for X-skew testing #11

Closed: nmfad closed this issue 4 months ago

nmfad commented 5 months ago

Hi Jordi,

I am testing samples that have been validated to show skewed XCI by clinical X-skew testing, as well as monoallelic expression based on DNA & RNA sequencing.

I am now testing the same samples for allele-specific methylation on the X chromosome using CpelAsm, and although I was able to get it to run smoothly with no errors, I am not seeing any significant p-values, and the following files come out empty:

__tmml_null.bedGraph - empty
__tnme_null.bedGraph - empty
__tpdm_null.bedGraph - empty

Here are the steps I followed for each sample:

1) WGS --> alignment using GATK --> HET VCF --> high-confidence HETs (DP >= 10 and GQ >= 20). This gave me roughly ~60k het SNVs across the X chromosome.

2) N-masking of the X chromosome FASTA sequence (build 37) with SNPsplit, using the ~60k SNVs above.

3) Alignment of the WGBS data to the N-masked reference genome using Bismark (trimming was done with Trim Galore to remove adapters), and only reads with MQ >= 20 were retained.

4) Splitting of the Bismark BAM generated above into allele1.bam and allele2.bam with SNPsplit, using the 60k SNVs. A majority of the reads in the N-masked BAM were unassigned because they didn't overlap the 60k SNVs, understandably so, but allele1 and allele2 had roughly ~300k reads assigned to them.

5) Use WhatsHap to generate a phased VCF from the VCF generated in step 1. Roughly ~30k variants could be phased (50% of the original).

6) Use run_analysis(b1,b2,b1,vcf,fa,out;win_exp=10,n_null=1000,cov_ths=5) within CpelAsm. The VCF supplied here is the phased VCF from WhatsHap, b1 and b2 are the BAMs generated in step 4, and the FASTA file is the one generated by SNPsplit in step 2 (see the sketch at the end of this comment).

The program runs for roughly 45 minutes, but I do not get any significant hits. All of the WGBS is deeply sequenced and I used high-confidence HET SNVs. Could you suggest what the issue might be? Should I not filter the HET SNVs at all, i.e. use all of them without applying any quality filters in step 1? Is that the main issue here? Any insight from you will be very helpful. If I don't put quality filters on the HET SNVs, each sample has roughly ~70k HET SNVs that can be used for N-masking and allele splitting, but after WhatsHap phasing they are still reduced to ~30-40k SNVs.

Regards,
Numrah
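For reference, the step-6 call looked roughly like the sketch below; the file names are placeholders for illustration, not my actual paths:

```julia
# Sketch of the run_analysis call from step 6 (README form); file names are
# placeholders, not actual paths.
using CpelAsm

b1  = "sample.allele1.bam"   # haplotype-1 BAM from SNPsplit (step 4)
b2  = "sample.allele2.bam"   # haplotype-2 BAM from SNPsplit (step 4)
vcf = "sample.phased.vcf"    # phased VCF from WhatsHap (step 5)
fa  = "chrX.Nmasked.fa"      # N-masked FASTA from SNPsplit (step 2)
out = "cpelasm_out/sample"   # output prefix

# As given in the README instructions, b1 is passed again as the third argument.
run_analysis(b1, b2, b1, vcf, fa, out; win_exp=10, n_null=1000, cov_ths=5)
```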
jordiabante commented 5 months ago

Hi Numrah,

The steps look fine to me in general. Is there any information that cpelasm prints out? I think your point about the HET SNVs will only affect the alignment rate in step 3, but shouldn't have an effect otherwise...

Are you only passing a single sample at a time? That is, b1 and b2 should be BAM files with haplotype 1 and 2, respectively, for the same sample. It's a little suspicious that there is no prefix in the output file names...

Also, are you able to run the toy example just fine and see the output?

Thanks,

nmfad commented 5 months ago

Hi Jordi,

Thanks so much for responding. I've answered your questions below.

> Is there any information that cpelasm prints out?

Yes, it does print information; towards the end the output looks something like this:

```
2024-05-27 19:46:51: No null statistics computed in 999-th attempt ...
2024-05-27 19:46:51: No null statistics computed in 1000-th attempt ...
2024-05-27 19:46:51: No null statistics computed in 1001-th attempt ...
2024-05-27 19:46:51: Exceeded 1000 attempts in comp_tnull ...
2024-05-27 19:46:51: Done with null statistics ...
2024-05-27 19:46:51: Computing p-values in heterozygous loci ...
2024-05-27 19:46:51: Computing p-values Tmml ...
2024-05-27 19:46:51: Computing p-values Tnme ...
2024-05-27 19:46:51: Computing p-values Tpdm ...
2024-05-27 19:46:51: Done with p-values ...
```

> Are you only passing a single sample at a time? That is, b1 and b2 should be BAM files with haplotype 1 and 2, respectively, for the same sample. It's a little suspicious that there is no prefix in the output file names...

> Also, are you able to run the toy example just fine and see the output?

Yes, I was able to run the toy example successfully and see the outputs listed below; no issues there. I am also seeing significant p-values in the *_pvals.bedGraph files.

```
example_het.cpelasm.gff
example_hom.cpelasm.gff
example_mml1.bedGraph
example_mml2.bedGraph
example_nme1.bedGraph
example_nme2.bedGraph
example_pdm.bedGraph
example_tmml_null.bedGraph
example_tmml_pvals.bedGraph
example_tnme_null.bedGraph
example_tnme_pvals.bedGraph
example_tpdm_null.bedGraph
example_tpdm_pvals.bedGraph
```

One of the samples I am testing CpelAsm on is actually validated to be skewed for XCI, so I would expect to see significant p-values for MML and PDM at least. But the *_pvals.bedGraph files don't get generated, and the following three files are empty:

```
_tmml_null.bedGraph
_tnme_null.bedGraph
_tpdm_null.bedGraph
```

Do you have any insight on what is influencing the p-values? Also, does CpelAsm only calculate ASM in regions around the SNVs, or does it also use regions that don't have a surrounding SNV, by extending the haplotype from a region where an SNV overlaps CpG sites? I am not sure what I need to change upstream to get the right output.
nmfad commented 5 months ago

Hi Jordi,

A follow-up question for you: I am looking at the toy example directory within CpelAsm, and I see that inside the fasta folder, .../CpelAsm/uD5CJ/test/fasta, there are 3 directories:

1) n-masked - which I understand is the N-masked reference generated using the SNPsplit genome preparation step.

2) allele1 - contains an allele1 FASTA with the bisulphite genome from Bismark.

3) allele2 - contains an allele2 FASTA with the bisulphite genome from Bismark.

Can you please explain at what step you align allele1 and allele2 reads independently with Bismark? I was under the impression that the Bismark step is only run against the N-masked reference to align all WGBS reads within a sample, and that SNPsplit is then used to split the resulting Bismark BAM into allele1.bam and allele2.bam. Are you re-aligning the allele1.bam and allele2.bam generated by SNPsplit with Bismark? Can you please explain how the files under the following paths are generated?

```
.../CpelAsm/uD5CJ/test/fasta/allele1
.../CpelAsm/uD5CJ/test/fasta/allele2
.../CpelAsm/uD5CJ/test/bam/example.a1.bam
.../CpelAsm/uD5CJ/test/bam/example.a2.bam
```

Maybe I am not seeing significant p-values because there is something I am doing incorrectly in the process, or I may have misunderstood the methods section of your paper. If you can explain the above, that would be very helpful!

nmfad commented 4 months ago

Hi Jordi,

Apologies for bothering you. I was able to figure out the following two things, and that seems to have partially solved some of my issues (no p-value files getting generated, and the test statistic bedGraph files being empty). I am explaining them below in case someone else here needs it.

1) The run_analysis(b1,b2,b1,vcf,fa,out;win_exp=10,n_null=1000,cov_ths=5) step mentioned in the README instructions is missing the unassigned_bam argument and a few other arguments, which I was able to find on the "Natcomms" branch. The call that worked and eventually generated p-values is as follows:

run_analysis(bam1, bam2, bam_unassigned, vcf, fa, out; g_max=g_max, cov_ths=cov_ths, win_exp=win_exp, trim=trim, n_null=1000, n_max=25)

I set g_max = 500, cov_ths = 8, win_exp = 100 and trim = (5,5,5,5). I wanted to let you know so that anyone else using the instructions on the main page doesn't run into the same issues, or so that the instructions on the main page can be updated to include the unassigned_bam argument (see the sketch after item 2 below).

2) I noticed that when running this as a Julia script from the shell command line (rather than inside the Julia REPL), one needs to have julia in their PATH, and secondly, the command line will only work if the script is run from within the cpel_install directory. I used a few lines of Julia in a small driver script for this.
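A minimal sketch of what such a driver script could look like, using the call and parameter values from item 1 above. The script name, file names, and paths are placeholders made up for illustration; they are not files that ship with CpelAsm:

```julia
# run_cpelasm.jl -- hypothetical driver script. Run it from inside the
# cpel_install directory with `julia run_cpelasm.jl` (julia must be on PATH).
using CpelAsm

# Inputs (placeholder paths): haplotype BAMs and unassigned BAM from SNPsplit,
# phased VCF from WhatsHap, N-masked FASTA, and an output prefix.
bam1           = "sample.allele1.bam"
bam2           = "sample.allele2.bam"
bam_unassigned = "sample.unassigned.bam"
vcf            = "sample.phased.vcf"
fa             = "chrX.Nmasked.fa"
out            = "cpelasm_out/sample"

# Parameter values that worked in item 1 above.
g_max   = 500
cov_ths = 8
win_exp = 100
trim    = (5, 5, 5, 5)

# Call from the "Natcomms" branch, including the unassigned BAM.
run_analysis(bam1, bam2, bam_unassigned, vcf, fa, out;
             g_max=g_max, cov_ths=cov_ths, win_exp=win_exp, trim=trim,
             n_null=1000, n_max=25)
```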
