liusihan / seGMM

A new tool to infer sex from massively parallel sequencing data.
MIT License
13 stars 2 forks source link

Very different values with WES data #2

Closed Fred-07 closed 1 year ago

Fred-07 commented 1 year ago

Hello,

I tried to use seGMM on WES data mapped on hg19. Capture kit is Twist (www.twistbioscience.com) The values obtained with seGMM are very different from the reference file 1000G.WES.txt. The sex determination does not work as seGMM reports only female with X. Example: sample_A900 0.180791 0.0252986 0.00143012 17.6898 0.12 Female X

One possibility would be to create a new reference file based on many samples. How many samples would be required to generate a new reference file? The other possibility is that the values are not correct. Could you please comment about this? Thank you

sampleid    XH  Xmap    Ymap    XYratio SRY
sample_A561 0.551688    0.0480034   0.000174466 275.145 0.02
sample_A573 0.579632    0.047957    0.000159149 301.334 0.02
sample_A640 0.162483    0.0249835   0.00145839  17.1309 0.15
sample_A641 0.603061    0.0475572   0.000179481 264.971 0.02
sample_A642 0.178988    0.0249668   0.00149554  16.6942 0.15
sample_A644 0.175031    0.0247872   0.00150849  16.4318 0.16
sample_A648 0.184823    0.0248933   0.00151391  16.4431 0.14
sample_A656 0.141264    0.0248336   0.00148154  16.762  0.14
sample_A657 0.698686    0.047548    0.000157188 302.491 0.01
sample_A658 0.176712    0.024932    0.00149047  16.7276 0.14
sample_A659 0.168956    0.0289115   0.00210318  13.7466 0.18
sample_A669 0.186164    0.0250368   0.00134935  18.5547 0.13
sample_A676 0.713023    0.0475422   0.000169815 279.965 0.02
sample_A680 0.602988    0.0475499   0.000165187 287.855 0.01
liusihan commented 1 year ago

Hi,

In seGMM paper, the 1000G WES data were downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/. You can find a summary .index file which included the path stored the exome data in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.exome.alignment.index. We collected the Xmap, Ymap, XYratio and SRY features from the downloaded bam files.

By the way, for the 1000 exon data, they used a union of two different pull-down lists: NimbleGen EZ_exome v1 and Agilent sure select v2. The targets of these data can be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/20130108.exome.targets.bed. To calculated XH, we download the ALL.chrX.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz and extracted SNPs located in the targeted regions in X chromosome.

I see that the XH and SRY values are varied between the 1000G.WES.txt and your data. The difference in XH value may due to differences in data processing methods. But the SRY value looks strange. I would ask are all the samples you included female?

If you want to generate a new reference file, I suggest you should have at least 100 samples. And the ratio of male to female samples in the reference file should ideally be balanced.

Also, it looks like the features can clearly separate female and male samples in your data. I recommend that you can run seGMM without include reference data.

Thanks!

Fred-07 commented 1 year ago

Thanks for your reply. Following your comment about the SRY, I checked the files:

chrom   length  bases   mean    min     max
Y       59373566        6682845 0.11    0       1035
Y_region        614     31604   51.47   0       72
total   59373566        6682845 0.11    0       1035
total_region    614     31604   51.47   0       72
Sample   0.11

Should it be the Y_region that is extracted from mosdepth.summary.txt?

liusihan commented 1 year ago

Yes! We have fixed this bug in the newest version of seGMM.

Please update seGMM through conda or PIP.

Thanks

Fred-07 commented 1 year ago

Still the same problem with mosdepth.summary.txt Looking at the commits, I could not find where and when the code was modified. (I don't use pip/conda for seGMM)

liusihan commented 1 year ago

We have updated the main.py file and seGMM.r file in the GitHub page. Please find the modified code. Thanks!