rounakdey / FastSparseGRM

Efficiently calculate ancestry-adjusted sparse GRM
MIT License

Inquiry on removeHigherDegree Function Logic and Guidance for Datasets with Specific PropIBD Range in FastSparseGRM #13

Closed yuyu614 closed 5 months ago

yuyu614 commented 6 months ago

1. Regarding the removeHigherDegree function: I noticed that when the degree parameter is set to 2, the function removes relationships marked as "3rd" but not those marked as "4th". Could you explain the rationale behind this? I expected that for degree=2 the function would remove all relationships more distant than the specified degree, including both "3rd" and "4th".

2. Dataset with a specific PropIBD range: About 90% of the PropIBD values in my dataset fall in the 0.2-0.3 range. The dataset has about 5000 samples, 2000 of which have kinship relations. Given this, I'm concerned about whether the current pipeline suits my data. Is there a particular adjustment or modification to the pipeline that you would recommend for datasets with such a high concentration of related samples in this PropIBD range?

I appreciate your time and any guidance you can provide.

rounakdey commented 6 months ago

Hi,

  1. You are right. That was a silly mistake. I have fixed the removeHigherDegree function now.
  2. Those PropIBD values suggest that the subjects are most likely parent-offspring or sibling pairs. I think this pipeline should be robust to such PropIBD distributions. FastSparseGRM uses PropIBD only to estimate the structure of the non-zero entries; it doesn't use the PropIBD values to calculate those entries. So the PropIBD distribution doesn't really affect the estimation of the sparse GRM. It only matters whether each PropIBD is higher or lower than the threshold you set through the degree parameter. Is there any reason you think it may not be suitable?
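
For readers following point 1, here is a minimal sketch of the behavior expected after the fix: with degree=2, both "3rd" and "4th" pairs are dropped. This is not the package's actual removeHigherDegree code. The `RelatedPair` struct, the `keepUpToDegree` and `degreeOfLabel` helpers, and the label set ("PO", "FS", "2nd", "3rd", "4th") are assumptions made for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// One inferred relationship between two samples.
// The struct and the label set used below are hypothetical.
struct RelatedPair {
    std::string id1;
    std::string id2;
    std::string inferredType;
};

// Map a relationship label to a numeric degree; unknown labels return 0.
int degreeOfLabel(const std::string& label) {
    static const std::unordered_map<std::string, int> kDegree = {
        {"PO", 1}, {"FS", 1}, {"2nd", 2}, {"3rd", 3}, {"4th", 4}};
    auto it = kDegree.find(label);
    return it == kDegree.end() ? 0 : it->second;
}

// Keep only pairs whose degree is known and no more distant than maxDegree.
std::vector<RelatedPair> keepUpToDegree(const std::vector<RelatedPair>& pairs,
                                        int maxDegree) {
    std::vector<RelatedPair> kept;
    for (const auto& p : pairs) {
        const int d = degreeOfLabel(p.inferredType);
        if (d >= 1 && d <= maxDegree) {
            kept.push_back(p);
        }
    }
    return kept;
}
```

Comparing a numeric degree against the cutoff, rather than matching individual label strings, avoids the kind of slip where one more-distant label is handled and another is forgotten.
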
yuyu614 commented 6 months ago

Hi,

  1. Thank you very much for your prompt response. I fear I may not have been entirely clear in my initial query, so I'd like to clarify my situation a bit further. My dataset exhibits strong familial relationships; out of 5000 individuals, approximately 2000 have kinship ties. Upon utilizing the pipeline and reaching the "Extract unrelated samples" stage, I found that it yielded only 800 individuals deemed unrelated. I am concerned about how this might affect subsequent analyses. Additionally, I'm curious whether the count of unrelated individuals needs to match the real-world scenario exactly or if it's just a characteristic suitable for this pipeline's application.

  2. Moreover, I have a question regarding the calculation of genetic divergence, which uses the formula double div = ((double)(nhethet-2*nhomopp)) / ((double) (nhet[sampi] + nhet[sampj])), while the cutoff is determined by -2^-(degree+1.5). From what I understand, the degree of kinship typically considers the consistency of genotypes at the same locus, and genetic divergence assesses the proportion of heterozygotes and homozygotes between two samples. I'm interested in understanding how the cutoff is set in this context and how it integrates with the kinship determination criteria.
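
To make the quantities in this question concrete, here is a small sketch that evaluates the quoted expression and the quoted cutoff for one pair of samples. It is not taken from the package source. It assumes genotypes coded as 0/1/2 alternate-allele counts with no missing calls, that `nhethet` counts variants where both samples are heterozygous, and that `nhomopp` counts variants where the samples are opposite homozygotes (0 versus 2). The function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Evaluate (nhethet - 2*nhomopp) / (nhet_i + nhet_j) for two samples.
// Assumes gi and gj have equal length and hold 0/1/2 genotype codes.
// Returns NaN when neither sample has any heterozygous call.
double divergence(const std::vector<int>& gi, const std::vector<int>& gj) {
    int nhethet = 0;  // both samples heterozygous
    int nhomopp = 0;  // opposite homozygotes (0 vs 2)
    int nhetI = 0;    // heterozygous calls in sample i
    int nhetJ = 0;    // heterozygous calls in sample j
    for (std::size_t k = 0; k < gi.size(); ++k) {
        if (gi[k] == 1) ++nhetI;
        if (gj[k] == 1) ++nhetJ;
        if (gi[k] == 1 && gj[k] == 1) ++nhethet;
        if ((gi[k] == 0 && gj[k] == 2) || (gi[k] == 2 && gj[k] == 0)) ++nhomopp;
    }
    if (nhetI + nhetJ == 0) {
        return std::nan("");
    }
    return static_cast<double>(nhethet - 2 * nhomopp) /
           static_cast<double>(nhetI + nhetJ);
}

// Cutoff as quoted in the question: -2^-(degree + 1.5).
double divergenceCutoff(int degree) {
    return -std::pow(2.0, -(degree + 1.5));
}
```

Under these assumptions, two identical samples give a value of 0.5, and opposite homozygotes push the value below zero, which is why the cutoff is a negative number.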

Thank you once again for your assistance, and I look forward to your further guidance.

rounakdey commented 5 months ago
  1. Your dataset is quite an edge case indeed. Extracting the largest possible set of unrelated individuals from a relatedness graph is an NP-hard problem in general. On top of that, we also want the unrelated individuals to span the population structure, that is, we want the set of unrelated individuals to represent the population heterogeneity as much as possible. Therefore, a perfect solution to this does not exist (at least to my understanding). The algorithm we used in FastSparseGRM is an intuitive and computationally manageable one that works for most datasets, but it can absolutely be suboptimal in edge cases like yours. The good thing is that the selected individuals are still guaranteed to be unrelated to each other, so the downstream application is still mostly valid. The bad thing is that selecting fewer unrelated individuals lowers the fidelity of the PCA, because of the smaller sample size and because potentially less ancestrally diverse subjects are included in the PCA calculation. My intuition is that it should still have very little impact on the final sparse GRM and any downstream genetic analysis, and the inferred number of unrelated individuals doesn't really need to match the real-world scenario. You can also use the file.include parameter to add individuals that you know a priori to be unrelated to the unrelated set, and you can tweak the divThresh and degree parameters as well.
  2. We use this ancestry divergence interpretation following this paper: https://pubmed.ncbi.nlm.nih.gov/25810074/ Please take a look.
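
To illustrate the trade-off in point 1, here is a generic greedy heuristic for picking an unrelated set from a relatedness graph. It is a sketch of the general idea only, not the algorithm FastSparseGRM implements, and it ignores the ancestry-spanning goal entirely. The `greedyUnrelated` function and its `seed` argument (loosely mirroring the idea of pre-including individuals known to be unrelated) are hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <set>
#include <utility>
#include <vector>

// Greedy heuristic: visit samples from fewest to most relatives and keep a
// sample only if none of its relatives has been kept already.
// `seed` lists samples forced into the set first; they are assumed to be
// mutually unrelated and are not checked against each other.
std::vector<int> greedyUnrelated(int n,
                                 const std::vector<std::pair<int, int>>& relatedPairs,
                                 const std::vector<int>& seed) {
    std::vector<std::set<int>> nbrs(n);
    for (const auto& e : relatedPairs) {
        nbrs[e.first].insert(e.second);
        nbrs[e.second].insert(e.first);
    }

    // Order samples by number of relatives, ties broken by index.
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return nbrs[a].size() < nbrs[b].size();
    });

    std::vector<bool> kept(n, false), blocked(n, false);
    auto keep = [&](int v) {
        kept[v] = true;
        for (int u : nbrs[v]) blocked[u] = true;  // relatives can no longer be kept
    };

    for (int v : seed) keep(v);
    for (int v : order) {
        if (!kept[v] && !blocked[v]) keep(v);
    }

    std::vector<int> result;
    for (int v = 0; v < n; ++v) {
        if (kept[v]) result.push_back(v);
    }
    return result;
}
```

The result is always a valid unrelated set given a valid seed, but its size depends on the visiting order and on the seed, which is one way a heuristic can return far fewer individuals than the true maximum in a heavily related cohort.
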
yuyu614 commented 5 months ago

Thank you for your guidance and the valuable insights you've shared. I appreciate it greatly.