lifebit-ai / gwas

GWAS pipeline using SAIGE
5 stars 2 forks source link

Error when running gel-gwas with simulated data related to inconsistent number of participants between the pheno file and the VCF file #87

Closed mmeier93 closed 3 years ago

mmeier93 commented 4 years ago

The following job fails: https://cloudos.lifebit.ai/public/jobs/5fb580a1cc96a201128dc3fd

The error seems to related to an inconsistent number of participants between the pheno file (118?) and the VCF files (40). Should the pipeline not perform a subset?

mcamarad commented 4 years ago

Thanks @mmeier93 for raising this issue! 👍 Very interesting 🤔 ! after looking at it I saw that this issue is more related to SAIGE than to the GWAS pipeline itself as it occurs within the SAIGE function SPAGMMATtest. The scripts you can see in the process within pipeline just call the function from the SAIGE package that behaves like this. So we have no control over the behaviour of this function. This doesn't mean we cannot work around it to ensure that it doesn't lead to errors 😉 .

Having said this, there are two main considerations:

  1. The transform_cb_output.R performs a subset at the metadata level in which if there's no platekey platekey == '' the sample is removed from the metadata, but it doesn't check the aggregated VCF file itself. This logic doesn't cover cases like the one you showed. Also, the pipeline doesn't expect to have phenotypic data for samples with no VCFs. As this should be addressed during ingestion of genotypic & phenotypic data via data validation.

    • What the pipeline actually expects is the opposite case: having more samples in the VCFs than samples in the metadata. The pipeline seems to work fine in this scenario. The test VCFs we use contain >1K samples and the phenotypic data only 100s.
  2. A check of this type would require to read an aggregated VCF file header within transform_cb_output.R, which we can do for sure, to ensure we have a 1 phenotype : 1 genotype correspondance and this process doesn't break the pipeline in cases like yours.

I will check with @cgpu 👀 . In the meantime my suggestion is that you remove all the samples from your metadata that they are not in your simulated aggregated VCFs.