Closed mmeier93 closed 3 years ago
Thanks @mmeier93 for raising this issue! 👍 Very interesting 🤔 ! After looking into it, I found that this issue is more related to SAIGE than to the GWAS pipeline itself, as it occurs within the SAIGE function `SPAGMMATtest`. The scripts you see in the pipeline's processes just call that function from the SAIGE package, so we have no control over its behaviour. That doesn't mean we cannot work around it to ensure it doesn't lead to errors 😉 .
Having said this, there are two main considerations:
The `transform_cb_output.R` script subsets at the metadata level: any sample with an empty platekey (`platekey == ''`) is removed from the metadata. However, it does not check the aggregated VCF file itself, so this logic doesn't cover cases like the one you showed. The pipeline also doesn't expect to have phenotypic data for samples with no VCFs, since that should be addressed during ingestion of genotypic & phenotypic data via data validation.
A check of this type would require reading the aggregated VCF file header within `transform_cb_output.R`, which we can certainly do, to ensure a 1 phenotype : 1 genotype correspondence so that this process doesn't break the pipeline in cases like yours.
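For illustration, here is a minimal sketch of such a check (in Python rather than R, just to show the idea). It relies only on the VCF spec: sample IDs follow the nine fixed columns of the `#CHROM` header line. The function names are hypothetical, not part of the pipeline:

```python
import gzip

def vcf_sample_ids(vcf_path):
    """Return the sample IDs listed on the #CHROM header line of a (optionally gzipped) VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#CHROM"):
                # Fixed VCF columns (CHROM..FORMAT) occupy indices 0-8; samples follow.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("No #CHROM header line found in " + vcf_path)

def check_one_to_one(pheno_ids, vcf_ids):
    """Report samples present in only one of the two sources (empty lists mean a clean 1:1 match)."""
    pheno, geno = set(pheno_ids), set(vcf_ids)
    return {"pheno_only": sorted(pheno - geno), "vcf_only": sorted(geno - pheno)}
```

In your job, `check_one_to_one` would report the ~78 phenotype-only samples before SAIGE ever runs, instead of failing inside `SPAGMMATtest`.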
I will check with @cgpu 👀 . In the meantime, my suggestion is that you remove from your metadata all samples that are not present in your simulated aggregated VCFs.
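As a rough sketch of that workaround (assuming a tab-separated metadata file with a `platekey` sample-ID column; adjust the column name and separator to your actual file):

```python
import pandas as pd

def subset_metadata_to_vcf(metadata_path, vcf_samples, id_column="platekey"):
    """Keep only metadata rows whose sample ID appears in the aggregated VCF.

    `id_column` is a hypothetical name for the sample-ID column; change it
    to whatever your metadata file actually uses.
    """
    meta = pd.read_csv(metadata_path, sep="\t")
    return meta[meta[id_column].isin(set(vcf_samples))]
```

Writing the returned frame back out (e.g. `kept.to_csv(path, sep="\t", index=False)`) gives you a metadata file whose samples all have genotypes.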
The following job fails: https://cloudos.lifebit.ai/public/jobs/5fb580a1cc96a201128dc3fd
The error seems to be related to an inconsistent number of participants between the pheno file (118?) and the VCF files (40). Should the pipeline not perform a subset?