related-sciences / ukb-gwas-pipeline-nealelab

Pipeline for reproduction of NealeLab 2018 UKB GWAS

Debug PHESANT failures after move to filtered samples in inputs #30

Closed eric-czech closed 3 years ago

eric-czech commented 3 years ago

Filtering the PHESANT input phenotype data down to the sample IDs that pass genetic data QC results in an error like this:

... # Usual output
x630_0_0
The current variable is 630_0
x670_0_0
The current variable is 670_0
x680_0_0
[1] "-7=100"
[1] "ERROR: 670_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns selected\n \n"
Error in testAssociations(currentVar, currentVarShort, thisdata, varlogfile) :
  object 'data_to_add' not found
In addition: Warning message:
In if (u < 0) { :
  the condition has length > 1 and only the first element will be used
[1] "-7=100"
[1] "ERROR: 670_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns selected\n \n"
Error in testAssociations(currentVar, currentVarShort, thisdata, varlogfile) :
  object 'data_to_add' not found
In addition: Warning message:
In if (u < 0) { :
  the condition has length > 1 and only the first element will be used
Warning message:
In if (u < 0) { ... :
  the condition has length > 1 and only the first element will be used

Dumping the environment and debugging it suggests that the data for the failed field (670) may be missing one of the values in its encoding:

  Browse[1]> table(thisdata$currentVarValues)

      -7     -3      1      2      3      4      5
     748    172 331401  32214    500    863     43

Field: https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=680
Encoding: https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=100287
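
The manual check described here, comparing the values observed in the data against the values defined by a data coding, can be sketched in Python. This is only an illustration under stated assumptions: the observed counts are copied from the `table()` output above, while `hypothetical_coding` is a placeholder set of values and not the actual UKB data coding, and `missing_coded_values` is a made-up helper name.

```python
from collections import Counter


def missing_coded_values(observed_counts, coding_values):
    """Return the coding values that never occur in the observed data, sorted."""
    return sorted(v for v in coding_values if observed_counts.get(v, 0) == 0)


# Observed counts copied from the table() output in this thread.
observed = Counter({-7: 748, -3: 172, 1: 331401, 2: 32214, 3: 500, 4: 863, 5: 43})

# ASSUMPTION: placeholder set of coded values for illustration only,
# not the real contents of any UKB data coding.
hypothetical_coding = {-7, -3, 1, 2, 3, 4, 5, 6}

missing = missing_coded_values(observed, hypothetical_coding)
print(missing)
```

An empty result would mean every coded value occurs at least once, which would rule out this explanation for the field being checked.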

eric-czech commented 3 years ago

The whole file finished processing overnight and this error only occurs for two fields:

gsutil cat gs://rs-ukb/prep/main/log/phesant/phesant.log | grep -C 10 Error
[1] "ERROR: 670_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns 
[1] "ERROR: 1220_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns

There are no obvious differences between them, and the second field (1220) has at least one occurrence of every value in its data coding, so the missing-value theory above is wrong.
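
As a rough alternative to reading the grep output by eye, the affected field IDs could be pulled out of the log with a short Python script. This is a sketch only: it assumes error lines keep the `ERROR: <field>_<instance>` shape shown above, the sample lines are the (truncated) ones from this thread, and `failing_fields` is a hypothetical helper name.

```python
import re

# Matches the "ERROR: 670_0 " prefix seen in the PHESANT log lines above.
ERROR_RE = re.compile(r"ERROR: (\d+)_(\d+) ")


def failing_fields(log_lines):
    """Return the distinct field IDs that appear in ERROR lines, in first-seen order."""
    seen = []
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Sample lines taken from this thread; the error messages are truncated as in the log excerpt.
sample_log = [
    'The current variable is 670_0',
    '[1] "ERROR: 670_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns',
    '[1] "ERROR: 670_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns',
    '[1] "ERROR: 1220_0 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns',
]

fields = failing_fields(sample_log)
print(fields)
```

Deduplicating matters here because the same error is printed more than once per field in the PHESANT output.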

For now, these errors are being ignored with the following edits to PHESANT: https://github.com/eric-czech/PHESANT/commit/05997a79c734a0706f7622e8c9c734984f1da130

The rest of the data appears to be fine, so these two fields will be omitted going forward.

hammer commented 3 years ago

Maybe worth asking about these two fields on the UKB or PHESANT mailing lists? A quick glance at 670 and 1220 makes me wonder if the use of punctuation in the coded "meaning" might be an issue? Not critical though!
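
One cheap way to explore the punctuation idea would be to flag coded meanings that contain punctuation characters. The Python sketch below is purely illustrative: the meanings in `example_coding` are made up rather than copied from the UKB showcase, `meanings_with_punctuation` is a hypothetical helper, and nothing here confirms that punctuation is actually what breaks PHESANT.

```python
import string


def meanings_with_punctuation(coding):
    """Map each coded value to the sorted punctuation characters found in its meaning."""
    flagged = {}
    for value, meaning in coding.items():
        found = sorted({ch for ch in meaning if ch in string.punctuation})
        if found:
            flagged[value] = found
    return flagged


# ASSUMPTION: made-up meanings for illustration, not the real data coding.
example_coding = {
    1: "A house or bungalow",
    2: "A flat, maisonette or apartment",
    3: "Mobile structure (i.e. caravan)",
}

flagged = meanings_with_punctuation(example_coding)
print(flagged)
```

Running this over the codings of fields that process cleanly as well as 670 and 1220 would show whether punctuation is specific to the failing fields.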