statgen / METAL

Meta-analysis of genomewide association scans

Heterogeneity analysis fails #8

Open lindabroer-g opened 4 years ago

lindabroer-g commented 4 years ago

During meta-analysis of 18 cohorts the heterogeneity analysis fails with the following error message:

ERROR: Input file has changed since analysis started

When rerunning the same analysis with fewer cohorts it works fine. I do not touch the input files during analysis. They are gzipped and all have the exact same format (processed by EasyQC beforehand). Does anyone know what causes this error and how to fix it?

Thanks in advance

welchr commented 4 years ago

Seems like that error message occurs when either:

  1. The header row has changed (seems unlikely?)
  2. The number of markers processed in the initial analysis pass does not match the number of markers re-processed during the second scan through the file for calculating heterogeneity statistics

Variants that are filtered out for QC issues are a possible culprit, in particular variants with ambiguous or missing alleles. The initial processing step seems to make some effort to fix and/or guess the alleles in those cases, but the re-processing step might not (?)

If you know that you have a smaller set of cohorts that work, maybe you could try adding cohorts one at a time until you find a problematic one. Then see if that cohort has any warnings about bad variants/alleles/strands that might be fixable.
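The "add one cohort at a time" approach can be scripted. Below is a minimal sketch that writes one METAL command script per growing subset of cohorts, so each can be run in turn until the error first appears. The column names (SNP, A1, A2, BETA, SE), the output prefix, and the helper names are assumptions for illustration; adjust them to match the actual input files and settings.

```python
def build_metal_script(cohort_files, out_prefix):
    """Return the text of a METAL command script for the given cohort files."""
    lines = [
        "SCHEME STDERR",
        "MARKER SNP",      # assumed column name, adjust to the input files
        "ALLELE A1 A2",    # assumed column names
        "EFFECT BETA",     # assumed column name
        "STDERR SE",       # assumed column name
    ]
    lines += [f"PROCESS {path}" for path in cohort_files]
    lines += [
        f"OUTFILE {out_prefix} .tbl",
        "ANALYZE HETEROGENEITY",
        "QUIT",
    ]
    return "\n".join(lines) + "\n"


def incremental_scripts(cohort_files, start=2, out_stem="subset"):
    """Build one script per prefix of the cohort list, from `start` cohorts up to all."""
    scripts = []
    for count in range(start, len(cohort_files) + 1):
        subset = cohort_files[:count]
        scripts.append(build_metal_script(subset, f"{out_stem}_{count}_"))
    return scripts
```

Running the scripts in order, the first one that fails points at the cohort that was added last. Putting the cohorts already known to work at the front of the list and raising `start` avoids re-running combinations that are known to be fine.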

lindabroer-g commented 4 years ago

Thanks, I'll look into it and let you know if this fixes the problem.

lindabroer-g commented 4 years ago

Again, thanks for the tips. There was something weird going on with the marker positions in one of the cohorts: some positions were written like 1.2e+07. When I fixed this, everything worked fine. I do use TRACKPOSITIONS though, so I'm not sure why this didn't fail in the first processing step. In any case, the problem is fixed now.
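For anyone wanting to check their own cohort files for this, here is a minimal sketch that flags rows whose position is not a plain integer (for example 1.2e+07). The column name `POS` and the function name are assumptions; change them to match the file being checked.

```python
import re

# A well-formed position is a plain run of digits, e.g. "12000000".
INTEGER_POSITION = re.compile(r"\d+")


def find_bad_positions(lines, pos_column="POS", delimiter=None):
    """Return (line_number, value) pairs whose position is not a plain integer.

    `lines` is any iterable of text lines with the header first.
    `delimiter=None` splits on any whitespace; pass "\t" for strict tab splitting.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\r\n").split(delimiter)
    pos_index = header.index(pos_column)
    bad = []
    for line_number, line in enumerate(rows, start=2):
        fields = line.rstrip("\r\n").split(delimiter)
        value = fields[pos_index] if pos_index < len(fields) else ""
        if not INTEGER_POSITION.fullmatch(value):
            bad.append((line_number, value))
    return bad
```

For a gzipped input, something like `with gzip.open(path, "rt") as handle: find_bad_positions(handle)` should work, since the function only needs an iterable of text lines.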

welchr commented 4 years ago

Actually I think you found the problem right there. If TRACKPOSITIONS is ON, the initial processing step does some extra checking for bad chromosomes and/or positions:

https://github.com/statgen/METAL/blob/e2253cc3901df8403a331bd725d4d9fe1edfb19f/metal/Main.cpp#L1140-L1159

The continue statements there cause the rest of the loop body to be skipped, which means the count of processed markers isn't incremented.

However, the re-processing step for heterogeneity analysis does not do this same check. So it will try to analyze more markers than were originally processed.
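To make the mismatch concrete, here is a toy model of the behavior described above. It is not a translation of Main.cpp: the function names and the validity check are made up for illustration, and only the error message is taken from this thread.

```python
def is_valid_position(value):
    """Hypothetical stand-in for the position check: digits only."""
    return value.isdigit()


def first_pass(markers, track_positions):
    """Count markers the way the initial processing step is described above."""
    processed = 0
    for _name, position in markers:
        if track_positions and not is_valid_position(position):
            continue  # marker is skipped, so the counter is not incremented
        processed += 1
    return processed


def second_pass(markers):
    """Count markers during re-processing, which has no position check."""
    return sum(1 for _ in markers)


def run_with_heterogeneity(markers, track_positions):
    expected = first_pass(markers, track_positions)
    seen = second_pass(markers)
    if seen != expected:
        raise RuntimeError("Input file has changed since analysis started")
    return seen
```

With a marker list containing a position such as "1.2e+07", the model raises the error only when `track_positions` is true, which matches what was reported: the file itself never changed, but the two passes disagree on how many markers it contains.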

quattro commented 2 years ago

Is anyone working on this? It would be great to have this addressed w/o the user manually fixing data beforehand.

oalavijeh commented 1 year ago

I am also getting the same error when I use trackpositions and heterogeneity together. A fix for this would be most welcome!

yningvu commented 10 months ago

I think this issue is caused by the option GENOMICCONTROL ON. When this option is on, the files are modified in the first run, so when heterogeneity is checked in the second run, the files no longer match. I tested this by turning off the genomic control option, and it worked, so I guess that is the cause.

Sabor117 commented 5 months ago

@yningvu I suspect that the GENOMICCONTROL flag is potentially one step above TRACKPOSITIONS, so that if you disable GENOMICCONTROL you may also disable TRACKPOSITIONS?

I say this because I have run two versions of my meta-analysis with METAL recently. The first had GENOMICCONTROL ON and no TRACKPOSITIONS. (For posterity: this was run using the version of METAL you download from the METAL website, https://csg.sph.umich.edu/abecasis/Metal/download/, which is actually a version from 2011...) This version of the meta-analysis worked totally fine and produced heterogeneity values.

In my second run I added the TRACKPOSITIONS flag to the otherwise identical analysis, and that seems to have caused the issue mentioned in this thread, because the first pass removed a bunch of SNPs with discordant positions from the analysis.

I was mostly including TRACKPOSITIONS just so I could get chromosome + position columns in my output, so it would actually be good if this could be fixed. As it stands I may have to re-run with TRACKPOSITIONS off and manually merge the two outputs.
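For the manual-merge workaround, a minimal sketch follows. It assumes both runs wrote tab-delimited tables keyed by a `MarkerName` column, and that the TRACKPOSITIONS run has `Chromosome` and `Position` columns; these names and the helper functions are assumptions, so check them against the actual output headers.

```python
import csv


def read_table(path):
    """Read a tab-delimited table into a list of dicts, one per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def merge_positions(het_rows, pos_rows, key="MarkerName",
                    extra=("Chromosome", "Position"), missing="NA"):
    """Copy the `extra` columns from pos_rows onto het_rows, matched on `key`.

    Rows in het_rows with no match in pos_rows get `missing` in the extra columns.
    """
    lookup = {row[key]: row for row in pos_rows}
    merged = []
    for row in het_rows:
        source = lookup.get(row[key], {})
        combined = dict(row)
        for column in extra:
            combined[column] = source.get(column, missing)
        merged.append(combined)
    return merged
```

Markers that the TRACKPOSITIONS run dropped for discordant positions will have no match and end up with `NA`, which is also a quick way to see which SNPs were removed.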