Group file sorted differently in SMMAT vs. SMMAT.meta

g3png commented 2 years ago

Dear Han,

We are running a large meta-analysis and have collected intermediate files from several cohorts. However, I realised that SMMAT.meta fails at the following check, specifically at multiallelic sites, even though all cohorts used the same group file.

if(any(sort(tmp.scores$idx)!=tmp.scores$idx)) {
    cat("In some", meta.files.prefix[i], "score files, the order of group and variants is not the same as in the group-sorted group.file.\n")
    stop("Error: meta files possibly not generated using this group.file!")
}

An example of where this fails (for a single cohort) is:

   group chr      pos ref alt    N missrate      altfreq      SCORE       VAR      PVAL idx           file
1:  A1BG  19 58409184   C   T 1586        0 0.0022068096  0.1737194 6.9899839 0.9476113 823 prefix.score.1
2:  A1BG  19 58409184   C   G 1586        0 0.0003152585 -0.6138912 0.9923992 0.5377377 822 prefix.score.1

In this case index 823 comes before 822 which causes the error. I am guessing this is because SMMAT did not initially order variants according to ALT alleles at multiallelic sites.
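
The failure can be reproduced on the two rows alone. The sketch below is only an illustration: it assumes `idx` is the row position of each variant in the sorted group file, and `scores`, `out_of_order` and `bad_rows` are names made up for this example rather than objects from the package.

```r
# Two score-file rows copied from the example above (relevant columns only).
scores <- data.frame(
  group = c("A1BG", "A1BG"),
  chr   = c("19", "19"),
  pos   = c(58409184L, 58409184L),
  ref   = c("C", "C"),
  alt   = c("T", "G"),
  idx   = c(823L, 822L),   # assumed: row position in the sorted group file
  stringsAsFactors = FALSE
)

# Same comparison as the check quoted above: TRUE means the check would stop.
out_of_order <- any(sort(scores$idx) != scores$idx)
print(out_of_order)

# Rows whose index is smaller than the index of the row before them.
bad_rows <- which(diff(scores$idx) < 0) + 1L
print(scores[bad_rows, c("group", "chr", "pos", "ref", "alt", "idx")])
```

Running the same two lines over a full score file would list every site where the file order and the group file order disagree.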

Is there any way around this?

Edit: I have just read about the issue here regarding SMMAT being designed for biallelics. Would love to know what you think anyway, and if there are (near) future plans to include multiallelic variants.

Thanks for your help in advance,

Grace

hanchenphd commented 2 years ago

Hi Grace,

Thank you for your interest in SMMAT! I have not seen this issue before, but I guess the problem was probably because this tri-allelic marker was ordered differently in the GDS file and the group definition file. In SMMAT (which uses the GDS file to generate meta-analysis files), the variants are sorted based on the variant.id. In SMMAT.meta, since we assume no access to individual GDS files, we could only sort them based on chr and pos. For tri-allelic markers with the same chr and pos, it is possible that the order is different in the GDS files (not necessarily alphabetical).

If that was the case, the easiest solution would be to use a group definition file with variants in the same order as your GDS files. For example, if your C/G is before C/T in your group definition file, but C/T is before C/G in the GDS files, you might be able to fix the problem by switching C/G and C/T in your group definition file, without having to ask each cohort to rerun. Please let me know if it does not work.
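
A minimal sketch of that reordering, assuming the group definition file has the six columns group, chr, pos, ref, alt and weight, and that the order seen in one cohort's score file reflects its GDS order. The data and the helper `variant_key` are hypothetical and are not part of GMMAT.

```r
# Hypothetical group definition rows; C/G is listed before C/T.
group_def <- data.frame(
  group  = c("A1BG", "A1BG", "A1BG"),
  chr    = c("19", "19", "19"),
  pos    = c(58409100L, 58409184L, 58409184L),
  ref    = c("A", "C", "C"),
  alt    = c("G", "G", "T"),
  weight = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical order observed in a cohort's score file; C/T is before C/G.
score_order <- data.frame(
  chr = c("19", "19", "19"),
  pos = c(58409100L, 58409184L, 58409184L),
  ref = c("A", "C", "C"),
  alt = c("G", "T", "G"),
  stringsAsFactors = FALSE
)

# Build a chr:pos:ref:alt key so that variants can be matched by identity.
variant_key <- function(d) paste(d$chr, d$pos, d$ref, d$alt, sep = ":")

# Rank of each group-file variant in the score file (NA if absent; NAs sort last).
rank_in_scores <- match(variant_key(group_def), variant_key(score_order))

# Sort by group, chr and pos, and break ties at shared positions by that rank.
fixed <- group_def[order(group_def$group, group_def$chr, group_def$pos,
                         rank_in_scores), ]
print(fixed)
```

This only helps when every cohort stores the multiallelic site in the same order, since a single group file can follow only one ordering.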

Best, Han

g3png commented 2 years ago

Thanks Han for the quick reply!

I expect it will be complicated if different cohorts have multiallelic variants ordered differently in their GDS files… but so far we only see this issue with one cohort. I will go with your suggestion and update you on how it goes.

Best wishes, Grace

anh151 commented 4 months ago

Hello. I had the same issue when trying to combine 2 cohorts. I tried everything to get things in the right order and I feel like there is a bug in SMMAT.meta when it attempts to sort the groups. If we're assuming the order is set by the GDS, why would SMMAT sort alphabetically?

I tried alphabetical order. I tried combining the variant positions, writing out a GDS file and then using that order. I tried running a fake dataset and using the resulting score file. None of them worked. I ended up just dropping multiallelic positions.

Thanks, Andrew

hanchenphd commented 2 months ago

Hi Andrew,

Have you tried fixing the order in your group definition file (instead of the order in GDS) as I suggested above? If you could send me a small reproducible example, I am happy to take a look.

Thanks, Han

youngchanpark commented 2 weeks ago

Hi Han,

I haven't tried your suggestion on changing the order of variants in the group file, but wouldn't there also be a possibility that the issue may be fixed for one cohort, but the same error occurs on another cohort?

I'm still trying to wrap my head around how the analysis is performed in the SMMAT.meta function, so this might be a stupid question, but is there a fundamental reason why the variants need to be strictly ordered to perform the analysis?

I understand we need to "know" the order of variants because we're dealing with score files and covariance matrices across multiple cohorts that may have different sets of variants. But when it comes to collecting the variants from the score files and covariance matrices for each group to run the meta-analysis, why do the files from different cohorts need to conform to the order of variants in the group file?

Best wishes, YC

hanchenphd commented 2 weeks ago

Hi YC,

That's a very good question. SMMAT.meta does not assume access to individual GDS files, so variants in the score and covariance files need to be sorted in some way. If variants are not sorted, the only way to go back in the score file (which is a plain text file) is to close and reopen it, and it could be even more complicated if tri-allelic variants happen to be chunked into different score files. Sometimes tri-allelic variants are sorted differently in different studies (maybe during the VCF -> GDS conversion), and this is a tricky situation to harmonize across studies.
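
To illustrate why chr and pos alone cannot settle the order at a tri-allelic site, here is a small sketch on made-up data. It relies only on base R's `order()`, which leaves unresolved ties in their original order, and it is not taken from the SMMAT.meta code.

```r
# The same two alleles at one position, listed in two different input orders.
a <- data.frame(chr = 19L,
                pos = c(58409184L, 58409184L),
                alt = c("G", "T"),
                stringsAsFactors = FALSE)
b <- a[2:1, ]

# Sorting on chr and pos leaves the tie unresolved, so the input order survives.
sa <- a[order(a$chr, a$pos), ]
sb <- b[order(b$chr, b$pos), ]
print(identical(sa$alt, sb$alt))

# Adding alt as a tie-breaker makes the result independent of the input order.
sa2 <- a[order(a$chr, a$pos, a$alt), ]
sb2 <- b[order(b$chr, b$pos, b$alt), ]
print(identical(sa2$alt, sb2$alt))
```

The tie-breaker makes the ordering reproducible, but it would only help here if each cohort's score files had been written in that same order.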

Best, Han

youngchanpark commented 6 days ago

Hi @hanchenphd,

I have thoroughly gone through the SMMAT and SMMAT.meta code and I think I now can confidently say this issue is fixable without needing to know the order of variants in the individual cohort GDS files.

Both the single-cohort analysis and the meta-analysis are performed per group. When running the meta-analysis, before any calculations are performed, the score files and covariance matrix files from all cohorts are read and used to build one large score vector (U) and covariance matrix (V). The score of each variant is combined into U, and the covariance of each pair of variants is combined into V. The meta-analysis test statistics are then computed from these two objects.
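
A toy sketch of that combining step, assuming that "combined" means an element-wise sum over cohorts at the positions given by `idx`, as is usual for score-based meta-analysis. All numbers are invented, and the code shows the idea only, not the SMMAT.meta implementation.

```r
# Three variants in the group; U and V start at zero.
n_variants <- 3L
U <- numeric(n_variants)
V <- matrix(0, n_variants, n_variants)

# Each hypothetical cohort provides group-file positions (idx), scores, and a
# covariance matrix whose rows and columns follow the order of its own scores.
cohorts <- list(
  list(idx   = c(1L, 2L, 3L),
       score = c(0.4, -0.6, 0.2),
       cov   = matrix(c(1.0, 0.1, 0.0,
                        0.1, 2.0, 0.3,
                        0.0, 0.3, 1.5), 3, 3)),
  list(idx   = c(2L, 3L),            # this cohort lacks the first variant
       score = c(0.1, 0.5),
       cov   = matrix(c(0.8, 0.2,
                        0.2, 1.1), 2, 2))
)

# Add each cohort's contribution at the positions named by its idx.
for (ch in cohorts) {
  U[ch$idx] <- U[ch$idx] + ch$score
  V[ch$idx, ch$idx] <- V[ch$idx, ch$idx] + ch$cov
}

print(U)
print(V)
```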

The current code relies heavily on indexing to match variants across the group file and the per-cohort summary statistics and covariance matrices. Because alignment depends on these indices, it appears you added the check that the variants in each score file follow the order of variants in the group file.

When running SMMAT for the single-cohort analysis, the per-group results (variant scores and covariance matrix) are appended to the output .score and .var files. Since they are appended, and the order of variants in the covariance matrix follows the order of variants in the corresponding group of the score file, we know which pair of variants each cell of the covariance matrix refers to.

If we change the code to match on variant ID instead of relying on indices, I believe we can make the analysis work without needing to have access to the individual GDS files to confirm the order of variants. I’m not yet 100% certain, but even now with the current indexing-based code, I think it’s okay to remove that check as well.
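
One way to picture matching on variant ID is sketched below. It assumes a chr:pos:ref:alt key identifies a variant uniquely within the group file. The helpers `variant_key` and `align_to_group` are hypothetical, the two scores and variances come from the example at the top of this thread, and the off-diagonal covariance is made up.

```r
# Key that identifies a variant independently of row order.
variant_key <- function(d) paste(d$chr, d$pos, d$ref, d$alt, sep = ":")

# Group file order: C/G before C/T.
group_def <- data.frame(chr = "19", pos = c(58409184L, 58409184L),
                        ref = "C", alt = c("G", "T"),
                        stringsAsFactors = FALSE)

# Cohort score rows in the order reported above: C/T before C/G.
cohort <- data.frame(chr = "19", pos = c(58409184L, 58409184L),
                     ref = "C", alt = c("T", "G"),
                     SCORE = c(0.1737194, -0.6138912),
                     stringsAsFactors = FALSE)
# Rows and columns follow the cohort's score order; 0.05 is a made-up covariance.
cohort_cov <- matrix(c(6.9899839, 0.05,
                       0.05, 0.9923992), 2, 2)

# Place a cohort's scores and covariances at their group-file positions by key.
align_to_group <- function(group_def, scores, cov) {
  pos_in_group <- match(variant_key(scores), variant_key(group_def))
  if (anyNA(pos_in_group)) stop("score file has variants missing from the group file")
  U <- numeric(nrow(group_def))
  V <- matrix(0, nrow(group_def), nrow(group_def))
  U[pos_in_group] <- scores$SCORE
  V[pos_in_group, pos_in_group] <- cov
  list(U = U, V = V)
}

aligned <- align_to_group(group_def, cohort, cohort_cov)

# Permuting the score rows together with the covariance rows and columns
# gives the same aligned result.
perm <- c(2L, 1L)
aligned_perm <- align_to_group(group_def, cohort[perm, ], cohort_cov[perm, perm])
print(isTRUE(all.equal(aligned, aligned_perm)))
```

In this toy setting the within-cohort row order does not matter, provided the covariance matrix is always kept in the same order as the scores it accompanies.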

I hope my explanation made sense 😅. I wanted to share my thoughts to confirm with you whether I’m correct.

I am going to work on adapting the code to match on variant ID instead of indexing so I can run my meta-analysis, because I cannot ask our collaborators to rerun all of their analyses😅. I'll let you know how this goes.

Best wishes, YC

hanchenphd / GMMAT

Group file sorted differently in SMMAT vs. SMMAT.meta #46