opain / GenoPred

Genotype-based Prediction (GenoPred)
https://opain.github.io/GenoPred/
GNU General Public License v3.0

Error: Less than 70% of reference variants are present in the target #91

Closed chleo28 closed 3 weeks ago

chleo28 commented 5 months ago

Dear all, I'm testing the GenoPred pipeline on my own target file, obtained using plink1.

I used the GRCh37 reference genome build to produce the plink1 files; however, the pipeline stops because the percentage of variants matching GRCh37 is too low:

Error: Less than 70% of reference variants are present in the target
Execution halted

GRCh36 match: 0.01%
GRCh37 match: 26.25%
GRCh38 match: 0.02%

I double-checked some of the variants in target.chr3.bim and in GenoPred/pipeline/resources/data/ref/ref.chr3.pvar, and they look fine to me (taking into account that plink1 bim files have A1 and A2 swapped compared to the plink2 pvar format).

Where do you think the problem is, and how can I solve it?

Here are the first lines of target.chr3.bim:
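As a rough illustration of that kind of spot check (a hypothetical helper, not part of GenoPred), a .bim row and a .pvar row can be compared on chromosome, position, and alleles regardless of allele order. It assumes the standard column layouts: .bim is chromosome, ID, cM, position, A1, A2, and .pvar is #CHROM, POS, ID, REF, ALT. It does not handle strand flips.

```python
def same_variant(bim_line, pvar_line):
    """Return True if both rows share chromosome, position and the same
    pair of alleles, in either order.

    Assumes plink1 .bim columns (chr, id, cM, pos, A1, A2) and
    plink2 .pvar columns (#CHROM, POS, ID, REF, ALT).
    """
    b_chr, _b_id, _cm, b_pos, a1, a2 = bim_line.split()[:6]
    p_chr, p_pos, _p_id, ref, alt = pvar_line.split()[:5]
    same_site = (b_chr, b_pos) == (p_chr, p_pos)
    same_alleles = {a1, a2} == {ref, alt}  # order-insensitive comparison
    return same_site and same_alleles


# First row of target.chr3.bim from this post.
bim_row = "3 rs567712286 0 60079 G A"
# Hypothetical .pvar row written for illustration only,
# not copied from the real reference file.
pvar_row = "3\t60079\trs567712286\tA\tG"
print(same_variant(bim_row, pvar_row))
```

A check like this only says whether individual variants agree; it says nothing about how many of the reference variants are present overall.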

3 rs567712286 0 60079 G A
3 rs141398405 0 60363 G A
3 rs150273482 0 60505 G A
3 rs186057894 0 60597 G A
3 rs149416290 0 60661 G A
3 rs138423672 0 60816 G C
3 rs116650846 0 60907 C A
3 rs140713334 0 61023 T G
3 rs9756992 0 61113 T A
3 rs141767036 0 61176 G C

Do you need any other files for the check?

Many thanks for your help, Chiara

opain commented 5 months ago

Hello Chiara,

Thanks for reaching out. This error means your target data contains fewer than 70% of the variants in the reference data (HapMap3 variants). Given that these variants are typically well imputed, this error usually indicates you are using unimputed data. GenoPred requires the target genetic data to be imputed already (except for 23andMe format).

The part of the log file regarding genome builds is not unusual. It means 26.25% of the variants in your target data matched the reference data when using build GRCh37. Typically there are many variants in the target data that are not in the reference data, since the reference is restricted to HapMap3 variants.
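To make the difference between the two percentages concrete, here is a minimal Python sketch. It is not GenoPred's actual code: the function names are hypothetical, variants are matched on chromosome and position only (alleles are ignored), and the standard .bim and .pvar column layouts are assumed.

```python
def read_positions(lines, chr_col, pos_col):
    """Collect (chromosome, position) pairs from whitespace-delimited lines.

    Lines starting with '#' (e.g. the .pvar header) and blank lines are skipped.
    """
    positions = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        positions.add((fields[chr_col], int(fields[pos_col])))
    return positions


def overlap_summary(target_positions, ref_positions):
    """Return both overlap fractions between target and reference."""
    shared = target_positions & ref_positions
    return {
        # Share of target variants that are also in the reference
        # (the kind of figure reported per genome build).
        "target_in_ref": len(shared) / len(target_positions) if target_positions else 0.0,
        # Share of reference variants found in the target
        # (the kind of figure behind the 70% error).
        "ref_in_target": len(shared) / len(ref_positions) if ref_positions else 0.0,
    }


def summarise_files(bim_path, pvar_path):
    """Convenience wrapper: .bim has chr in column 0 and pos in column 3,
    .pvar has chr in column 0 and pos in column 1."""
    with open(bim_path) as bim, open(pvar_path) as pvar:
        target = read_positions(bim, chr_col=0, pos_col=3)
        ref = read_positions(pvar, chr_col=0, pos_col=1)
    return overlap_summary(target, ref)
```

Calling `summarise_files("target.chr3.bim", "ref.chr3.pvar")` would give both fractions for one chromosome. A low `target_in_ref` is expected for imputed data, whereas a low `ref_in_target` means reference variants are missing from the target.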

Has your data been imputed already? If so, it might be worth understanding why so many HapMap3 variants are missing. If the HapMap3 variants can't be imputed for some reason, an alternative reference, subset to different SNPs, could be created.

Please send the full '.format_target.log' file.

Happy to discuss further.

Ollie

chleo28 commented 5 months ago

Dear Oliver, Thanks for your prompt answer!

The target data I'm using was previously imputed using the 1KG Phase 3 data. I misunderstood and thought that the 1KG data were also used as the reference in the GenoPred pipeline.

I can try running the pipeline with the same reference that was used for the imputation, and then I'll let you know whether the issue is solved.

In the meantime, here are the format_target log files. Best, Chiara

solo_test.ref.chr3.format_target.log format_target_i-solo_test-3.log

opain commented 5 months ago

By default, the reference for the GenoPred pipeline is the 1KG+HGDP individuals, but restricted to HapMap3 variants (list of typically well imputed variants).

Thanks for sending the full logs.

If the target data has already been imputed, the HapMap3 variants must have been filtered out for some reason (possibly due to low imputation quality). This could happen if the original genotype data was too sparse to impute them properly. This would be quite unusual though.

Feel free to pass on any other information about how the original genotype data was collected and imputed, and about any post-imputation QC, and I can try to help.