Fail to detect any IBD in data

pguarinovignon commented 1 year ago

Hi, I've been trying to run ancIBD on data imputed with GLIMPSE and the 1000G panel. I have the full VCF per chromosome, and the conversion and filtering to 1240k SNPs ran smoothly, calculating the AF directly from the data, with the command:

for ch in range(1,23):
    vcf_to_1240K_hdf(in_vcf_path = f"./data.chr{ch}.vcf.gz",
                     path_vcf = f"./ancIB_data/data.1240.chr{ch}.vcf.gz",
                     path_h5 = f"./ancIB_data/data.1240.chr{ch}.h5",
                     marker_path = f"./ancIBA_filters/snps_bcftools_ch{ch}.csv",
                     map_path = f"./ancIB_data/v51.1_1240k.snp",
                     #af_path = f"./ancIB_afs/v51.1_1240k_AF_ch{ch}.tsv",
                     col_sample_af = "AF_ALL",
                     buffer_size=20000, chunk_width=8, chunk_length=20000,
                     ch=ch)

But when I then run hapBLOCK_chroms with the next command, it fails to detect any IBD in any pair among my 90 samples:

for ch in range(1,23):
    df_ibd = hapBLOCK_chroms(folder_in='./ancIB_data/data.1240k.chr',
                             iids=iids, run_iids=[],
                             ch=ch, folder_out='./ancIB_data/ibd/',
                             output=False, prefix_out='', logfile=True,
                             l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
                             ibd_in=1, ibd_out=10, ibd_jump=400,
                             min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                             processes=1)

When I run the code with the example data I obtain the exact same result as in the tutorial.

When I use plink or KING on my imputed dataset I do detect IBD (and I know for sure that some individuals are 1st-degree relatives). Can you see where the problem is?

Here is an example of the filtered VCF produced by the first command:

21      10205629        rs140777501     A       G       100     PASS    NS=2504;AA=A|||;VT=SNP;DP=21926;RAF=0.166534;AF=0.166534;INFO=1;EAS_AF=0.2173;AMR_AF=0.1988;AFR_AF=0.1135;EUR_AF=0.2177;SAS_AF=0.1104;AN=54;AC=2        GT:DS:GP        0|0:0:1,0,0     ./.:.:. ./.:.:. 0|0:0:1,0,0     ./.:.:. 0|0:0:1,0,0     0|0:0.001:0.999,0.001,0 ./.:.:. ./.:.:. ./.:.:. ./.:
21      14601415        rs2775537       G       A       100     PASS    NS=2504;AA=g|||;VT=SNP;DP=17928;RAF=0.550519;AF=0.550519;INFO=1;EAS_AF=0.376;AMR_AF=0.4251;AFR_AF=0.8169;EUR_AF=0.4722;SAS_AF=0.5399;AN=30;AC=3 GT:DS:GP        ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 1|1:1.998:0,0.001,0.999 ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:
21      14652908        rs3869758       T       C       100     PASS    NS=2504;AA=.|||;VT=SNP;DP=23683;RAF=0.0896565;AF=0.0896565;INFO=1;EAS_AF=0;AMR_AF=0.0648;AFR_AF=0.2814;EUR_AF=0.0239;SAS_AF=0.0082;AN=176;AC=0  GT:DS:GP        0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0

hringbauer commented 1 year ago

Hi @pguarinovignon! So based on your VCF you seem to have some missing data in there - SNPs where there is no imputed genotype information. E.g. for the first SNP: 0|0:0:1,0,0 ./.:.:.

The second individual has missing data. ancIBD cannot handle those currently, as it assumes every SNP has been imputed.
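A quick way to check a VCF for this before running ancIBD is to count genotype calls whose GT field contains ".". The following is only a minimal sketch using the Python standard library, not part of ancIBD; the function names are hypothetical, and the two example records are shortened versions of the lines posted above (first four samples only, INFO column replaced by "." for brevity).

```python
import gzip


def count_missing_genotypes(vcf_lines):
    """Return (n_missing, n_total) genotype calls over all VCF data lines.

    A call counts as missing when its GT sub-field contains '.',
    e.g. './.' or '.|.'.
    """
    n_missing = 0
    n_total = 0
    for line in vcf_lines:
        # Skip blank lines and header lines ('##...' and '#CHROM...').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 9 (index 8) is FORMAT; find where GT sits in it.
        gt_idx = fields[8].split(":").index("GT")
        for sample in fields[9:]:
            gt = sample.split(":")[gt_idx]
            n_total += 1
            if "." in gt:
                n_missing += 1
    return n_missing, n_total


def count_missing_in_file(path):
    """Same count, reading a plain or gzip/bgzip-compressed VCF from disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_missing_genotypes(handle)


# Shortened records based on the lines in this thread (assumption: 4 samples).
example = [
    "##fileformat=VCFv4.2",
    "21\t10205629\trs140777501\tA\tG\t100\tPASS\t.\tGT:DS:GP"
    "\t0|0:0:1,0,0\t./.:.:.\t./.:.:.\t0|0:0:1,0,0",
    "21\t14652908\trs3869758\tT\tC\t100\tPASS\t.\tGT:DS:GP"
    "\t0|0:0:1,0,0\t0|0:0:1,0,0\t0|0:0:1,0,0\t0|0:0:1,0,0",
]

n_missing, n_total = count_missing_genotypes(example)
print(f"{n_missing} of {n_total} genotype calls are missing")
```

Any non-zero missing count in the 1240k-filtered VCF would point to the situation described here.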

You can try the direct run on two IIDs (that you know are related), see run_plot_pair (https://ancibd.readthedocs.io/en/latest/plot_IBD.html)

That should give output similar to this one:

Missing data would make the red line go away.

pguarinovignon commented 1 year ago

Thank you. Indeed, I had put a filter on my GLIMPSE output to keep only calls with GP > 0.99, hence the missing data. When using the direct output of GLIMPSE I had no missing data and it worked. I'll use this space to also ask: in the function "plot_pde_individual_from_ibd_df", is it possible to plot the theoretical distribution for aunt-nephew and siblings (I suppose the parent-offspring distribution is not relevant here, as it is in full IBD)?

hringbauer commented 1 year ago

Thank you, also for reporting back once it works!

The next version of ancIBD will be able to work with missing data (by just skipping it) - but using GP at all 1240k SNPs is always preferred - as ancIBD is calibrated for such data!

Congratulations on your IBD calls - and Parent Offspring are always a great sanity check - as indeed everything should be in IBD.

> in the function "plot_pde_individual_from_ibd_df" is it possible to plot the theoretical distribution for aunt-nephew and siblings (I suppose parents-offspring distribution is not relevant here as it is in full IBD) ?

Yes it is! You have to update the parameters accordingly, changing the following defaults:

comm_ancs=[4,4,2,2]
ms=[4,6,5,4]
labels=["First Cousins", "Second Cousins", "5 generations anc.", "4 generations anc."]

comm_ancs should be 4 in both cases (for siblings and aunt-nephew the first generation has 2x2 haplotypes). Siblings are separated by two meioses (so ms should be 2) and aunt-nephew by three (so ms should be 3). Update the labels, and maybe also the colors (cs), accordingly!
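Put together, the changed parameters for siblings and aunt-nephew might look like the sketch below. It takes ms to be the number of meioses separating the pair (two for siblings, three for aunt-nephew) and comm_ancs to be 4 for both. The colour values and the check_relationship_params helper are hypothetical additions for illustration, and the sketch does not call plot_pde_individual_from_ibd_df itself, since how that function accepts these lists is not shown in this thread.

```python
# One entry per relationship curve to plot.
comm_ancs = [4, 4]                      # 2x2 ancestral haplotypes in both cases
ms = [2, 3]                             # meioses: siblings = 2, aunt-nephew = 3
labels = ["Full siblings", "Aunt-nephew"]
cs = ["tab:blue", "tab:orange"]         # hypothetical matplotlib colour names


def check_relationship_params(comm_ancs, ms, labels, cs):
    """Check that the per-curve lists line up; return one tuple per curve.

    Hypothetical helper, not part of ancIBD.
    """
    if len({len(comm_ancs), len(ms), len(labels), len(cs)}) != 1:
        raise ValueError("comm_ancs, ms, labels and cs need the same length")
    if any(m < 1 for m in ms):
        raise ValueError("each entry of ms must be at least 1 meiosis")
    return list(zip(labels, comm_ancs, ms, cs))


for label, n_anc, n_meioses, colour in check_relationship_params(
    comm_ancs, ms, labels, cs
):
    print(f"{label}: comm_ancs={n_anc}, ms={n_meioses}, colour={colour}")
```

Keeping the four lists the same length matters because each position describes one curve; a mismatch would pair a label with the wrong relationship.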

hringbauer / ancIBD

Fail to detect any IBD in data #4