hringbauer / ancIBD

Detecting IBD within low coverage ancient DNA data. Development Repository for software package that contains code for manuscript.
GNU General Public License v3.0
10 stars 3 forks source link

Problems when running example data #10

Closed Han-Shi-China closed 1 year ago

Han-Shi-China commented 1 year ago

Hi, recently I was running the ancIBD-0.5 using the example data (downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0), using python3.8.3 with the same parameters in the tutorial. All commands can run smoothly without errors, but the results I got is different from the results showed in the tutorial. And I still have not figure out what caused this difference. I will really appreciate it if you can help!!

I checked the output when prepare the hdf5 files, and it is the same from what is in the tutorial. But results start to differ when calling the IBD, and it seems that the IBD that I got is much less.

Chromosome 1; Loaded 18 IBD Chromosome 2; Loaded 13 IBD Chromosome 3; Loaded 11 IBD Chromosome 4; Loaded 9 IBD Chromosome 5; Loaded 9 IBD Chromosome 6; Loaded 11 IBD Chromosome 7; Loaded 12 IBD Chromosome 8; Loaded 8 IBD Chromosome 9; Loaded 10 IBD Chromosome 10; Loaded 10 IBD Chromosome 11; Loaded 10 IBD Chromosome 12; Loaded 6 IBD Chromosome 13; Loaded 12 IBD Chromosome 14; Loaded 6 IBD Chromosome 15; Loaded 6 IBD Chromosome 16; Loaded 8 IBD Chromosome 17; Loaded 9 IBD Chromosome 18; Loaded 6 IBD Chromosome 19; Loaded 11 IBD Chromosome 20; Loaded 10 IBD Chromosome 21; Loaded 9 IBD Chromosome 22; Loaded 10 IBD Saved 214 IBD to /home/han_shi/script/ancibd/output/ibd_hazelton2/ch_all.tsv.

When summarizing the IBD between pairs, some of the results yield IBD but differ in sum and number, but some pairs(I12438 & I12896, I12440& I12896...), which yield some IBD in the tutorials, did not yield one IBD here..

iid1 iid2 max_IBD sum_IBD>8 n_IBD>8 sum_IBD>12 n_IBD>12 sum_IBD>16 n_IBD>16 sum_IBD>20 n_IBD>20
I12440 I30300 268.7866928288713 3316.770885130245 22.0 3316.770885130245 22.0 3316.770885130245 22.0 3316.770885130245 22.0
I12438 I30300 283.65220259875065 3307.30799218436 21.0 3307.30799218436 21.0 3307.30799218436 21.0 3307.30799218436 21.0
I12440 I12438 176.73970285686664 1657.2198443172965 22.0 1647.3969392536674 21.0 1647.3969392536674 21.0 1647.3969392536674 21.0
I12439 I12440 153.30770015716553 1626.9729871419258 21.0 1604.7398943570442 19.0 1604.7398943570442 19.0 1604.7398943570442 19.0
I12439 I30300 98.25259447097778 919.9655823060311 13.0 919.9655823060311 13.0 919.9655823060311 13.0 919.9655823060311 13.0
I12439 I12438 93.37389469146729 475.9611673653126 10.0 459.80747416615486 8.0 459.80747416615486 8.0 459.80747416615486 8.0
I21390 I30300 13.226699829101562 13.226699829101562 1.0 13.226699829101562 1.0 0.0 0.0 0.0 0.0
I12896 I30300 12.573707103729248 12.573707103729248 1.0 12.573707103729248 1.0 0.0 0.0 0.0 0.0
I12439 I21390 11.11750602722168 11.11750602722168 1.0 0.0 0.0 0.0 0.0 0.0 0.0
I12439 I12896 8.414804935455322 8.414804935455322 1.0 0.0 0.0 0.0 0.0 0.0 0.0
I12438 I12896 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12438 I21390 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12440 I12896 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12440 I21390 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12896 I21390 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Could you please give me some advice about what might caused this problem? Many thanks!!

hringbauer commented 1 year ago

Thank you for reaching out. And congrats on apparently successfully running ancIBD.

Actually, your IBD calls look better. I found that IID I12896 in the current tutorial has erroneously too much missing data (by accident a PMD-filtered VCF was used) - and produces lots of short FP IBD segments - that you do not have. I will update the Vignette very soon to rectify that error.

Could you let me know if you started from the current example VCF dataset? I just downloaded the data from the Dropbox link, installed ancIBD v0.5 on a fresh machine, and reproduced the original results (exactly!).

A quick sanity check would be running ancIBD manually on a pair of chromosomes:

Pasted image

The Filtering to 0.99 GP variants: 0.125x tells you that only a few variants have high imputation quality. Is that the same for you?

Han-Shi-China commented 1 year ago

Really thanks for your quick response!

I tried to run ancIBD on chromosome 8 between I12440 and I12896, and got the same results as you did.

However, I still failed reproducing the results in the tutorial. So I really want to share more detailed information with you...And sorry that it will be very long, I really appreciate it if you are willing to help me check on this. Thanks!!

Figure and scripts for the quick check on chromosome 8 (consistent):

chrom8 chrom8_script

Detailed steps for running the example data

1) Preparation

Software version:

ancibd_version

Guidelines: https://ancibd.readthedocs.io/en/latest/Intro.html

Data: downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0 detailed information:

files1 files2

2) VCF->HDF5

Script:

vcf_hdf5

Correct: May be the path of v51.1_1240k.snp in the tutorial should be changed into : map_path = f"./data/maps/v51.1_1240k.snp"

Output HDF5 files:

hdf5

Output information: results_vcf_hdf5.txt

Notice: In output information, there is an error:

Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)

but the transformation can still go smoothly, and the other output information is the same as that in the tutorial. I checked that on GitHub and found that some people may also have the same issue while running other software smoothly. So I am wondering maybe this may not be the major cause for the different results?.... https://github.com/LinkageIO/Camoco/issues/98

3) Calling IBD

Script:

call_ibd

Output files: ch_all.txt ibd_ind.d220.txt

Output information: is the same as I previously mentioned, which is different from the tutorial...:

ibd_results1

I tried to upload the compressed files including all the files or the hdf5 files, yet the file size is too big(both are >25MB). And if there are something that I forget to mention, please contact me. Really thanks for your kind help!

hringbauer commented 1 year ago

Thank you for your detailed report!

Your Chr. 8 example shows that your pre-processing works, you got exactly the same hdf5 file as me.

I could now figure out what caused your different output - it was a look-over on our side:

We updated the code in the Vignette notebook, but not the output cells! Using the same input as you I can fully replicate your output. The difference was that you use

l_model='h5', e_model='haploid_gl2'

which is the most up-to-date module.

I now further updated the Vignette, adding the latest and recommended parameterp_col='variants/RAF'.

You can find this latest notebook (and matching output) here.

You should now be able to run it and exactly replicate its output!

Han-Shi-China commented 1 year ago

Thanks for your careful check, I can replicate the output this time!!

Also, many thanks for your excellent work in ancIBD and hapROH, which really enable us to explore more about the family pedigrees and social structure. Thanks!

hringbauer commented 1 year ago

Thanks a lot, you identified a bugged vignette file :)

We would add you to the Acknowledgments section. Let us know if that is okay and if so your preferred (user) name (in a PM or per email).

Happy IBD and ROH hunting!

Han-Shi-China commented 1 year ago

Hi, I think it would be a great honor for me to be in the Acknowledgments section! My name is Han Shi, and my email is shihan@ivpp.ac.cn. Thanks again for all your work and help~

Han-Shi-China commented 1 year ago

Hi, can I use this space to ask one more question about the reference panel files you use when doing the imputation and phasing?

I noticed that in the SI of your ancIBD paper, you said "We imputed all autosomal bi-allelic SNPs 1000 Genomes Phase 3 release using GLIMPSE using its default parameters". So I am wondering are you using the files downloaded from the following link? Take the chr1 as an example :http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

Since in the tutorials of Glimpse, the reference panel files are in the GRCh38/hg38 genome assembly, so I am not really sure about the files that should be used to impute and phase data in hg19 genome assembly. Thanks so much!

hringbauer commented 1 year ago

Yes - we used the files from here (that you refer to) as reference for phasing and imputation, that is: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

Most of human aDNA is aligned to hg19/grch37 - so please also that matching reference panel!

Han-Shi-China commented 1 year ago

Yeah, I got it! Really thanks for your guidance :)