szhan closed this issue 1 year ago
Splitting individuals into reference panel and target cohort should probably be done in a separate notebook, so as to avoid processing the original unified genealogies again.
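For illustration, the split itself could look roughly like the sketch below. This is only a sketch under assumptions: the function name `split_individuals`, the random assignment, and the fixed seed are hypothetical and not taken from the notebooks. The idea is that the two ID lists are computed once and reused downstream, so the original unified genealogies do not need to be processed again.

```python
import random


def split_individuals(individual_ids, num_target, seed=None):
    """Randomly split individual IDs into a reference panel and a target cohort.

    Hypothetical helper: the random assignment and the seed are assumptions
    made for illustration only.
    """
    ids = list(individual_ids)
    if not 0 <= num_target <= len(ids):
        raise ValueError("num_target must be between 0 and the number of individuals")
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(ids)
    target_cohort = sorted(ids[:num_target])
    ref_panel = sorted(ids[num_target:])
    return ref_panel, target_cohort
```

Writing the two lists to disk after this step would let the later notebooks start from them directly.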
Also, running `BEAGLE` should be done in a separate bash script. No need for a Jupyter notebook.
So, the notebook collection should be:
`tskit.lshmm`.

There should be a separate notebook to compare the imputed genotypes from `BEAGLE` and `tskit.lshmm` with the true genotypes.
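As a rough idea of what the comparison could compute, here is a minimal sketch of a per-allele concordance. The function name `genotype_concordance`, the flat-list input, and the use of `-1` for missing calls are assumptions for illustration, not what the notebook actually does.

```python
def genotype_concordance(true_alleles, imputed_alleles, missing=-1):
    """Fraction of non-missing allele calls where imputed equals true.

    Hypothetical helper: both inputs are flat sequences of integer allele
    codes in the same order; `missing` marks calls to skip (an assumption).
    """
    if len(true_alleles) != len(imputed_alleles):
        raise ValueError("true and imputed alleles must have the same length")
    compared = 0
    matched = 0
    for true_allele, imputed_allele in zip(true_alleles, imputed_alleles):
        if true_allele == missing or imputed_allele == missing:
            continue  # skip calls that cannot be compared
        compared += 1
        if true_allele == imputed_allele:
            matched += 1
    # Return NaN when nothing could be compared, to avoid dividing by zero.
    return matched / compared if compared else float("nan")
```

The same function could be applied once to the `BEAGLE` output and once to the `tskit.lshmm` output, so that both are scored against the same true genotypes.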
So, there are five notebooks in total:

1. `prepare_dataset_1.ipynb`. Download unified genealogies and filter down to high-coverage individuals.
2. `prepare_dataset_2.ipynb`. Split the individuals into reference panel and target cohort.
3. `prepare_dataset_3.ipynb`. Prepare VCF files for imputation.
4. `run_lshmm.ipynb`. Imputation using `tskit.lshmm`.
5. `compare_lshmm_beagle.ipynb`. Compare the imputed and true genotypes.

The notebooks `prepare_dataset_*.ipynb` are complete, so this issue is done for now. Completing the other two notebooks will address #1.
I think it is useful to add a separate notebook (`prepare_dataset_4.ipynb`) just for making compatible genotypes from VCF files in `sgkit` for easier downstream analysis.
Moved from https://github.com/szhan/tsimpute/issues/93
Right now, this one notebook does the following:
`BEAGLE`.
`tskit.lshmm`.

It is easier to divide them up into the following stages, one per notebook: