Split `prepare_dataset.ipynb` into separate notebooks

szhan / onekg_analysis

Evaluation of genotype imputation methods using the unified genealogy dataset

MIT License


Split `prepare_dataset.ipynb` into separate notebooks #4

Closed szhan closed 1 year ago

szhan commented 1 year ago

Moved from https://github.com/szhan/tsimpute/issues/93

Right now, this one notebook does the following:

1. Download unified genealogies.
2. Simplify the trees down to only high-coverage individuals.
3. Split the individuals into reference panel and target cohort (one set of trees per group).
4. Prepare data objects and files (VCFs and samples) for imputation.
5. Impute using BEAGLE.
6. Impute using tskit.lshmm.

It would be easier to divide these steps into the following stages, one per notebook:

Steps 1 to 3.
Step 4. This involves writing VCF files and making the samples compatible, but it should soon be accelerated using sgkit.
Step 5.
Step 6.

szhan commented 1 year ago

Splitting individuals into reference panel and target cohort should probably be done in a separate notebook, so as to avoid processing the original unified genealogies again.

Also, running BEAGLE should be done in a separate bash script. No need for a Jupyter notebook.

So, the notebook collection should be:

Steps 1 and 2. Download unified genealogies and filter down to high-coverage individuals.
Step 3. Split the individuals into ref. panel and target cohort.
Step 4. Prepare VCF files for imputation.
Step 6. Imputation using tskit.lshmm.
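
As a rough illustration of Step 3 above, here is a minimal sketch of splitting individuals into a reference panel and a target cohort. The issue does not say how the split is chosen, so the seeded random partition, the function name `split_individuals`, and the `n_target` parameter are all assumptions made for illustration, not the notebook's actual method.

```python
import random


def split_individuals(individual_ids, n_target, seed=0):
    """Randomly partition individual IDs into (reference panel, target cohort).

    Hypothetical helper: the random split and all names are assumptions.
    """
    ids = list(individual_ids)
    if not 0 < n_target < len(ids):
        raise ValueError("n_target must be between 1 and len(individual_ids) - 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    target = sorted(ids[:n_target])      # individuals whose genotypes get imputed
    ref_panel = sorted(ids[n_target:])   # individuals kept as the reference panel
    return ref_panel, target


# Example: 10 individuals, 3 of them held out as the target cohort.
ref_panel, target = split_individuals(range(10), n_target=3, seed=42)
```

Saving the two ID lists once and reusing them in later notebooks would avoid reprocessing the original unified genealogies each time the split is needed.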

szhan commented 1 year ago

There should be a separate notebook to compare the imputed genotypes from BEAGLE and tskit.lshmm with the true genotypes.

So, there are five notebooks in total:

prepare_dataset_1.ipynb. Download unified genealogies and filter down to high-coverage individuals.
prepare_dataset_2.ipynb. Split the individuals into ref. panel and target cohort.
prepare_dataset_3.ipynb. Prepare VCF files for imputation.
run_lshmm.ipynb. Imputation using tskit.lshmm.
compare_lshmm_beagle.ipynb. Compare the imputed and true genotypes.
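
The comparison in compare_lshmm_beagle.ipynb could look something like the sketch below. The issue does not name a metric, so simple genotype concordance (the fraction of allele calls that match the truth) is an assumption here, as are the function name `concordance` and the toy allele lists.

```python
def concordance(true_alleles, imputed_alleles):
    """Return the fraction of allele calls where imputed equals true.

    Hypothetical helper: inputs are flat, equal-length sequences of allele
    codes (e.g. 0 = ancestral, 1 = derived) at the imputed sites.
    """
    if len(true_alleles) != len(imputed_alleles):
        raise ValueError("true and imputed allele lists must have equal length")
    if len(true_alleles) == 0:
        raise ValueError("no allele calls to compare")
    matches = sum(t == i for t, i in zip(true_alleles, imputed_alleles))
    return matches / len(true_alleles)


# Toy data, made up purely for illustration (not real results).
true_alleles = [0, 1, 1, 0, 1]
imputed_by_method_a = [0, 1, 0, 0, 1]  # one mismatch
imputed_by_method_b = [0, 1, 1, 0, 1]  # no mismatches

score_a = concordance(true_alleles, imputed_by_method_a)
score_b = concordance(true_alleles, imputed_by_method_b)
```

Applying the same function to the BEAGLE and tskit.lshmm outputs would put both methods on the same scale against the true genotypes.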

szhan commented 1 year ago

The notebooks prepare_dataset_*.ipynb are complete, so this issue is done for now. Completing the other two notebooks will address #1.

szhan commented 1 year ago

I think it would be useful to add a separate notebook (prepare_dataset_4.ipynb) just for making the genotypes from the VCF files compatible using sgkit, for easier downstream analysis.