halasadi / ibd_data_pipeline

4 stars 1 forks source link

How do we use the pipeline? #1

Open halasadi opened 6 years ago

halasadi commented 6 years ago

We are unclear on is how to move from phased haplotypes -> ibd tracts -> ibd sharing matrix. We understand the latter is what’s required to run the program (as a .sims file). We also know you provide a pipeline for converting beagle output, but the thing is that we can’t find documentation on how to actually run the scripts. Three major questions about this:

what is the initial input file?

Is there a config file format?

What is the order of running the scripts?

halasadi commented 6 years ago

What is the order of running the scripts?

I agree, the files I provide do not give clear instructions on how to call IBD segments for those not familiar with Snakemake.

To answer your questions, it is important to first point out that Snakemake works similar to Make. The Snakefile can be found here. In general, you can follow the order of steps in the pipeline by tracing back the Snakemake rule calls.

In this case, they are

1) break_up_plink 2) convert_to_vcf 3) callIBD 4) relabelIBD 5) merge_ibd_segments 6) create_sparseregions 7) remove_sparse 8) convert_to_cM 9) remove_gaps

This leads up to another question of yours

What is the initial input file?

The input to the first step (or "rule" in Snakemake language) will be the input to the rule break_up_plink, which are are plink files (3 files .bed,.bim, .fam) of the dataset you wish to call IBD segments on.

Alternatively, instead of Snakemake, you can call these rules separately by running the respective individual Python files, and calling Beagle directly. Fortunately, Beagle has good documentation here

Here is a more bare-bones pipeline that does not require Snakemake:

(1) call IBD on your data with Beagle (see Beagle instructions on how to do that)

(2) convert lengths from basepairs to cM by assuming a recombination rate of 1e-8. In my analysis, I saw a little difference between using a recombination rate map versus simply assuming a constant recombination rate.

This will get you almost there. Afterwards, you can remove IBD segments that are located near the centromere (which is what effectively remove_sparse.py does).

Is there a config file format?

Yes, there is a config file for my Snakemake file here It simply specifies a few variables that I call in my Snakemake file. Alternatively, I could of placed them on top of my Snakemake file.

Please let me know if you have any more questions.

Side-note: I should point out that in my paper, I do not call IBD segments myself on the empirical data. I use the IBD calls from here to maximize power and minimize false positives.