HaploBlocker Outputs available for Schematize

josiahseaman commented 4 years ago

[ ] HaploBlocker - creating output JSON
[ ] component_segmentation
- [ ] row ordering listed in bin2file.json (python)
[ ] Schematize read in
- [ ] parallel files?

Example output placed at bottom of bin2file.json:

"break_points":[
    {
        "start_bin":5,
        "end_bin": 2000,
        "path_names": ["1976","768","2035","1755","64","63","1507","1506","1505","1504","2003","1972","46","1500","1499","45","1496","44","1589","1495","2079","2078","1631","1498","767","2070","1497","245","244", ...]
    }
]

1 JSON from HaploBlocker --> segmentation --> chunk_files with row ordering

tpook92 commented 4 years ago

I just added the export of the haplotype library as an additional output file. The pipeline is started by the script https://github.com/tpook92/HaploBlocker/blob/master/HaplotypeBreakSort_26_3.R

The HB pipeline now looks as follows:

Input: bin-sorted .json file from Simon/Erik (Schematize?!) + the number of breakpoints one wants to detect

Output (this is just a toy example with extremely low resolution!):

Breakpoint file: https://github.com/tpook92/HaploBlocker/blob/master/Data/chrk_ath_80samples_10kb_Chr1.w100000_breaks.json
Contains: breakpoints + local sorting between breakpoints

{"breakpoints":[
{"first_bin":[1],"last_bin":[214],"row_order":[1,2,3,4,5,6,7,8,9,10,11,14,16,17,18,19,20,21,22,24,25,26,27,32,33,34,36,37,39,40,41,42,43,45,46,47,48,49,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,78,79,80,12,28,35,77,15,23,29,50,13,30,70,38,81,82,31,44]},
{"first_bin":[215],"last_bin":[459],"row_order":[38,46,5,80,2,3,4,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26,28,29,30,32,33,34,35,37,39,40,41,42,43,45,47,49,50,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,69,72,74,75,68,76,81,52,73,48,70,71,78,79,31,44,1,36,77,51,27,82,15]},
{"first_bin":[460],"last_bin":[1142],"row_order":[43,20,45,42,49,59,51,58,10,35,47,63,55,48,13,33,5,4,17,22,65,2,6,54,19,62,30,32,56,68,60,53,40,69,34,25,14,39,16,21,52,80,26,3,64,66,50,73,41,36,23,37,71,74,77,9,11,38,18,29,78,8,7,57,81,61,44,67,12,28,31,79,72,24,75,70,46,76,1,27,82,15]}]}
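In the breakpoint file above, every scalar is wrapped in a one-element array (for example "first_bin":[1]). A minimal Python sketch of how a reader could unwrap those values, assuming the whole file follows the layout shown; the function names `unbox` and `load_breakpoints` and the shortened sample are illustrative only and not part of HaploBlocker:

```python
import json

def unbox(value):
    """Return the single element of a one-item list, otherwise the value unchanged."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value

def load_breakpoints(text):
    """Parse a breakpoint document (layout assumed from the example above)."""
    doc = json.loads(text)
    segments = []
    for entry in doc["breakpoints"]:
        segments.append({
            "first_bin": unbox(entry["first_bin"]),
            "last_bin": unbox(entry["last_bin"]),
            # row_order is kept as a list even if it has a single element
            "row_order": list(entry["row_order"]),
        })
    return segments

# Shortened, made-up sample in the same shape as the real file
sample = (
    '{"breakpoints":['
    '{"first_bin":[1],"last_bin":[3],"row_order":[2,1,3]},'
    '{"first_bin":[4],"last_bin":[9],"row_order":[3,1,2]}]}'
)
segments = load_breakpoints(sample)
print(segments[0])
```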

Haplotype library: https://github.com/tpook92/HaploBlocker/blob/master/Data/chrk_ath_80samples_10kb_Chr1.w100000_hb.json
Contains for each haplotype block:

Start bin
End bin
Individuals included
Suggested color (from HB visualization)

{"hb_library":[
{"first_bin":[3],"last_bin":[140],"included":[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,32,33,34,35,36,37,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80],"color":["#D95F02"]},
{"first_bin":[141],"last_bin":[155],"included":[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,32,33,34,35,36,37,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80],"color":["#1F78B4"]},
{"first_bin":[156],"last_bin":[160],"included":[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,32,33,34,35,36,37,39,40,41,42,43,45,46,47,48,49,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80],"color":["#DECBE4"]},
.
etc.
.
]}
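As an illustration of how the library could be queried, here is a small Python sketch that lists the blocks covering a given bin for a given individual. It assumes the layout shown above (one-element arrays for "first_bin", "last_bin" and "color") and that "included" holds the same individual indices as "row_order" in the breakpoint file; the function name `blocks_covering` and the shortened sample are hypothetical:

```python
import json

def blocks_covering(library, bin_index, individual):
    """Return (first_bin, last_bin, color) for each block that spans
    bin_index and lists the individual in "included"."""
    hits = []
    for block in library["hb_library"]:
        first = block["first_bin"][0]
        last = block["last_bin"][0]
        if first <= bin_index <= last and individual in block["included"]:
            hits.append((first, last, block["color"][0]))
    return hits

# Shortened, made-up sample in the same shape as the real file
sample = json.loads("""{"hb_library":[
 {"first_bin":[3],"last_bin":[140],"included":[1,2,3],"color":["#D95F02"]},
 {"first_bin":[141],"last_bin":[155],"included":[1,3],"color":["#1F78B4"]}
]}""")
print(blocks_covering(sample, 150, 3))
```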

HB visualization (optional) https://github.com/tpook92/HaploBlocker/blob/master/Data/ab80_w100000.PNG

josiahseaman commented 4 years ago

Anyela, since row_ordering and path_names are separate, we'll want to separate them in bin2file.json as well. There will be one key with an ordered list of "path_names" that we ensure matches the indices used by HaploBlocker. A second key, "breakpoints", will have a subkey "row_ordering" with all the indices. That nicely solves the redundancy file-size problem.

For the first issue, you don't need to read in the "hb_library". We'll deal with that later in a separate issue ticket, since it will be more complex.

josiahseaman commented 4 years ago

@covid19latam @superjox My experience getting component_segmentation from the current master was that it no longer matched my virtual environment. nested_dict is not in the Anaconda index but it can be pip installed. pip install -r requirements.txt should still work on Python 3.7.

josiahseaman commented 4 years ago

If we put all of the above together, the bin2file.json should look like this for JSON_VERSION 13:

{
    "bin_width": 100,
    "json_version": 13,
    "last_bin": 525,
    "pangenome_length": 52441,
    "files": [ ... ],
    "path_names": ["6909_chr2","768_chr2", "2035_chr3","1755", "64","63","1507", "1506","1505","1504", ...],
    "break_points": [
        {
            "start_bin": 5,
            "end_bin": 2000,
            "row_order": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
        },
        {
            "start_bin": 2001,
            "end_bin": 4560,
            "row_order": [16,17,3,4,5,6,9,12,1,2,7,8,10,11,13,14,15]
        }
    ]
}

In addition, "path_names" will no longer be listed in the chunkXX.json files, where it would now be redundant information. The row ordering of list contents in chunkXX.json files will remain the same: it will match the order listed in bin2file.json "path_names".
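A minimal Python sketch of how a reader might resolve the display order of paths for one bin under this layout. It assumes that "row_order" holds 1-based indices into "path_names" (the HaploBlocker sample output starts at 1, but the thread does not state the convention explicitly), so the offset is a parameter; the function names and the tiny sample document are hypothetical:

```python
def find_break_point(bin2file, bin_index):
    """Return the break_points entry whose bin range contains bin_index, or None."""
    for bp in bin2file["break_points"]:
        if bp["start_bin"] <= bin_index <= bp["end_bin"]:
            return bp
    return None

def ordered_path_names(bin2file, bin_index, one_based=True):
    """Return path names in display order for the segment containing bin_index.

    Falls back to the stored order when no segment covers the bin.
    """
    names = bin2file["path_names"]
    bp = find_break_point(bin2file, bin_index)
    if bp is None:
        return list(names)
    offset = 1 if one_based else 0  # assumption: indices come from R and start at 1
    return [names[i - offset] for i in bp["row_order"]]

# Made-up sample in the shape of the JSON_VERSION 13 example
bin2file = {
    "json_version": 13,
    "path_names": ["pathA", "pathB", "pathC"],
    "break_points": [
        {"start_bin": 1, "end_bin": 10, "row_order": [1, 2, 3]},
        {"start_bin": 11, "end_bin": 20, "row_order": [3, 1, 2]},
    ],
}
print(ordered_path_names(bin2file, 15))
```

Keeping the names in one list and only indices per segment is what removes the repeated "path_names" arrays from the earlier example.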

Note: This version does not include "hb_library" information. We may end up removing "occupants" or making "participants" sparse at a later date, since these are simply large expanded precomputes for display convenience. But that change is not v13.
