graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

v11-16: bin2file.json format changes #16

Open josiahseaman opened 4 years ago

josiahseaman commented 4 years ago

IMPORTANT: The current version of the JSON format, with additional commentary, is on graphgenome.org/output_format.html
Each time we add a new feature to our JSON files, we update matrixcomponent/init.py: JSON_VERSION. This is checked in the Schematize processArray() file parser to make sure our standards are consistent.
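The version check described above can be sketched in Python. This is only an illustration: the function name `check_json_version` and the constant `EXPECTED_JSON_VERSION` are hypothetical, and the real check lives in the Schematize parser rather than in this snippet.

```python
import json

# Assumption: the version this reader was written against. The value is
# illustrative; the authoritative number is JSON_VERSION in matrixcomponent.
EXPECTED_JSON_VERSION = 16


def check_json_version(index: dict, expected: int = EXPECTED_JSON_VERSION) -> None:
    """Raise ValueError if a parsed bin2file.json reports another json_version."""
    found = index.get("json_version")
    if found != expected:
        raise ValueError(
            f"bin2file.json has json_version {found!r}, "
            f"but this reader expects {expected}"
        )


# Minimal usage: parse an index and verify it before reading any chunk files.
index = json.loads('{"json_version": 16, "pangenome_length": 52441}')
check_json_version(index)  # passes silently when the versions match
```

Failing early on a mismatch keeps a reader from misinterpreting fields whose layout changed between versions.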

v11: Adds FASTA files in chunks (#11)

Adds a new optional field "fasta", which should only be used on the smallest zoom levels. Each chunk file will have its corresponding FASTA file listed under "fasta" in its "files" entry:

    "files": [
        {
            "file": "chunk00_bin100.schematic.json",
            "fasta": "seq_chunk00_bin100.fa",
            "first_bin": 0,
            "last_bin": 26
        },
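As a rough sketch of how a reader might use the optional field, the lookup below returns the FASTA file for the chunk that contains a given bin. The helper name `fasta_for_bin` is hypothetical, and the snippet assumes `first_bin` and `last_bin` are both inclusive.

```python
from typing import Optional

# Entry copied from the example above.
files = [
    {
        "file": "chunk00_bin100.schematic.json",
        "fasta": "seq_chunk00_bin100.fa",
        "first_bin": 0,
        "last_bin": 26,
    },
]


def fasta_for_bin(files: list, bin_number: int) -> Optional[str]:
    """Return the FASTA file of the chunk containing bin_number, if any."""
    for entry in files:
        # Assumption: first_bin and last_bin are both inclusive.
        if entry["first_bin"] <= bin_number <= entry["last_bin"]:
            # "fasta" is optional, so fall back to None when it is absent.
            return entry.get("fasta")
    return None


print(fasta_for_bin(files, 10))
```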

v12: Bin ranges

PR 37

v13: Adds zoom directories (#12)

v12 is a major restructure, changing bin_width-specific entries into a list of entries ordered by powers of 10 (provisional).

{
    "json_version": 12,
    "pangenome_length": 52441,
    "zoom_levels": {
        "1": {
            "bin_width": 1,
            "last_bin": 52441,
            "files": [
                {
                    "file": "chunk00_bin1.schematic.json",
                    "first_bin": 1,
                    "last_bin": 2600
                },
                {
                    "file": "chunk01_bin1.schematic.json",
                    "first_bin": 2601,
                    "last_bin": 4400
                },
                {
                    "file": "chunk02_bin1.schematic.json",
                    "first_bin": 4401,
                    "last_bin": 5400
                }
            ]
        },
        "10": {
            "bin_width": 10,
            "last_bin": 5245,
            "files": [
                {
                    "file": "chunk00_bin10.schematic.json",
                    "first_bin": 1,
                    "last_bin": 1290
                },
                {
                    "file": "chunk02_bin10.schematic.json",
                    "first_bin": 1291,
                    "last_bin": 1790
                },
                {
                    "file": "chunk05_bin10.schematic.json",
                    "first_bin": 1791,
                    "last_bin": 2140 ...
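A minimal sketch of how a client might resolve a bin to its chunk file with this layout. The function names are hypothetical. The snippet assumes the "files" list is sorted by bin range and that ranges are inclusive, as the example values suggest. The `last_bin_for` helper reflects an observation from the numbers above (52441 / 10 rounds up to 5245), not a documented rule.

```python
from bisect import bisect_left
from typing import Optional

# Trimmed copy of the example index above (zoom level "1" only).
index = {
    "json_version": 12,
    "pangenome_length": 52441,
    "zoom_levels": {
        "1": {
            "bin_width": 1,
            "last_bin": 52441,
            "files": [
                {"file": "chunk00_bin1.schematic.json", "first_bin": 1, "last_bin": 2600},
                {"file": "chunk01_bin1.schematic.json", "first_bin": 2601, "last_bin": 4400},
                {"file": "chunk02_bin1.schematic.json", "first_bin": 4401, "last_bin": 5400},
            ],
        },
    },
}


def last_bin_for(pangenome_length: int, bin_width: int) -> int:
    """Ceiling division; matches the last_bin values shown in the example."""
    return -(-pangenome_length // bin_width)


def chunk_for_bin(index: dict, bin_width: int, bin_number: int) -> Optional[str]:
    """Return the chunk file covering bin_number at the given zoom level."""
    files = index["zoom_levels"][str(bin_width)]["files"]
    # Assumption: files are sorted by last_bin, so binary search is valid.
    last_bins = [entry["last_bin"] for entry in files]
    i = bisect_left(last_bins, bin_number)
    if i < len(files) and files[i]["first_bin"] <= bin_number:
        return files[i]["file"]
    return None  # bin falls outside every listed chunk


print(chunk_for_bin(index, 1, 2601))
```

Keying the zoom levels by bin width lets a viewer jump straight to the right resolution and then binary-search only that level's chunks.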

v14: X value precompute (#43)

See issue.

v15: Sparse JSON output

Detailed in #29

v16: Adds HaploBlocker row_ordering and breakpoints for phylo information (#14)

Whether HaploBlocker integration or zooming lands first will depend on when the respective issues are completed.
The bin2file.json should look like this for JSON_VERSION 13:

{
    "bin_width": 100,
    "json_version": 13,
    "last_bin": 525,
    "pangenome_length": 52441,
    "zoom_levels": [ ... ],
    "path_names": ["6909_chr2", "768_chr2", "2035_chr3", "1755", "64", "63", "1507", "1506", "1505", "1504", ...],
    "break_points": [
        {
            "start_bin": 5,
            "end_bin": 2000,
            "row_order": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
        },
        {
            "start_bin": 2001,
            "end_bin": 4560,
            "row_order": [16, 17, 3, 4, 5, 6, 9, 12, 1, 2, 7, 8, 10, 11, 13, 14, 15]
        }
    ]
}

In addition, the "path_names" will no longer be listed in the chunkXX.json files. This would now be redundant information. The row ordering of list contents in chunkXX.json files will remain the same. Specifically, they will match the order listed in bin2file.json "path_names".
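To illustrate how "break_points" and "path_names" could be combined, here is a small sketch that returns the display order of paths for a given bin. It rests on two assumptions that the text above does not confirm: that "row_order" entries are 1-based indices into "path_names", and that bins outside every break point keep the default order. The path names and the helper `ordered_paths` are made up for the example.

```python
# Hypothetical data in the shape described above (names are placeholders).
path_names = ["pathA", "pathB", "pathC"]
break_points = [
    {"start_bin": 5, "end_bin": 2000, "row_order": [1, 2, 3]},
    {"start_bin": 2001, "end_bin": 4560, "row_order": [3, 1, 2]},
]


def ordered_paths(path_names: list, break_points: list, bin_number: int) -> list:
    """Return path names in display order for the block containing bin_number."""
    for block in break_points:
        if block["start_bin"] <= bin_number <= block["end_bin"]:
            # Assumption: row_order holds 1-based indices into path_names.
            return [path_names[i - 1] for i in block["row_order"]]
    # Assumption: outside every block, keep the order from bin2file.json.
    return list(path_names)


print(ordered_paths(path_names, break_points, 2500))
```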

Note: This does not include "hb_library" information (see v14). We may end up removing "occupants" or making "participants" sparse at a later date, since these are simply large expanded precomputes for display convenience. But that change is not v13.

v16: HaploBlocks hb_library

Inside of each chunkXX.json will be a list of haploblocks and which bins are included in each haploblock. This should not alter

{"hb_library":[
{"first_bin":[3],"last_bin":[140],"included":[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,32,33,34,35,36,37,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80],"color":["#D95F02"]},
{"first_bin":[141],"last_bin":[155],"included":[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,32,33,34,35,36,37,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80],"color":["#1F78B4"]},
{"first_bin":[156],"last_bin":[160],"included":[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,32,33,34,35,36,37,39,40,41,42,43,45,46,47,48,49,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80],"color":["#DECBE4"]},

These blocks allow us to paint the matrix with Haploblocks. This may not be necessary for COVID-19.
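A sketch of how a viewer might pick the paint color for one matrix cell from "hb_library". Note that the scalar fields in the example are wrapped in single-element lists, so the code unwraps them with `[0]`. The helper `block_color` is hypothetical, the "included" lists are shortened here, and treating "included" entries as identifiers of the rows to paint is an assumption.

```python
from typing import Optional

# Shape copied from the example above; "included" is shortened for brevity.
hb_library = [
    {"first_bin": [3], "last_bin": [140], "included": [1, 2, 3], "color": ["#D95F02"]},
    {"first_bin": [141], "last_bin": [155], "included": [1, 2, 3], "color": ["#1F78B4"]},
]


def block_color(hb_library: list, bin_number: int, member: int) -> Optional[str]:
    """Return the color of the first haploblock covering this bin and member."""
    for block in hb_library:
        # Scalars arrive as one-element lists, so unwrap them first.
        first, last = block["first_bin"][0], block["last_bin"][0]
        # Assumption: "included" lists the identifiers painted by this block.
        if first <= bin_number <= last and member in block["included"]:
            return block["color"][0]
    return None  # cell is not part of any haploblock


print(block_color(hb_library, 150, 2))
```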