nlesc-ave / ave-rest-service

visualize (clustered) single-nucleotide variants across genomes
Apache License 2.0

The haplotypes endpoint returns duplicated data #34

Open sverhoeven opened 7 years ago

sverhoeven commented 7 years ago

In the haplotypes endpoint response, every haplotype repeats the same variant objects; only the genotypes field differs between haplotypes.

For example:

{
  "haplotypes": [{
    "accessions": [],
    "haplotype_id": "xxx",
    "sequence": "XXX",
    "variants": [{
      "chrom": "chrX",
      "position": 1234,
      "genotypes": [{
        "accession": "acc1",
        "genotype": "[1, 1]"
      }]
    }]
  }],
  "hierarchy": {}
}

The variant is repeated for each haplotype, with each haplotype having a different genotypes value. We should pull the variant object into a variants map in the root of the response, keyed by variant ID, and have each haplotype refer to it by variant_id. We could change this to:

{
  "haplotypes": [{
    "accessions": [],
    "haplotype_id": "xxx",
    "sequence": "XXX",
    "variants": [{
      "variant_id": "varXXX",
      "genotypes": [{
        "accession": "acc1",
        "genotype": "[1, 1]"
      }]
    }]
  }],
  "hierarchy": {},
  "variants": {
    "varXXX": {
      "chrom": "chrX",
      "position": 1234
    }
  }
}

This should reduce the size of the JSON response and speed up the server-side JSON conversion in Python 2.
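A minimal sketch of the conversion from the current schema to the proposed one, to make the idea concrete. It is not the service's actual code: the function name `pack_haplotypes` and the `chrom:position` variant ID scheme are assumptions made for illustration, since the issue does not say how variant IDs would be generated.

```python
def pack_haplotypes(response):
    """Move the fields shared between haplotypes into a root-level "variants" map."""
    variants = {}
    packed_haplotypes = []
    for haplotype in response["haplotypes"]:
        packed_variants = []
        for variant in haplotype["variants"]:
            # Hypothetical ID scheme; the issue does not define how IDs are built.
            variant_id = "{}:{}".format(variant["chrom"], variant["position"])
            # Everything except the genotypes is identical across haplotypes,
            # so it is stored only once, under the variant ID.
            shared = {k: v for k, v in variant.items() if k != "genotypes"}
            variants.setdefault(variant_id, shared)
            packed_variants.append({
                "variant_id": variant_id,
                "genotypes": variant["genotypes"],
            })
        packed = dict(haplotype)  # shallow copy, leaves the input untouched
        packed["variants"] = packed_variants
        packed_haplotypes.append(packed)
    result = dict(response)
    result["haplotypes"] = packed_haplotypes
    result["variants"] = variants
    return result


# Two haplotypes sharing one variant, differing only in genotypes.
original = {
    "haplotypes": [
        {"accessions": ["acc1"], "haplotype_id": "h1", "sequence": "XXX",
         "variants": [{"chrom": "chrX", "position": 1234,
                       "genotypes": [{"accession": "acc1", "genotype": "[1, 1]"}]}]},
        {"accessions": ["acc2"], "haplotype_id": "h2", "sequence": "XXY",
         "variants": [{"chrom": "chrX", "position": 1234,
                       "genotypes": [{"accession": "acc2", "genotype": "[0, 1]"}]}]},
    ],
    "hierarchy": {},
}
packed = pack_haplotypes(original)
```

In `packed`, the variant's `chrom` and `position` appear once in the root `variants` map, and each haplotype keeps only the `variant_id` and its own genotypes. The saving grows with the number of haplotypes and with the number of shared fields per variant.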

sverhoeven commented 7 years ago

A quick test of file size after converting the current JSON to the new schema:

./haplotypes.orig.json  1647926
./haplotypes.packed.json    724572

The packed file is about 44% of the original size, which should make the JSON encoding quicker.