legumeinfo / ZZBrowse

Display macro-synteny blocks in the Whole Genome and Chromosome views #30

Closed svengato closed 3 years ago

svengato commented 3 years ago

Continuation of issue #26, which started from confused premises.

We will get this information from the Services API v2 pairwise-blocks service, using the ordered list of gene families from each chromosome of species 1 as reference chromosomes, the list of chromosomes from species 2 as the target chromosomes, and (matched, intermediate, mask) values from the ZZBrowse user interface. In other words, every time the user changes either species, we must call pairwise-blocks for each species 1 reference chromosome, with a list of all species 2 chromosomes as its targets, and combine the results.
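The call-and-combine loop described above might be sketched as follows. This is only an illustration in Python, not the ZZBrowse R code: `collect_blocks` and `fetch_blocks` are hypothetical names, the actual pairwise-blocks request is left to whatever callable is passed in, and the default parameter values are simply the ones mentioned later in this thread.

```python
from typing import Callable, Dict, List, Sequence

# One macro-synteny block as a plain dictionary (the field names are assumptions).
Block = Dict[str, object]


def collect_blocks(
    reference_families: Dict[str, List[str]],
    target_chromosomes: Sequence[str],
    fetch_blocks: Callable[..., List[Block]],
    matched: int = 20,
    intermediate: int = 10,
    mask: int = 10,
) -> List[Block]:
    """Query every species 1 reference chromosome and combine the results.

    reference_families maps each species 1 chromosome name to its ordered
    list of gene families. fetch_blocks is a hypothetical stand-in for the
    real service call; it receives one reference chromosome's families plus
    the full list of species 2 target chromosomes.
    """
    combined: List[Block] = []
    for reference_name, families in reference_families.items():
        blocks = fetch_blocks(
            families, list(target_chromosomes), matched, intermediate, mask
        )
        for block in blocks:
            # Record which reference chromosome produced each block.
            combined.append({**block, "reference": reference_name})
    return combined
```

Passing the service call in as a parameter keeps the loop independent of the transport, so the same code could sit in front of either the Services API or a different backend.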

We will display the results as tracks on the Whole Genome view gChart and the Chromosome view pChart. The blocks on species 2 require two tracks, for + and - orientation relative to the species 1 blocks (which are forward by definition and need only one track).

svengato commented 3 years ago

The results are macrosynteny matches on the species 2 target chromosome: a table of (chromosome, i, j, fmin, fmax, strand), where fmin, fmax are the base pair range on the species 2 chromosome and i, j are the gene indices from species 1.

The ith and jth genes in the species 1 annotations table then identify the matching block (chromosome, fmin, fmax) on species 1.
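As a rough sketch of that lookup, assuming the annotations for one species 1 chromosome are an ordered list of genes with `fmin` and `fmax` fields and that i and j are zero-based positions in that list (both are assumptions; `reference_block` is a hypothetical name):

```python
from typing import Dict, List, Tuple


def reference_block(
    chromosome: str, genes: List[Dict[str, int]], i: int, j: int
) -> Tuple[str, int, int]:
    """Return (chromosome, fmin, fmax) spanning genes i through j.

    genes is the ordered annotation list for one species 1 chromosome.
    The indices are assumed to be zero-based and inclusive.
    """
    low, high = sorted((i, j))
    span = genes[low:high + 1]
    block_fmin = min(gene["fmin"] for gene in span)
    block_fmax = max(gene["fmax"] for gene in span)
    return (chromosome, block_fmin, block_fmax)
```

If the service returns one-based indices instead, the slice bounds would need to shift by one.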

Finally, we assign each (i, j) a color and display the blocks in the charts. Species 1 has a distinct color for each chromosome, while the colors on species 2 look somewhat haphazard because each block takes the color of its related species 1 block.

svengato commented 3 years ago

Alan added a Levenshtein distance metric to pairwise-blocks.

Possible future metrics we discussed include Jaccard index and Kendall's tau.

svengato commented 3 years ago

In the (Highcharts) charts, I set the zIndex of each block to (fmin - fmax) to put the shortest blocks on top. Alternatively, we could put those with the smallest Levenshtein distance on top.
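Both orderings can be sketched like this, as an illustration in Python rather than the actual chart code. The `distance` field and the function names are hypothetical; the only thing relied on is that a higher zIndex is drawn on top.

```python
from typing import Dict


def z_index_by_length(block: Dict[str, float]) -> float:
    """Shorter blocks get a larger (less negative) zIndex, so they draw on top."""
    return block["fmin"] - block["fmax"]


def z_index_by_distance(block: Dict[str, float]) -> float:
    """Blocks with a smaller distance get a larger zIndex.

    The "distance" field is an assumed name for the Levenshtein distance
    attached to each block.
    """
    return -block["distance"]
```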

svengato commented 3 years ago

What should the default values of matched, intermediate, and mask be? (Currently 20, 10, 10.)

svengato commented 3 years ago

We often lack results for a few species 1 chromosomes. Alan says that the pairwise-blocks service chokes if you send it too many requests simultaneously, and that switching to GCV microservices should fix the problem.

svengato commented 3 years ago

To make a long story short, I got a local GCV microservices instance working (through Alan's Docker container), and switched to legfed_v1 gene families to test it. The relevant microservice is macro-synteny-blocks.

Next: get it running on dev-legfedorg, where it currently fails with a "No space left on device" error.

svengato commented 3 years ago

Expanded the dev-legfedorg virtual machine, then built the GCV microservices there. (This was last week, 25 Feb)

svengato commented 3 years ago

Checked in GCV microservices and macro-synteny blocks (to the dev-legfedorg branch). A few notes:

  1. The organism files point to the GCV microservices at http://localhost:6427/gcv (6427 is MICR on a telephone). For now, we will use the legfed gene families, with two caveats: (a) Arabidopsis thaliana will not have gene families, and thus no micro- or macro-synteny results; (b) the peanut and pigeonpea GCV chromosome name formats differ slightly between this GCV instance and the old one, so when we make a phytozome-friendly repository I will have to remember to change them back (as well as the new GCV microservices URL):
     arahy.Tifrunner.gnm1.Arahy.01 v. arahy.Arahy.01
     cajca.Cc01 v. cajca.CcLG01

  2. This is the macro-synteny blocks layout we have been using. Next: give macro-synteny its own chart. After that, I have some ideas for tiling the blocks better, such as y ~ Levenshtein distance (probably with smaller distances on top). Would it be sufficient to combine the two species 2 orientation tracks? You can mouse over a block to view its orientation.

  3. The intro.js tour now includes the macro-synteny settings. To do: it should include the separate macro-synteny chart as well when that is ready.

  4. servicesAPI.R is now obsolete, replaced by gcvMicroservices.R (and one function moved to zChart.R where it belongs), so I removed it from the repository but kept a local version that (inefficiently) supports macro-synteny.

svengato commented 3 years ago

Commit 1afe544:

Macro-synteny results now have their own charts. Users may display them as Levenshtein or Jaccard distance.

Levenshtein distance (on the chart) is normalized by the number of genes in the block (really the average number of genes in the species 1 and species 2 blocks). Alan corrected the Levenshtein distance calculation, which had not been returning all expected results because it hit a maximum recursion depth.
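For reference, the normalization can be illustrated with a minimal Python sketch. This is not Alan's implementation: it uses the standard iterative dynamic-programming form of Levenshtein distance (which needs no recursion) over two lists of gene family names, then divides by the average block length. The function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for row, item_a in enumerate(a, start=1):
        current = [row]
        for col, item_b in enumerate(b, start=1):
            substitution = previous[col - 1] + (0 if item_a == item_b else 1)
            deletion = previous[col] + 1
            insertion = current[col - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: Sequence[str], b: Sequence[str]) -> float:
    """Levenshtein distance divided by the average length of the two blocks."""
    mean_length = (len(a) + len(b)) / 2
    if mean_length == 0:
        return 0.0
    return levenshtein(a, b) / mean_length
```

For example, the blocks `["f1", "f2", "f3"]` and `["f1", "f3"]` differ by one deletion, so the normalized distance is 1 / 2.5 = 0.4.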

Jaccard distance can be for 1- or 2-grams, the latter optionally including reversals.
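One way to read the n-gram options is sketched below in Python. It is an interpretation, not the service's code: "including reversals" is assumed to mean that a 2-gram and its reverse count as the same gram, and `ngrams` and `jaccard_distance` are hypothetical names.

```python
from typing import Sequence, Set, Tuple


def ngrams(
    families: Sequence[str], n: int, reversals: bool = False
) -> Set[Tuple[str, ...]]:
    """Set of n-grams over an ordered list of gene family names.

    With reversals=True, each gram is stored in a canonical orientation
    so that (a, b) and (b, a) are treated as identical (an assumption).
    """
    grams = set()
    for start in range(len(families) - n + 1):
        gram = tuple(families[start:start + n])
        if reversals:
            gram = min(gram, gram[::-1])
        grams.add(gram)
    return grams


def jaccard_distance(
    a: Sequence[str], b: Sequence[str], n: int = 1, reversals: bool = False
) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the n-gram sets of the two blocks."""
    grams_a = ngrams(a, n, reversals)
    grams_b = ngrams(b, n, reversals)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1 - len(grams_a & grams_b) / len(union)
```

Under this reading, an inverted block such as `["f3", "f2", "f1"]` against `["f1", "f2", "f3"]` has distance 0 for 1-grams, distance 1 for plain 2-grams, and distance 0 again for 2-grams with reversals.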