svengato closed this issue 3 years ago
The results are macro-synteny matches on the species 2 target chromosome: a table of (chromosome, i, j, fmin, fmax, strand), where fmin and fmax give the base pair range on the species 2 chromosome and i and j are gene indices from species 1.
The (i, j)th genes from the species 1 annotations table identify the matching blocks (chromosome, fmin, fmax) for species 1.
Finally, we assign each (i, j) a color and display the blocks in the charts. For species 1 there is a distinct color for each chromosome. In species 2 the colors look somewhat haphazard, because each block takes the color of its related block from species 1.
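The lookup from (i, j) to a species 1 block, and the color inheritance, can be sketched roughly as follows. This is an illustrative Python sketch, not the ZZBrowse R code. The `Gene` and `Block` records, the function `species1_block`, the color table, and the assumption that i and j are 0-based indices into one chromosome's position-ordered gene list are all hypothetical.

```python
from typing import Dict, List, NamedTuple

class Gene(NamedTuple):
    chromosome: str
    fmin: int
    fmax: int

class Block(NamedTuple):
    chromosome: str
    fmin: int
    fmax: int

def species1_block(genes: List[Gene], i: int, j: int) -> Block:
    """Span covered by genes i..j of one species 1 chromosome (0-based, assumed)."""
    lo, hi = min(i, j), max(i, j)
    span = genes[lo:hi + 1]
    return Block(span[0].chromosome,
                 min(g.fmin for g in span),
                 max(g.fmax for g in span))

# Hypothetical annotations for one species 1 chromosome.
genes = [Gene("Chr01", 100, 200), Gene("Chr01", 300, 450),
         Gene("Chr01", 500, 650), Gene("Chr01", 700, 900)]

# One distinct color per species 1 chromosome (values are placeholders).
chromosome_colors: Dict[str, str] = {"Chr01": "#1f77b4"}

block1 = species1_block(genes, 1, 2)
# The matching species 2 block reuses the color of its species 1 block.
block_color = chromosome_colors[block1.chromosome]
```

Here `block1` spans the second and third genes, and any species 2 block matched to it would be drawn in the same color.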
Alan added a Levenshtein distance metric to pairwise-blocks.
Possible future metrics we discussed include Jaccard index and Kendall's tau.
In the (Highcharts) charts, I set the zIndex of each block to (fmin - fmax), which puts the shortest blocks on top. Alternatively, we could put the blocks with the smallest Levenshtein distance on top.
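To make the two stacking options concrete, here is a small Python sketch of how the zIndex values could be derived. The function names and the `scale` factor are assumptions for illustration; the only thing taken from the discussion is that zIndex = fmin - fmax, and that a higher zIndex is drawn on top.

```python
def z_index_by_length(fmin: int, fmax: int) -> int:
    # fmin - fmax is the negated block length, so shorter blocks
    # get a larger (less negative) zIndex and are drawn on top.
    return fmin - fmax

def z_index_by_distance(distance: float, scale: int = 1000) -> int:
    # Alternative: smaller distance means larger zIndex. The scale
    # factor (hypothetical) turns a fractional distance into an integer.
    return -round(distance * scale)

blocks = [
    {"name": "long", "fmin": 1000, "fmax": 9000, "distance": 0.10},
    {"name": "short", "fmin": 2000, "fmax": 2500, "distance": 0.25},
]

top_by_length = max(blocks, key=lambda b: z_index_by_length(b["fmin"], b["fmax"]))
top_by_distance = max(blocks, key=lambda b: z_index_by_distance(b["distance"]))
```

With these made-up blocks, the short block is on top under the length rule, while the long block (smaller distance) is on top under the distance rule.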
What should the default values of matched, intermediate, mask be? (currently 20, 10, 10)
We often lack results for a few species 1 chromosomes. Alan says that the pairwise-blocks service chokes if you send it too many requests simultaneously, and that switching to GCV microservices should fix the problem.
To make a long story short, I got a local GCV microservices instance working (through Alan's Docker container), and switched to legfed_v1 gene families to test it. The relevant microservice is macro-synteny-blocks.
Next: get it running on dev-legfedorg. (It currently fails with a "No space left on device" error.)
Expanded the dev-legfedorg virtual machine, then built the GCV microservices there. (This was last week, 25 Feb)
Checked in GCV microservices and macro-synteny blocks (to the dev-legfedorg branch). A few notes:
The organism files point to the GCV microservices at http://localhost:6427/gcv (6427 spells MICR on a telephone keypad). For now, we will use the legfed gene families, with two caveats: (a) Arabidopsis thaliana will not have gene families, and thus no micro- or macro-synteny results; (b) the peanut and pigeonpea GCV chromosome name formats differ slightly between this GCV instance and the old one, so when we make a phytozome-friendly repository I will have to remember to change them back (along with the new GCV microservices URL).
Peanut: arahy.Tifrunner.gnm1.Arahy.01 v. arahy.Arahy.01
Pigeonpea: cajca.Cc01 v. cajca.CcLG01
This is the macro-synteny blocks layout we have been using. Next: give macro-synteny its own chart. After that, I have some ideas for how to tile the blocks better, such as y ~ Levenshtein distance (probably with smaller distances on top). Would it be sufficient to combine the two species 2 orientation tracks? You can mouse over a block to view its orientation.
The intro.js tour now includes the macro-synteny settings. To do: it should include the separate macro-synteny chart as well when that is ready.
servicesAPI.R is now obsolete, replaced by gcvMicroservices.R (and one function moved to zChart.R where it belongs), so I removed it from the repository but kept a local version that (inefficiently) supports macro-synteny.
Commit 1afe544:
Macro-synteny results now have their own charts. Users may display them as Levenshtein or Jaccard distance.
Levenshtein distance (on the chart) is normalized by the number of genes in the block (more precisely, the average number of genes in the species 1 and species 2 blocks). Alan corrected the Levenshtein distance calculation, which was not returning all expected results because it hit a maximum recursion depth.
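For reference, the normalization described above can be sketched like this. This is a generic Python illustration, not Alan's implementation in the microservice. It uses the standard iterative (row-by-row) dynamic programming form of Levenshtein distance, which involves no recursion and so cannot run into a recursion depth limit. The function names and the gene family labels are made up.

```python
from typing import Sequence

def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two gene family sequences, computed iteratively."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: Sequence[str], b: Sequence[str]) -> float:
    """Distance divided by the average number of genes in the two blocks."""
    mean_len = (len(a) + len(b)) / 2
    return levenshtein(a, b) / mean_len if mean_len else 0.0

block1 = ["fam1", "fam2", "fam3", "fam4"]  # species 1 block (hypothetical)
block2 = ["fam1", "fam3", "fam4"]          # species 2 block (hypothetical)
d = normalized_levenshtein(block1, block2)  # 1 edit / 3.5 genes
```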
Jaccard distance can be computed on 1-grams or 2-grams of gene families, the latter optionally including reversals.
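A rough Python sketch of the Jaccard options follows. It is not the microservice code. In particular, it assumes that "including reversals" means a 2-gram and its reverse are treated as the same item, which may not match the actual implementation. The function and parameter names are hypothetical.

```python
from typing import Sequence, Set, Tuple

def ngrams(seq: Sequence[str], n: int,
           include_reversals: bool = False) -> Set[Tuple[str, ...]]:
    """Set of n-grams of a gene family sequence."""
    grams = set()
    for k in range(len(seq) - n + 1):
        gram = tuple(seq[k:k + n])
        if include_reversals:
            # Assumption: (a, b) and (b, a) count as the same n-gram,
            # so store whichever orientation sorts first.
            gram = min(gram, gram[::-1])
        grams.add(gram)
    return grams

def jaccard_distance(a: Sequence[str], b: Sequence[str], n: int = 1,
                     include_reversals: bool = False) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the n-gram sets of the two blocks."""
    ga = ngrams(a, n, include_reversals)
    gb = ngrams(b, n, include_reversals)
    union = ga | gb
    if not union:
        return 0.0
    return 1 - len(ga & gb) / len(union)

forward = ["fam1", "fam2", "fam3"]
inverted = ["fam3", "fam2", "fam1"]  # same genes, opposite order

d1 = jaccard_distance(forward, inverted, n=1)
d2 = jaccard_distance(forward, inverted, n=2)
d2_rev = jaccard_distance(forward, inverted, n=2, include_reversals=True)
```

For an inverted block, the 1-gram distance is 0 (same gene families), the plain 2-gram distance is 1 (no adjacent pair appears in the same order), and the 2-gram distance with reversals is 0 again.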
Continuation of issue #26, which started from confused premises.
We will get this information from the Services API v2 pairwise-blocks service, using the ordered list of gene families from each chromosome of species 1 as reference chromosomes, the list of chromosomes from species 2 as the target chromosomes, and (matched, intermediate, mask) values from the ZZBrowse user interface. In other words, every time the user changes either species, we must call pairwise-blocks for each species 1 reference chromosome, with a list of all species 2 chromosomes as its targets, and combine the results.
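The query pattern described above (one call per species 1 reference chromosome, all species 2 chromosomes as targets, results combined) might look like the following Python sketch. It deliberately does not model the real pairwise-blocks request format: `fetch_blocks` is a hypothetical callable standing in for the service call, and the dictionary keys are made up. The default (matched, intermediate, mask) values are the current 20, 10, 10.

```python
from typing import Callable, Dict, List

Block = Dict[str, object]

def all_pairwise_blocks(
    reference_chromosomes: Dict[str, List[str]],  # species 1: name -> ordered gene families
    target_chromosomes: List[str],                # species 2 chromosome names
    fetch_blocks: Callable[[List[str], List[str], int, int, int], List[Block]],
    matched: int = 20,
    intermediate: int = 10,
    mask: int = 10,
) -> List[Block]:
    """Call the service once per reference chromosome and combine the results."""
    combined: List[Block] = []
    for ref_name, families in reference_chromosomes.items():
        blocks = fetch_blocks(families, target_chromosomes,
                              matched, intermediate, mask)
        for block in blocks:
            # Remember which species 1 chromosome each block came from.
            combined.append({**block, "reference": ref_name})
    return combined

# Stand-in for the real service, returning one fake block per call.
def fake_fetch(families, targets, matched, intermediate, mask):
    return [{"chromosome": targets[0], "i": 0, "j": len(families) - 1}]

results = all_pairwise_blocks(
    {"sp1.Chr01": ["fam1", "fam2"], "sp1.Chr02": ["fam3", "fam4", "fam5"]},
    ["sp2.Chr01", "sp2.Chr02"],
    fake_fetch,
)
```

The loop issues requests one at a time, which also avoids sending many simultaneous requests to the service.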
We will display the results as tracks on the Whole Genome view gChart and the Chromosome view pChart. The blocks on species 2 require two tracks, for + and - orientation relative to the species 1 blocks (which are forward by definition and need only one track).