graph-genome / Schematize

Visualization component of Pangenome Schematics for 1,000s of individuals and gigabase genomes.
http://graphgenome.org
Apache License 2.0

JBrowse2 Integration. #7

Open subwaystation opened 4 years ago

subwaystation commented 4 years ago

Schematize should run as a component inside JBrowse2. The integration lives in a separate repository, https://github.com/graph-genome/jbrowse-components, which should check out the Schematize repo as a submodule. All JBrowse-specific code goes in jbrowse-components. Schematize development should be made compatible with mobx-state-tree wherever possible.

This is work in progress. See https://github.com/graph-genome/jbrowse-components/tree/pangenome_group_testing.

josiahseaman commented 4 years ago

The Vertical Compression needs a "Use Vertical Compression" #9 toggle. Also the mobx updates invoked by updateMaxHeight() and resetRenderStats() are not ideal since the information needed at the beginning of the render isn't available until the end of the render frame. As of 8a9d9256716fba80d61dd0366d2a15ead05e6ae4 it takes two frames to get the desired behavior. Please could you apply your newfound mobx knowledge to updating state in the right order? I'm hoping it's possible in a single pass. If not, then most likely the vertical compression heights will need to be precalculated before the render. That will require a significant refactor, so I'm hoping it's not necessary. This is also a useful use case to test how much render metadata we can put in the state tree.
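If a single pass turns out not to be possible, the precalculation fallback could look roughly like the sketch below. It is not taken from Schematize: `ComponentRows`, `compressedHeight` and `precalculateMaxHeight` are hypothetical names, and the rule that vertical compression collapses a column to its number of occupied rows is an assumption made purely for illustration. The point is only that the maximum height is derived from the data before the render starts, so nothing has to be written back into the state tree at the end of a frame.

```typescript
// Hypothetical shape: for one component (column), which rows (individuals) are occupied.
interface ComponentRows {
  occupied: boolean[];
}

// Height of one column, in rows. With compression on, empty rows are squeezed out
// (assumed rule); with compression off, every row keeps its slot.
function compressedHeight(component: ComponentRows, useVerticalCompression: boolean): number {
  if (!useVerticalCompression) {
    return component.occupied.length;
  }
  return component.occupied.filter((isOccupied) => isOccupied).length;
}

// Pure function of the data, so it can run before the render instead of
// being accumulated during it.
function precalculateMaxHeight(components: ComponentRows[], useVerticalCompression: boolean): number {
  let maxHeight = 0;
  for (const component of components) {
    maxHeight = Math.max(maxHeight, compressedHeight(component, useVerticalCompression));
  }
  return maxHeight;
}
```

In a mobx-state-tree model, a value like this would fit more naturally as a view (a derived value) than as an action fired from inside the render, and the `useVerticalCompression` flag doubles as the requested toggle.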

josiahseaman commented 4 years ago

JBrowse1 Integration

"We've written Schematize as a React module so that it can stand alone or be loaded by https://github.com/graph-genome/jbrowse-components as a submodule. JBrowse2 will not be open to the public by the time we want to release our COVID19 Browser. We originally were planning for the long term. The main thing we want out of JBrowse is to be able to render any classic linear annotation file. If we can do that with JBrowse1, then we should proceed in that direction."

JBrowse Interaction Spec - Embedding JBrowse Snippets

Inputs: Schematize has static JSON files describing the graph genome and path positions. It also has an annotation (e.g. BigWig) for several individuals that it doesn't know how to read.

Interface: Schematize asks JBrowse to render the annotation file for a particular individual from range X to Y. JBrowse returns an image or rendering element (React?).

Output: Schematize places the annotation render within the context of the larger pangenome. Schematize handles coordinate conversion, alignment etc.
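One way to pin this interface down is as a pair of types. This is only a sketch of the spec above: every name in it (`AnnotationRenderRequest`, `AnnotationRender`, `AnnotationRenderer`, `validateRequest`) is hypothetical, and it deliberately leaves open whether JBrowse hands back an image or something embeddable.

```typescript
// What Schematize would send: one annotation file, one individual, one linear range.
interface AnnotationRenderRequest {
  annotationUrl: string; // e.g. a BigWig file that Schematize cannot read itself
  individual: string;    // the individual (path) the annotation belongs to
  start: number;         // X, in that individual's own linear coordinates
  end: number;           // Y
}

// What JBrowse would hand back: an image, or a page that can be embedded.
type AnnotationRender =
  | { kind: "image"; src: string }
  | { kind: "embed"; url: string };

// The contract itself: JBrowse-side code implements this, Schematize calls it.
type AnnotationRenderer = (request: AnnotationRenderRequest) => AnnotationRender;

// Cheap sanity check Schematize could run before asking for a render.
function validateRequest(request: AnnotationRenderRequest): string[] {
  const problems: string[] = [];
  if (request.individual.length === 0) {
    problems.push("individual is empty");
  }
  if (Math.floor(request.start) !== request.start || Math.floor(request.end) !== request.end) {
    problems.push("start and end must be whole numbers");
  } else if (request.start > request.end) {
    problems.push("start is after end");
  }
  return problems;
}
```

Coordinate conversion stays on the Schematize side of this boundary: the request is already expressed in the individual's linear coordinates, so JBrowse never needs to know about the graph.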

If this is not possible I'll write another scenario.

scottcain commented 4 years ago

Unfortunately, JBrowse 1 isn't written in React (it's Dojo--the horror). Certainly, the easiest/quickest thing to do is to embed JBrowse 1 in an iframe. It's ugly but it works. JBrowse is controlled by URL, so you can write a URL that loads a given track (one BigWig per track) at the desired coordinate. JavaScript can be used to update which URL is displayed in the iframe. We can also turn off some of the UI elements of JBrowse that wouldn't be of use in an iframe context, like the track chooser and (possibly, if desired) navigation elements.
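As a rough sketch of the iframe approach, the function below builds a JBrowse 1 URL for one track at one location. The `loc`, `tracks`, `tracklist`, `nav` and `overview` query parameters are JBrowse 1's URL controls, though they are worth checking against the deployed version. `buildJBrowseUrl` and `JBrowseView` are hypothetical helpers, not existing Schematize code, and the base URL and track label used with them would be placeholders until a real instance is configured.

```typescript
interface JBrowseView {
  baseUrl: string;     // where the JBrowse 1 instance is served (placeholder)
  refSeq: string;      // reference sequence name, e.g. "NC_045512.2"
  start: number;
  end: number;
  track: string;       // track label as configured in JBrowse (one BigWig per track)
  hideChrome: boolean; // turn off track chooser, navigation and overview bar
}

function buildJBrowseUrl(view: JBrowseView): string {
  const params: string[] = [
    // JBrowse 1 locations are written as refseq:start..end
    "loc=" + encodeURIComponent(view.refSeq + ":" + view.start + ".." + view.end),
    "tracks=" + encodeURIComponent(view.track),
  ];
  if (view.hideChrome) {
    params.push("tracklist=0", "nav=0", "overview=0");
  }
  return view.baseUrl + "?" + params.join("&");
}
```

Moving the embedded view then amounts to building a new string and assigning it to the iframe's `src`.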

A slightly nicer looking but more difficult to implement option is to include JBrowse in a div, where I can use JavaScript to make the div "act like" an image. I do this, for example, at WormBase for gene pages: https://wormbase.org/species/c_elegans/gene/WBGene00001340#-9e-3

josiahseaman commented 4 years ago

I think either of those solutions would be acceptable as a starting point. @scottcain Would you be willing to take on this issue? You've got the expertise and the experience doing almost exactly this so you're the perfect person for the job.

It sounds like we would want to create a second repo that includes jbrowse and schematize as sub-modules. I will go ahead and create a blank repo and give you administrative access to it so that we can continue to track issues in the organization.

josiahseaman commented 4 years ago

You are now commander in chief of https://github.com/graph-genome/jbrowsing_graphs. If you can see an avenue towards integrating our offering of browsing non-linear rearrangements directly in http://covid19.jbrowse.org/ or somehow merging the projects, I'm open to discussing that. The way I evaluated it, I was sure that someone would build a traditional browser for COVID19, but we've been working on a scalable graph browser because none existed. That will only matter if there are rearrangements that are biologically relevant, but we've already identified two patients in China with rearrangements in their Spike protein. So I think it's prudent to have the full graph genome pipeline online and ready in case that becomes a major factor in resistance or morbidity.

scottcain commented 4 years ago

Definitely. Here's the beginnings of a list of things I need to know:

  1. JBrowse has the concept of a reference sequence. Are all of the bigwigs created relative to a single reference, and what is that reference? If it were NC_045512.2 that would be convenient but not necessary.

  2. Are the bigwigs already available somewhere, and if so, what are the URLs and identifiers? If they are being served from somewhere that handles http range requests, that's great, they can just stay there. If not, is it OK if I put them in the S3 bucket that drives my jbrowse instance?

  3. If we are using NC_045512.2 as a reference sequence, is it OK if I just add the bigwig tracks we require to the browser at covid19.jbrowse.org?

  4. Are there any other tracks/features you'd like to see added?

My plan going forward is to get the bigwigs you need served by JBrowse from somewhere, and then outline for you the URL api for accessing those data, then we can work on incorporating iframes (since they are much easier to work with) into Schematize.

josiahseaman commented 4 years ago

Just to be clear, BigWig isn't the first or even highest priority, just an example. Reference bias haunts us. Graph genomes can take in any annotation from any assembly. The key requirement is that each annotation is clearly labeled with what assembly it is targeting. If we have that, spinning up separate JBrowses for each assembly seems doable.

I honestly don't know if non-reference annotation files exist yet, but you know they should. If we need a single reference in order to interoperate with covid19.jbrowse.org then we'd just need to liftOver any non-reference annotation. That wouldn't give me good feelings, but I know how to do it and it's a hackathon anyways. I'm also happy to let the two projects be separate deliverables and just say you're really valuable in doing this because you have experience even if they're not the same repo. Limiting ourselves to a single reference assembly neuters much of the value add of graph genomes, so it's a step taken with deep consideration first.

scottcain commented 4 years ago

I had a feeling that might be the case--I can write a script that would spin up a separate JBrowse instance given a reference sequence (and hopefully some feature annotation) and then add bigwig tracks that use that sequence as a reference. The only downside of me doing that is that I'd want to write such a script in --wait for it-- Perl, because I'm old. :-) We could continue to host the JBrowse instances at covid19.jbrowse.org, we would just need to write the url in such a way that JBrowse knows what reference sequence to use.

If that's the approach we take, I would be inclined to do all of the work in the repo for the covid19 jbrowse instance since we'd let it continue to be hosted there.
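To make the "one JBrowse instance per reference" idea concrete, here is a small sketch that groups annotations by the assembly they target and derives one URL per assembly. JBrowse 1 selects its data directory with the `data` query parameter; everything else here is an assumption for illustration, including the `data/<assembly>` directory layout and the names `Annotation`, `groupByAssembly` and `instanceUrl`.

```typescript
// Each annotation must be labeled with the assembly it targets.
interface Annotation {
  file: string;
  assembly: string;
}

// Collect annotation files under their target assembly: one JBrowse instance per key.
function groupByAssembly(annotations: Annotation[]): { [assembly: string]: string[] } {
  const groups: { [assembly: string]: string[] } = {};
  for (const annotation of annotations) {
    if (Object.prototype.hasOwnProperty.call(groups, annotation.assembly)) {
      groups[annotation.assembly].push(annotation.file);
    } else {
      groups[annotation.assembly] = [annotation.file];
    }
  }
  return groups;
}

// URL that tells JBrowse which reference to use (assumed layout: data/<assembly>).
function instanceUrl(baseUrl: string, assembly: string): string {
  return baseUrl + "?data=" + encodeURIComponent("data/" + assembly);
}
```

A script run occasionally by hand would walk the keys of `groupByAssembly(...)` and set up one data directory per assembly, whatever language it ends up being written in.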

josiahseaman commented 4 years ago

Thank you for the insightful input. I think that sounds like an acceptable solution. That way we can scale from one to multiple instances as non-reference annotations become available, but we can get something up pretty quickly if it turns out all of the researchers are simply aligning everything before they publish it. I haven't seen for myself, but it seems a safe bet that, given that sequences are coming from all over the world, someone somewhere is going to start disagreeing about what assembly to use.

The solution you describe still sounds like there's a fair amount of code that needs to go into both repositories in order to coordinate them. Would you also be putting signal emitting code in Schematize? ((I'm not allergic to Perl, just so long as it's not cogol or Pascaljs or whatever))

scottcain commented 4 years ago

Dunno :-) Do you envision somebody adding a new reference to Schematize and that automatically triggering the generation of a new jbrowse instance for it? If so, I'll have to think a little bit about how best to do it (and maybe chat with the other JB developers, who will no doubt try to talk me out of going down the perl road). When I was initially thinking about writing a perl script, it was to be executed occasionally by hand on the command line. If we want something automated, that's moderately (or even significantly) more complicated, as we'd have to create an environment in which the automated thingy would run (I could imagine creating a dockerfile that could create such an environment but none currently exists) and then commit the results back to the covid 19 jbrowse repo and then cause the covid19 jbrowse dockerfile to be rebuilt and restarted on its server. That's a lot of moving pieces.

scottcain commented 4 years ago

Also, is there sample data that uses NC_045512.2 as a reference so I can start hacking something together? Or, if there is another sequence that has data associated with it, can I have that too/instead?

scottcain commented 4 years ago

Also, conceptually similar to the automated spinning up of a jbrowse instance I sort of outlined above, there is http://covbrowser.org/ which was written by some people in the JBrowse group using the content of covid19.jbrowse.org as starting material. covbrowser.org is a pastebin, where the user can upload a SARS-CoV-2 sequence and the server code compares it to the reference sequence and generates a VCF (I think that's what it does--anyway, it does something to display the differences), and then gives the user a stable, sharable URL. What I outlined above is similar except that it would have to do a little more, creating a new instance (presumably from a template). If that's what we wanted to do, I'm sure I could enlist the people who worked on covbrowser.org to extend it.

josiahseaman commented 4 years ago

No, that's a level of complexity not required by our spec. Graph genome creation is not trivial and we can't dynamically update them (yet!). So they'll always be run by a person like you first described. The ability to have a non-reference genome is important to us, the ability to make it update on the fly is not, since in all likelihood we'd end up scooping some bad data, ruining the graph, then showing that on our "live feed". These will always be human checked, human initiated updates.

scottcain commented 4 years ago

Ok, good to hear--now, about sample data... :-)