graph-genome / graph_summarization

A browser for graph genomes, built with VG and based on Graph Summarization to provide semantic zoom: as a user zooms in on a graph genome, the topology becomes more complex. It provides visualization of variation within a species of plant or animal, and is designed to scale up to thousands of specimens while still providing useful visualizations.

Decide on the Technology Stack #1

Closed josiahseaman closed 5 years ago

josiahseaman commented 5 years ago

We'll need to agree on a set of technologies to build the VG browser on. We need frameworks that 1) are a good fit for the problems we need to solve and 2) team members are familiar with, so we can maximize productivity. This project will necessarily involve multiple languages and will definitely contain a JavaScript layer. In each comment, please make a case for the technologies you would like to use and what their advantages and disadvantages are. Once we have starting proposals, we can decide on a final intersection.

subwaystation commented 5 years ago

The only experience I have so far is data structure: C++ -> bindings: nbind.js -> server: node.js -> client: JavaScript. The question is whether node.js in general fits our needs. As it is written in JavaScript, the language won't be an issue, although I would aim for development in TypeScript, hopefully leading to a less error-prone implementation. Here is an overview of the efficiency of current web frameworks: https://github.com/the-benchmarker/web-frameworks. The top ones are PHP, Rust, and Ruby, in which I do not have any experience at all. Node is not among the 10 fastest. But if we decide to send most of our data over a WebSocket connection, that might not be an issue. Furthermore, Node would allow us to build our server from scratch and implement all the caching and server-client communication logic exactly as we need it.

subwaystation commented 5 years ago

I think one key question we should answer first is the following: in terms of efficiency, is it necessary to access the C++ code directly via JavaScript bindings, or should we access vg via the command line?

My personal experience so far is that if we use bindings, extracting subgraphs out of a graph is really fast. I just haven't evaluated whether that implementation effort will be necessary.

Maybe @ekg can tell us more about the performance of vg?

Should I open a separate issue for this discussion?

ekg commented 5 years ago

There are disk-based indexes for .vg graphs and read sets. However, these don't and can't really exist for GBWT (haplotype indexes). Also, you get a lot of functionality from the .xg index that probably can't be ignored.

Loading these indexes into RAM can take a long time depending on the system, and uses something like 40G (~20 for each GBWT and xg) for the 1000GP graph. It is not possible to load these for every single query that the UI needs to make.

I think it will be necessary to expose an API that can be called by the web app and maintain these in RAM during runtime. I don't think that the specific resource requirements are too bad for typical data sets.

We can also consider building a new set of indexes that are appropriate for this context. There is some interest in doing so on top of GFA. That might trade off time and space (RAM) in a desirable way. But this will require development.

At present, I would suggest focusing effort on building out an API for a long-running server process that maintains the full succinct graph indexes in memory.

subwaystation commented 5 years ago

Thank you very much for the feedback. A well-defined API is a good idea. As we will need as much speed as possible, a RESTful API is not applicable here, I think. We should aim for an API that makes use of WebSockets. So from my understanding, Erik's summary points in the direction that we will need the C++ bindings; otherwise we could not keep a graph index in RAM.

Just to clarify @ekg: do we only need to store the index in RAM, or both the index and the graph (which is ~one order of magnitude smaller, right)?

josiahseaman commented 5 years ago

@subwaystation I don't think the speed limitation of a RESTful API is relevant here. We need high payload, but latency is not as relevant, since the server isn't rendering. "Once established, a websocket connection does not have to send headers with its messages so we can expect the total data transfer per message to be less than an equivalent HTTP request. ... This difference will be less significant for larger payloads however since the HTTP header size doesn't change." src. For payload bandwidth, we'd look into compressing and sending binary data responses, then decompressing them in the browser. That could be done with or without websockets, and it can also be implemented after we get a basic product working. I could be convinced to use websockets depending on how cumbersome they are; I haven't needed them for the other one-page apps I've made.

ekg commented 5 years ago

RESTful API is not applicable here, I think

As the graph isn't changing, we could be RESTful. You are worried about performance?

ekg commented 5 years ago

Just to clarify @ekg: do we only need to store the index in RAM, or both the index and the graph (which is ~one order of magnitude smaller, right)?

The graph is equivalently encoded in the xg index, which stores the full vg data model. You don't need to store the .vg serialized graph in memory.

The GBWT would also need to be in memory should we want to use the haplotypes in the visualization. An index of the read set would have to be held in memory should we want to display alignments.

josiahseaman commented 5 years ago

Here's the stack that I would propose based on my fullstack dev experience. I'll leave some comments on tradeoffs and options.

Proposed Technology Stack

I wouldn't use Node.js, because a big motivation for it is making the client and server run in the same language, which is particularly useful for beginning programmers. It's a lot less useful for experienced programmers who would like to access bioinformatics libraries, which are largely in Python, R, and Perl.

subwaystation commented 5 years ago

"Once established, a websocket connection does not have to send headers with its messages so we can expect the total data transfer per message to be less than an equivalent HTTP request. ... This difference will be less significant for larger payloads however since the HTTP header size doesn't change."src.

I agree that in terms of payload, we do not necessarily need a WebSocket connection. But assuming we have ~100-1000 concurrent connections accessing the server (is this a realistic use case?), a server can handle these requests more efficiently over an open WebSocket connection than over HTTP requests src.

@josiahseaman I will take a look at your stack, but I need some time to digest and reflect on it. But I think you clearly are the more experienced one here.

josiahseaman commented 5 years ago

I'm not concerned about latency, but requests per second may matter if what is being requested is unpredictable. I noticed on the page you linked that Python holds two of the top 5 slots for requests per second.

Requests per second

  1. agoo-c (C)
  2. japronto (Python)
  3. actix-web (Rust)
  4. kore (C)
  5. vibora (Python)

All 5 are C variants, including the Rust and Python entries. I was just reading up on WebSockets, which is certainly doable with Django src.

Another way we could approach deciding on a dev stack is matching areas of responsibility with areas of expertise. I'm mainly interested in the D3 data visualization component, so my preferences would matter most in the area where I'm actually doing the work. If someone else wants to take over the back end (maybe @6br), then they'd be the final decision maker on the technology they're using.

6br commented 5 years ago

The biggest issue with WebSocket is that users need to install a self-signed root certificate in their web browser when they want to launch a local instance of the genome browser. The exception to that argument might be that WebSocket works better if we need one of (1) peer-to-peer connection, possibly over IP masquerade, (2) much shorter latency, (3) enforced encryption, or (4) millions of short communications between servers and/or nodes. I can't think of any genome browser that requires so many concurrent connections, so I would assume none of these applies to graph genome browsers.

6br commented 5 years ago

From my experience implementing the backend server in MoMI-G, we should consider whether a graph format conversion algorithm is implemented in the backend server or in the vg binary/libraries. Because there is no single vg command that extracts all the information needed for visualization, I implemented the MoMI-G backend to call the vg binary multiple times. For example, the MoMI-G backend assigns the coordinate of a path (i.e. chr2:10000) to each node, which requires calling the vg binary once per path. If you intend to run the algorithm in the backend server, I prefer a language that is easier to debug, because the code often requires many changes during development. It seems sufficient to change the framework once the algorithms are mature, if more speed is needed.
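As a rough illustration of this call-out pattern, here is a minimal Python sketch that shells out to the vg binary to pull a subgraph around a path region and convert it to JSON. The exact vg flags vary between versions, so treat them as placeholders rather than a tested invocation:

```python
# Sketch only: extract a subgraph around a path region by calling the vg
# binary twice, in the style the MoMI-G backend uses. Flags are assumptions.
import json
import subprocess

def fetch_subgraph(xg_index: str, region: str, context: int = 2) -> dict:
    """Return the subgraph around e.g. 'chr2:10000-10100' as a JSON dict."""
    find = subprocess.run(
        ["vg", "find", "-x", xg_index, "-p", region, "-c", str(context)],
        check=True, capture_output=True)
    view = subprocess.run(
        ["vg", "view", "-j", "-"],  # convert the binary subgraph to JSON
        input=find.stdout, check=True, capture_output=True)
    return json.loads(view.stdout)

# One such round trip per path is why the backend ends up calling vg many times.
```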

josiahseaman commented 5 years ago

I agree with everything you're saying about performance. Now that we've laid out the requirements, I think we are going to need a heavy backend that speaks fluent VG. Using Python bindings into the VG library would allow the kind of debugging you're talking about. One advantage would be Pandas, an easy-to-use Python library with very fast table manipulation, selectors, etc. But ultimately, we should just play to our strengths.

I just looked at your backend and noticed it's written in Rust; it looks to be about 2,500 lines of code. Do you think that's reusable? If you want to be responsible for the backend, Rust is the right choice. I've heard good things about it: it has native C integration and it's ridiculously fast.

6br commented 5 years ago

I think that the backend server should not access vg via C++ bindings, because it becomes very difficult to debug once it causes a segfault; we cannot reproduce the segfault. Also, it takes a long time to compile if the backend server links against the vg library, which makes debugging difficult. Therefore, I recommend that the backend server connect to the server mode of vg via a socket (a minimal sketch follows the list below). Doing so, we can separate our responsibilities and concentrate on our own work. Performance and productivity are a trade-off. Because we are starting with a small team, we should start from an MVP that works well even if it is slow. I know C, C++, and Rust are good on performance but not on productivity, because we have to manage memory ourselves; alternatively, we could use Go, Haskell, or Scala, which have good GC systems, if you prefer compiled languages.

We need the following architecture:

  1. Daemonized vg server (not yet available)
  2. Backend server to talk with vg, lay out graphs, and convert file formats. The backend can also hold another index set for the genome graph.
  3. Frontend
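To make the layering concrete, here is a minimal Python sketch of how layer 2 might talk to layer 1 over a socket. Since the daemonized vg server does not exist yet, its address and line-based protocol here are pure invention for illustration:

```python
# Hypothetical sketch: backend (layer 2) querying a daemonized vg server
# (layer 1) over TCP. The daemon, its port, and its protocol are invented.
import socket

VG_DAEMON = ("localhost", 7777)  # assumed address of the vg daemon

def query_vg(command: str) -> bytes:
    """Send one request line to the vg daemon and read the full reply."""
    with socket.create_connection(VG_DAEMON) as sock:
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while chunk := sock.recv(65536):  # read until the daemon closes
            chunks.append(chunk)
    return b"".join(chunks)

# e.g. query_vg("subgraph chr2:10000-10100") -> serialized subgraph bytes
```

Keeping the protocol this thin is what makes bugs reproducible: any failing request can be replayed against the daemon by hand.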
subwaystation commented 5 years ago

Another way we could approach deciding on a dev stack is matching areas of responsibility with areas of expertise. I'm mainly interested in the D3 data visualization component, so my preferences would matter most in the area where I'm actually doing the work.

@josiahseaman If everyone focuses solely on their areas of expertise, we will see the fastest implementation progress, but then we would learn the least, I expect? I would also like to do a lot of the implementation in D3, but I am open to helping out in any part of the project.

@6br About WebSockets: I would expect that we need low latency, too. From my understanding, if a user browses continuously, new subgraph requests to the server, and therefore the subgraphs sent back to the client, must happen sufficiently fast. So I think we need the best in performance, payload, and latency.

About the vg backend: when doing my thesis, I also noticed that the development effort with the C++ bindings was incredibly large. So to start with, and for comparably fast project progress, we could leave that out and see later whether we really need to improve performance. I just don't quite understand what such a vg server would look like. What do we expect from it? Could you please elaborate?

About the technology stack: I will have to wait for feedback from my collaborators @Computomics. Our initial meeting will be on 27th May; before that I can't say anything definite here. What I have so far is that a Rust backend would be very fast, but the implementation effort comparably high. I think if we go for Python and Django, the implementation effort is lower and, as @josiahseaman stated, a lot of bioinformatics resources come with the language. How about Flask or Sanic as alternatives? From React downwards, the tools @josiahseaman proposed make total sense to me.

6br commented 5 years ago

I am interested in the development of the backend server, including a novel layout algorithm for pre-rendering, a rich genome graph format, and/or an additional "overview" view. About the vg server backend, I wrote down a simple idea in https://github.com/vgteam/vg/issues/1343. C++ bindings would require much effort from us to follow the continuous updates of vg. Instead, my intention is that the server mode of vg and the backend server are connected with a socket. By doing so, we can easily reproduce bugs or debug the intermediate output. Another option is to implement another lightweight index for genome graphs in the backend server.

subwaystation commented 5 years ago

We might also want to take a look at https://neo4j.com/developer/graph-database/. Sounds interesting, but I don't know if they support GFA, etc. I think they focus on the representation of classical graphs. With our genome variation graph, however, we have a very special case.

So far our graph database would basically be a graph generated by vg.

6br commented 5 years ago

I considered the advantages and disadvantages for a few weeks. Rust is an awesome language for systems programming, and we can write fast code thanks to the ownership system, generally without worrying about memory leaks. However, if we wish to implement graph algorithms, ownership may collapse and we have no choice but to use unsafe blocks. Also, due to the lack of GC, allocating and freeing small memory regions may cause memory fragmentation. Also, we cannot recover stack traces in production builds. In other words, it imposes a significant debugging burden on us. I suggest implementing the backend server in Python Flask (or Go) at first. Anyway, I wouldn't like to be a bottleneck for this project. After our algorithms are mature enough, we can port the implementation to another programming language.

ekg commented 5 years ago

I have some doubts that the neo4j backend would scale to the graph and path sizes we have while retaining efficiency for the kinds of queries we need (e.g. path position stuff). I think it will actually be pretty easy to wrap an xg index in the server logic that's needed.


josiahseaman commented 5 years ago

@6br Thank you for carefully considering this. Given the choice between Python and Go, I would pick Python, because there's a greater chance of collaborators being familiar with Python. As much as I like Go, most bioinformatics libraries are in Python, so hopefully we can leverage those as well, particularly for file format I/O. @6br How familiar are you with Python currently? It's a pretty learner-friendly language.

subwaystation commented 5 years ago

@ekg thanks, so we will go with wrapping the xg index in the server logic.

@6br Here is an overview of Python web frameworks. I think Flask would give us a good start. It also supports WebSockets and is a comparatively mature piece of software.
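For a sense of scale, a Flask starting point along these lines could be as small as the sketch below, which uses the Flask-SocketIO extension for the WebSocket side; the event names and payload shape are placeholders, not a settled API:

```python
# Minimal sketch of a Flask + WebSocket starting point. The
# "subgraph_request"/"subgraph" event names and payload are assumptions.
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on("subgraph_request")
def handle_subgraph_request(payload):
    # payload might carry e.g. {"region": "chr2:10000-10100", "zoom": 3}
    emit("subgraph", {"region": payload.get("region"), "nodes": [], "edges": []})

if __name__ == "__main__":
    socketio.run(app)  # serves HTTP and WebSocket traffic on one port
```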

joehagmann commented 5 years ago

Hi guys, I'm Jörg from Computomics. Thanks for all the well-developed thoughts so far.

Since I don't really have web development experience, I can't add too much to this discussion (there might be other people getting involved later on). I trust your judgements and expertise.

I just want to point out that our challenge is to build a scalable visualization that can cope with many large and complex genomes (the graph will look really messy and will have millions of small nodes, at least without smart node merging). That means, I assume, data is transferred from server to client more or less constantly (transfer of large subgraphs for efficient scrolling; loading of flanking graphs around the current view; loading of pre-computed zoom levels; retrieving long-range SV information of samples which might not overlap with the current view, since we also want to see paths that have no nodes in the current view...). Do you think that's possible to achieve efficiently without implementing it asynchronously? I.e., don't we need an async Python web framework?
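To illustrate the concern, here is a minimal async sketch (assuming aiohttp; the endpoint and the simulated slow lookup are placeholders): one event loop keeps many slow subgraph transfers in flight at once instead of tying up a worker per request.

```python
# Illustrative only: async handling of concurrent subgraph requests.
# The /subgraph endpoint and the sleep stand-in are invented for this sketch.
import asyncio
from aiohttp import web

async def subgraph(request: web.Request) -> web.Response:
    region = request.query.get("region", "chr1:0-1000")
    await asyncio.sleep(0.1)  # stand-in for a slow index lookup or vg call
    return web.json_response({"region": region, "nodes": []})

app = web.Application()
app.add_routes([web.get("/subgraph", subgraph)])

if __name__ == "__main__":
    web.run_app(app)  # flanking/zoom prefetch requests are served concurrently
```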

6br commented 5 years ago

@josiahseaman Thank you for your comment. I have development experience in Python, so using Python is a good choice for me. @subwaystation Thank you for your advice; I will check the overview. @superjox Hi Jörg, nice to e-meet you. I see your point. Do you have any framework recommendations?

subwaystation commented 5 years ago

@6br You are welcome ;) TechEmpower Benchmarks shows another great overview of web frameworks. Except in the JSON serialization category, most of the frameworks implemented in Go are clearly faster than the Python ones. From a development-effort point of view I am definitely pro Python, but I am a little bit afraid that it might be too slow for our aims.

As I see no way around WebSockets, we should make sure that the WebSocket implementation of our framework passes the autobahn-testsuite. For example, Flask seems to be capable of doing that.

Other questions I am curious about:

josiahseaman commented 5 years ago

Josiah on Websockets

((Reposted from the Google Doc))

  1. The difference between REST and websockets for us won’t be that big.
  2. I’m open to using websockets, whatever is easiest with the other technologies. Django does websockets just fine.
  3. We’re only going to be transferring <1MB of JSON every 10-20 seconds from 2-4 requests.

From the computer science side, I'll explain why websockets are a minor issue, based on my 14 years of experience. "95% of the time, premature optimization is the root of all evil"; knowing how to recognize the 5% of the time it really matters is key. For us, Graph Summarization is the defining performance architecture. Full stop. If we can pull off graph summarization, then we'll be sending the same number of nodes horizontally regardless of our zoom level. If we correctly categorize haplotypes, then we can always send 20 haplotypes regardless of how many thousands of accessions there are. So it's always 10 nodes * 20 haplotypes worth of data being sent to the client. If they scroll sideways, we just send the same amount of data again. If they want to scroll faster, they can zoom out and go to a higher level of summarization. Client-side performance is constant in time and size by design. Large datasets will mean graph summarization takes longer to precompute, which affects server costs, but that has nothing to do with the client.
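A back-of-envelope sketch of that constant-size window, using the 10-nodes-by-20-haplotypes figures from the paragraph above (the JSON shape itself is invented for illustration):

```python
# Sketch: the payload for one view window has constant size by design.
# NODES_PER_WINDOW and HAPLOTYPE_BINS come from the comment above; the
# field names are placeholders.
import json

NODES_PER_WINDOW = 10  # horizontal nodes sent per view, at any zoom level
HAPLOTYPE_BINS = 20    # haplotype categories, regardless of accession count

window = {
    "nodes": [{"id": i, "seq_len": 32} for i in range(NODES_PER_WINDOW)],
    "haplotypes": [
        {"bin": h, "path": list(range(NODES_PER_WINDOW))}
        for h in range(HAPLOTYPE_BINS)
    ],
}
print(len(json.dumps(window)))  # independent of zoom level and specimen count
```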

We will have a knob where we could turn up the complexity detail shown in the view, but from what I've seen with IVG, we'll run into limits of human understanding and browser SVG rendering before we run into a bandwidth/latency limiter. I would also take seriously Toshiyuki's concern about the complexity of setting up a client-side websocket certificate. Those are the kinds of issues that can blow up into months of work building around a platform restriction. Though if websockets have an unsafe/unencrypted mode we can run for localhost, that would solve the problem.

Action Items

Toshiyuki on parallelism of backend

I assume that even the UCSC Genome Browser or Ensembl doesn't support such simultaneous access to the server. If you need to boost the performance of the server (for example, in a tutorial session where more than thirty people access the backend server), just add hardware resources, i.e. scale out. So I think we do not have to pay much attention to parallelization in the MVP. Parallelization comes along with troublesome locks or mutexes, which make code unnecessarily complicated in general.
J: All "server" frameworks, like Django, will handle parallel requests. We only have to code for parallelism if there is a single complex request.

josiahseaman commented 5 years ago

Python numpy array transfer

Ben Jeffrey, the developer of Panoptes, pointed out some code that might be helpful to us for performance. @6br you might want to check this out when you have time. Panoptes is very responsive.

"Panoptes works on linear genomes only, so I'm not sure there is much you could re-use, although you might find some of the arraybuffer techniques I use for sending large numpy arrays to the browser useful.

The two files where I use this are: https://github.com/cggh/panoptes/blob/38507c7ee4cba1be277097c632ff97a866b90db6/webapp/src/js/panoptes/arrayBufferDecode.js https://github.com/cggh/panoptes/blob/1ed0906495decddb22f1911833c32c43a2b9292d/server/arraybuffer.py"
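The core of the technique, as a hedged sketch (this is not Panoptes' actual wire format; see the two linked files for that): ship the numpy array as raw bytes with a tiny self-describing header, and reconstruct it from the ArrayBuffer in the browser.

```python
# Sketch of the arraybuffer idea: raw numpy bytes plus a small JSON header,
# instead of serializing a million numbers as JSON text. Format is invented.
import json
import numpy as np

def encode_array(arr: np.ndarray) -> bytes:
    header = json.dumps({"dtype": str(arr.dtype), "shape": arr.shape}).encode()
    # 4-byte big-endian header length, then the header, then raw array bytes
    return len(header).to_bytes(4, "big") + header + arr.tobytes()

rng = np.random.default_rng(0)
coverage = rng.integers(0, 1000, 1_000_000, dtype=np.uint16)
blob = encode_array(coverage)  # ~2 MB raw, versus several MB as a JSON list
```

On the client, the header tells a DataView/TypedArray decoder which typed array to wrap around the remaining bytes, so no per-element parsing is needed.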

ekg commented 5 years ago

I agree that the major problem is not performance, but human understanding of the renderings we're making. This suggests that we should be focusing on rendering / layout models that are scalable and human readable. What exactly these look like isn't clear to me. I've obviously explored a linear model, but I think that has its own problems. Perhaps there is a middle road to be had, or different models to apply to different scales.

6br commented 5 years ago

Technology stacks depend on #5. If we select 1 or 2, the backend server might be lightweight (like Python Flask). If we select 3, then the backend server might be heavier, because we need index systems for the graph. We would benefit from a database migration system like Django's if we use an RDBMS as the backend database; however, graph data seems difficult to convert into table data for MySQL or PostgreSQL.

subwaystation commented 5 years ago

@josiahseaman I see myself and websockets more as a functional couple! As we both have a progressive view of life, marriage is optional, but not a necessary requirement.

[Reposted from Google Doc]

I think the biggest difference between using REST and websockets is that REST will give us ~1.5 brains, whereas ws will give us 2.0 brains:

|  | REST (HTTP) | WebSocket (ws) |
| --- | --- | --- |
| Implementation effort | Very easy: every basic web framework offers nice interfaces | Easy: `onopen`, `onmessage`, `onerror`, `onclose`, `send`, `close` |
| Communication | Unidirectional: from client to server only; a new request for each fetch; only client-side caching possible; the server is not aware of the client and can't broadcast anything → 1.5 brains working | Bidirectional: from client to server and vice versa; server-side caching possible (e.g. the next zoom level precached); if data changes on the server it can be broadcast immediately; typically used in "real-time" applications → 2.0 brains working |
| Certificate | Not needed | Not needed |
| Encryption | Optional (HTTPS) | Optional (wss) |

Using websockets, we can go for the “1.5 brains” option first, comparable to the REST implementation. This means the server only reacts to the clients’ messages and does not cache anything. Then we could still enhance the server to the “2.0 brains” option.

If we go for ws with "1.5 brains", the implementation effort will be the same as for REST, and we can still extend easily later. If we go for REST, we would have to reimplement everything later, when we want server-side caching.

[End of Repost]
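A toy sketch of that upgrade path (everything here, including the handler, the `render()` stub, and the event names, is a placeholder): the "1.5 brains" version just answers the request; the "2.0 brains" version additionally pre-renders the next zoom level and pushes it down the open socket.

```python
# Placeholder sketch of "1.5 brains" vs "2.0 brains" in one WebSocket
# handler (Flask-SocketIO style). render() stands in for real graph logic.
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)
ZOOM_CACHE = {}  # (region, zoom) -> pre-rendered subgraph

def render(region: str, zoom: int) -> dict:
    return {"region": region, "zoom": zoom, "nodes": []}  # stub

@socketio.on("view_request")
def handle_view_request(msg):
    region, zoom = msg["region"], msg["zoom"]
    subgraph = ZOOM_CACHE.pop((region, zoom), None) or render(region, zoom)
    emit("subgraph", subgraph)        # "1.5 brains": answer the request
    nxt = render(region, zoom + 1)    # "2.0 brains": precompute next level...
    ZOOM_CACHE[(region, zoom + 1)] = nxt
    emit("subgraph_precache", nxt)    # ...and push it proactively
```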

Using websockets with docker is very easy, too: https://stackoverflow.com/questions/54101508/how-do-you-dockerize-a-websocket-server

@ekg I see your point that in the short run we should focus on graph summarization and not on performance. However, I think performance will become more and more important in the long run, so I want to be prepared.

josiahseaman commented 5 years ago

Since this discussion, we've agreed that WebSockets can be used without encryption, meaning we can run a local server without onerous setup. We'll use websockets immediately. To date, no database framework has been used. Toshiyuki is working on other projects for the next 6 weeks, so the database will be implemented by me. I'm setting it up with Django simply because I'm familiar with it, it has all the features we need, and it supports websockets.

subwaystation commented 5 years ago

Although the topic is closed, I still want to give my opinion based on brief internet research:

https://www.fullstackpython.com/websockets.html
https://www.reddit.com/r/Python/comments/560gov/flask_or_django/
https://news.ycombinator.com/item?id=14690638
https://hackernoon.com/django-too-big-flask-too-small-tornado-just-right-d5d002586bbc
https://www.quora.com/Should-I-use-Django-channels-or-Django-websocket-Redis-for-a-real-time-web-application-based-in-Django
http://www.mindfiresolutions.com/blog/2018/05/flask-vs-django/
https://www.codementor.io/garethdwyer/flask-vs-django-why-flask-might-be-better-4xs7mdf8v

In essence, people recommend Flask for web services and Django for web applications or web pages. In the end, we want to build a web application, so I am happy with the Django decision, also because @josiahseaman has a lot of experience with it.