matsengrp / cft

Clonal family tree

RDF data model #167

Closed metasoarous closed 7 years ago

metasoarous commented 7 years ago

@matsen and I have been talking through an idea about how we might organize data around cft and cftweb (and perhaps other projects) to facilitate interoperability.

Problem statement

Take a csv or fasta file. What does it mean? What other data does it relate to? Often, we have no idea outside of the context of the filename and the directory structure in which we found it. This organization is generally study/tool/researcher specific, and frequently changes over time. Consequently, plugging data from one project into another is often a painful and laborious process fraught with peril, and we spend our days munging and merging data from one file into another.

What would bioinformatics look like if we could think about "all the data" as a cohesive whole, rather than a menagerie of tenuously related files? I'm inspired in this by my experience with databases, where having a single cohesive view of "all the data" together with a powerful query language gives us greater leverage over the relationships within that data. However, there are problems here:

So the problem is this: How can we think about our data as a cohesive relational whole without enduring the pitfalls of traditional databases? And furthermore, how can we leverage a solution to this problem to make the grunt work of connecting things in bioinformatics pipelines more manageable/automated?

RDF

Erick and I have been considering an adaptation of RDF (Resource Description Framework) towards this end. RDF is a cornerstone of the Semantic Web, the aim of which is to "[allow] data to be shared and reused across application, enterprise, and community boundaries". Fundamentally, RDF organizes data as global assertions of [entity, attribute, value] triples, each a fact about the entity it describes.

As far as bioinformatics is concerned:

All of this is fine and dandy, but adopting a new format and data paradigm doesn't come for free. As such, I've been thinking about how we can boil RDF down to its most essential properties and make it easier to use.

Formatting

The standard formats for RDF data (see rdf-xml and turtle) are likely a hard sell for a lot of bioinformaticians. However, I believe we can describe the essential ingredients of RDF with nothing more than JSON objects and globally namespaced attributes. In Python, this maps 1-1 to nested dictionaries and lists:

# note the globally namespaced attribute keys
dataset = {
    # Our dataset structure
    "bio.dataset:id": "kate-qrs-v9-2017-12-11",
    "bio.dataset:description": "CFT data from that one study we did",                                                                                                                                                                                                                                                                                     
    "bio.dataset:params": {
        "cft.params:datadir": "/fh/fast/matsen_e/processed-data/partis/kate-qrs/v9",
        "cft.params:asr_prog": "dnaml"}}

# helper to add default ns to a dict, if your fingers are sore
dataset = with_ns("bio.dataset", {
    "id": "kate-qrs-v9-2017-12-11",
    "description": "CFT data from that one study we did",
    "params": with_ns("cft.params", {
        "datadir": "/fh/fast/matsen_e/processed-data/partis/kate-qrs/v9",
        "asr_prog": "dnaml"})})

# alternative; use splatted **kw_args, cause #!@# all those quotes
dataset = in_ns("bio.dataset",
    id = "kate-qrs-v9-2017-12-11",
    description = "CFT data from that one study we did",
    params = in_ns("cft.params",
        datadir = "/fh/fast/matsen_e/processed-data/partis/kate-qrs/v9",
        asr_prog = "dnaml"))

import json
# other_facts stands in for whatever additional entity maps we want to record
with open("metadata.json", "w") as fh:
    json.dump([dataset, other_facts], fh)

Here, each map represents an entity, and each key/value pair corresponds to an [entity, attribute, value] triple for that entity. Nested maps translate to relationships between entities.
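
The with_ns/in_ns helpers above aren't implemented anywhere yet; here's a minimal sketch of what they might look like (non-recursive, since nested maps get wrapped explicitly by the caller):

def with_ns(ns, d):
    # prefix every key of d with the given namespace
    return {"{}:{}".format(ns, k): v for k, v in d.items()}

def in_ns(ns, **kwargs):
    # same idea, but taking keyword args so you can skip the quotes
    return with_ns(ns, kwargs)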

To generally cover RDF, we also need to be able to create arbitrary relationships between objects (not just hierarchical, nested maps). Assuming the entity you want to link to has some unique identifier attribute (e.g. bio.seq.alphabet:name), you can reference that entity using a map containing that attr/value pair. E.g.

seqrecord = {
    "bio.seq:id": "19983-3",
    "bio.seq:alphabet": {"bio.seq.alphabet:name": "DNA"},
    "bio.seq:string": "AGCTGTGGCTAAGTCGAGCTGATCGGATACG"}

This solution is intuitive, simple, flexible, and expressive, and is just enough structure to get us a 1-1 mapping with traditional RDF for when that might come in handy.
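
To illustrate that 1-1 mapping, here's a rough (purely hypothetical) sketch of flattening these maps into [entity, attribute, value] triples, minting a blank-node id for any nested entity that doesn't carry its own identifier attribute:

import itertools

_blank_ids = itertools.count()

def to_triples(entity, id_attrs=("bio.dataset:id", "bio.seq:id")):
    # pick an existing identifier attribute if present, otherwise mint a blank node id
    eid = next((entity[a] for a in id_attrs if a in entity), None)
    if eid is None:
        eid = "_blank-{}".format(next(_blank_ids))
    triples = []
    for attr, value in entity.items():
        if isinstance(value, dict):
            sub_eid, sub_triples = to_triples(value, id_attrs)
            triples.append([eid, attr, sub_eid])
            triples.extend(sub_triples)
        else:
            triples.append([eid, attr, value])
    return eid, triples

# e.g. to_triples(seqrecord) links "19983-3" to a blank node carrying
# the bio.seq.alphabet:name "DNA" fact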

Utilities / tooling

Elegant representations aside, no one is going out tomorrow to translate all of their data over to RDF triples, in any shape or form. And as much as nested directory structures of fasta/csv files create problems, they also enable a lot of powerful interactivity and automation via unix utils etc. So we need to be able to interface with and complement these typical bioinformatic workflows.

To this end, I think we could build a small set of tools for facilitating interoperability and automation:

Conclusion

On the immediate horizon, I'm looking at #166 and trying to figure out how we consume arbitrary partis data, and the above proposal would offer a much clearer path forward. As far as cft is concerned, this is already fairly similar to our present model of metadata.json files pointing to fasta, newick and svg data. Taking the steps above would only serve to simplify and empower the setup we already have, with fairly minimal investment.

psathyrella commented 7 years ago

Sounds sensible to me, although I don't think I understand it well enough to know precisely what it'll actually entail. i.e., lmk what I actually need to do when the time comes...

metasoarous commented 7 years ago

Awesome :-) I think at the interface of partis and cft (or partis and whatever), there might be room for some tools that would help produce this data. But I don't know yet how much of that would have to be worked into your end of things vs stuff in cft vs more general utilities. I think the next step would be to sketch out this data model together in a little more detail, and figure out how it maps to partis output and usage. From there it should be clearer how the pieces fall together.

metasoarous commented 7 years ago

Some pertinent resources/links:

matsen commented 7 years ago

Thanks, @metasoarous !

Others: although this seems like a bunch of formalism, etc, in a sense it's an organic outgrowth of what we already have: a collection of files that refer to each other. We need to take some step to avoid everything becoming a mess, and this is one step we could take in that direction without shoving everything in an SQL database. If people would prefer that direction say so now!

Chris, could you address:

  1. In a few sentences, what is your second best alternative. Postgres?

  2. Can you clarify the scope of

A declarative RDF language for describing how a directory structure maps to an RDF ontology, for abstracting the interface to a pipeline that doesn't use RDF.

  3. Can you sketch (only a few sentences) how this would interact with client-side responsive JS? Would all of this be loaded into some other client-side store? Would it be in a triple-store like Jena and loaded dynamically?

Thanks!

metasoarous commented 7 years ago

  1. Postgres would be a good choice for an SQL database, as would SQLite if we just wanted the database as a file. We could also look at some of the NoSQL options out there, which in general give you more schema flexibility than SQL, though often at the cost of query expressiveness. However, Postgres actually has a JSON column type now, so you can effectively use it like a Mongo-esque document store, but still be able to create relationships between documents (and pol.is is now doing this because Mongo bites). So in the NoSQL space, it's more worth looking towards graph databases, as then you'd improve on both query expressiveness and data polymorphism. To be clear though, the nice thing about RDF is that it is a graph data specification, so it's at least as expressive as all of the above, and can export to any of them when needed.

  2. The goal would be to parameterize construction of (ideally bidirectional) mappings between RDF data and directory layouts. You'd start with an RDF description of how the various entities relate to one another, and then you'd specify a directory nesting pattern referencing these relationships. That could then be interpreted as a set of instructions for either slurping data from directory structures into RDF, or spitting out directory structures and files from RDF.
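
As a toy example of what such a mapping spec might look like (all names here are invented for illustration, not a proposal for the actual spec):

# hypothetical declarative mapping between a directory layout and RDF attributes
mapping = {
    "pattern": "{bio.dataset:id}/{cft.cluster:id}/cluster.fa",
    "content": {"cft.cluster:sequences": "fasta"}}

# read one way, this slurps the matched files into RDF entities;
# read the other way, it spits entities back out as directories and files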

  3. There are several ways to go here. DataScript is a client-side JS store built for triple data (modeled after Datomic), with Datalog queries and simple pull query functionality. You could also translate into a form suitable for whatever other data store you might want to use on the client. Finally, you could simply create some HTTP API endpoints which take a query and return a JSON representation of the query results (executed by Jena or whatever), and just call from the client whenever you need data. That last approach does come with performance costs, but can also be useful when you want to avoid loading all data on the client (for memory/bandwidth/whatever). You can also split the difference and load a client store/cache dynamically using such queries. So there's a fair bit of flexibility on this side of things.
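
For the HTTP endpoint option, a minimal sketch (Flask chosen arbitrarily here; execute_query is a hypothetical function backed by Jena or whatever store we end up picking):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/query", methods=["POST"])
def query():
    q = request.get_json()       # query posted by the client
    results = execute_query(q)   # hypothetical; runs q against the triple store
    return jsonify(results)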

Right now, I'm sketching out what the RDF relationships might look like. I think this will go a long way toward getting a better handle on the work involved.

matsen commented 7 years ago

I'm generally on board, though I saw something in a diagram yesterday that woke me up tonight! (Though I didn't really process it at the time.)

Are you proposing taking this data model all the way down to edges on trees? Trees have their own format already, and loads of software to represent them and manipulate them. Taking the data model down to this level will mean that we'll have to rewrite all that code, for no purpose.

metasoarous commented 7 years ago

The diagram in question:

[diagram: draft of the proposed RDF data model]

This is by no means a final draft, so there's plenty up for discussion here.

For the record though, as with Fasta vs seq-set, the model I have in mind doesn't preclude using Newick files. This is just the "ingest" structure, if you will. So folks can continue to spit out and use newick as they wish. But if someone wants to query the data as linked RDF, this is what the relational model would look like.

The reason I made edges separate is so we can directly annotate them with information (a la #2, #159, etc.), rather than having to kludge around labeling nodes with edge data. Having nodes separate from edges also means we can link nodes and seqs. I think the goal of the RDF structure is to be as general and expressive as possible, to give us maximal flexibility in what we can ingest and spit out. But please let me know if you see a problem with this. Would you rather we have a simple tree.node:children attribute and store edge information on the node?
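
To make the two shapes concrete (attribute names here are just illustrative, not final):

# edges as first-class entities, so annotations can live directly on the edge
edge = {
    "tree.edge:parent": {"tree.node:id": "node-12"},
    "tree.edge:child": {"tree.node:id": "node-37"},
    "tree.edge:length": 0.0042,
    "tree.edge:mutations": ["A113G", "C201T"]}

# alternative: a plain children attribute, with edge data stored on the child node
node = {
    "tree.node:id": "node-12",
    "tree.node:children": [
        {"tree.node:id": "node-37",
         "tree.node:parent_edge_length": 0.0042}]}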

matsen commented 7 years ago

The flexibility is a nice advantage.

Here's a question that may clarify things. Let's say that we want to download a Newick string for a tree, which is something users will want. Will you write a function to traverse this graph and build this string? Will it work by doing recursive queries? At the very least, we'll want to be careful to set up the data structures so that this sort of operation can be fast.

And here's something that may argue for your approach. We will want to have heavy and light chain trees, for which certain pairs of nodes are known to be connected.

[plot: paired heavy and light chain trees with corresponding nodes connected]

If this plot doesn't make sense, ask @krdav , who is working on related things with Arman.

metasoarous commented 7 years ago

That is very cool! And yes, I see where you're going with being able to link nodes in different trees. We can certainly facilitate that. @krdav How do you presently represent (as file data) the connections between the nodes in these trees? Is it just joining on the node's name/id in separate Newick strings?

Either way we structure it (with nodes and edges, or nodes and children), there's an easy-to-construct pull query that can grab the required data efficiently (in one pass), returning nested dictionaries that could be quickly traversed into a Newick string. And while I think it might be nice to be able to dynamically construct these representations, we can also simply retain the filename or input Newick string as an attribute of the tree entity for pre-compiled access. In short, I'm not worried about performance.
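
A rough sketch of that traversal, assuming the pull query hands back each node as a dict with hypothetical tree.node:id, tree.node:distance and tree.node:children attributes:

def to_newick(node):
    # recursively render a pulled node dict as a Newick subtree string
    label = node.get("tree.node:id", "")
    length = node.get("tree.node:distance")
    children = node.get("tree.node:children", [])
    subtree = ""
    if children:
        subtree = "(" + ",".join(to_newick(c) for c in children) + ")"
    return subtree + label + (":{}".format(length) if length is not None else "")

# to_newick(root) + ";" gives the full Newick string for the tree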

krdav commented 7 years ago

We don't have an elegant way of dealing with this, but for reference you can look at this: pairs.tree.txt

The node name is the native chain (VH in this case) and heavy-light.pair is the corresponding pair.

metasoarous commented 7 years ago

There may be ongoing extension of the data model, but as of merge c38da0a all of our data is in this RDF/tripl format!