Sounds sensible to me, although I don't think I understand it well enough to know precisely what it'll actually entail. i.e., lmk what I actually need to do when the time comes...
Awesome :-) I think at the interface of partis and cft (or partis and whatever), there might be room for some tools that would help produce this data. But I don't know yet how much of that would have to be worked into your end of things vs stuff in cft vs more general utilities. I think the next step would be to sketch out this data model together in a little more detail, and figure out how it maps to partis output and usage. From there it should be clearer how the pieces fall together.
Some pertinent resources/links:
Thanks, @metasoarous !
Others: although this seems like a bunch of formalism, etc, in a sense it's an organic outgrowth of what we already have: a collection of files that refer to each other. We need to take some step to avoid everything becoming a mess, and this is one step we could take in that direction without shoving everything in an SQL database. If people would prefer that direction say so now!
Chris, could you address:
In a few sentences, what is your second best alternative. Postgres?
Can you clarify the scope of
A declarative RDF language for describing how a directory structure maps to an RDF ontology, for abstracting the interface to a pipeline that doesn't use RDF.
Thanks!
Postgres would be a good choice for a SQL database, as would SQLite if we just wanted the database as a file. We could also look at some of the NoSQL options out there, which in general give you more flexibility in schema than SQL, though often at the cost of query expressiveness. However, Postgres actually has a JSON column type now, so you can effectively use it like a Mongo-esque document store, but still be able to create relationships between documents (and pol.is is now doing this because Mongo bites). So in the NoSQL space, it's more worth looking towards graph databases, since then you'd improve on both query expressiveness and data polymorphism. To be clear though, the nice thing about RDF is that it is a graph data specification, so it's strictly as or more expressive than all of the above, and can export to them when needed.
The goal would be to parameterize construction of (ideally bidirectional) mappings between RDF data and directory layouts. You'd start with an RDF description of how the various entities relate to one another, and then you'd specify a directory nesting pattern referencing these relationships. That could then be interpreted as a set of instructions for either slurping data from directory structures into RDF, or spitting out directory structures and files from RDF.
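As a rough illustration of what such a bidirectional mapping might look like (everything here is hypothetical: the path template syntax, the attribute names, and the shortcut of using the path itself as the entity id are all invented for the sketch):

```python
# Hypothetical sketch: a path template whose placeholders name RDF attributes.
# The same template drives both directions: slurping triples out of a concrete
# path, and rendering a path back out of triples.
PATTERN = "{cft.subject:id}/{cft.timepoint:id}/seqs.fasta"

def path_to_triples(path, pattern=PATTERN):
    """Emit (entity, attribute, value) triples from a concrete path.
    For simplicity, the path itself stands in as the entity id."""
    triples = []
    for part, template in zip(path.split("/"), pattern.split("/")):
        if template.startswith("{") and template.endswith("}"):
            triples.append((path, template[1:-1], part))
    return triples

def triples_to_path(triples, pattern=PATTERN):
    """Render a path from triples: the inverse direction of the mapping."""
    lookup = {attr: value for _, attr, value in triples}
    return "/".join(lookup.get(t[1:-1], t) if t.startswith("{") else t
                    for t in pattern.split("/"))
```

So a file at `QA255/V10/seqs.fasta` would yield subject and timepoint triples, and those triples alone are enough to reconstruct the path.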
There are several ways to go here. DataScript is a JS/client-side store built for triple data (modeled after Datomic), with Datalog queries and a simple pull query functionality. You could also translate into a form suitable for whatever other data store you might want to use on the client. Finally, you could simply create some HTTP API endpoints which take a query and return a JSON representation of the query results (executed by Jena or whatever), and just call them from the client whenever you need data. The last approach does come with performance costs, but can also be useful when you want to avoid loading all the data on the client (for memory/bandwidth/whatever). You can also split the difference and load a client store/cache dynamically using such queries. So there's a fair bit of flexibility on this side of things.
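To make the query-endpoint idea concrete, here's a toy in-memory pattern match over triples, returning JSON-friendly rows the way a hypothetical HTTP endpoint might (purely illustrative; this is not Jena's or DataScript's actual API):

```python
import json

def query(store, pattern):
    """Match a single (entity, attribute, value) pattern against a list of
    triples, treating None as a wildcard; return JSON-friendly result rows."""
    return [dict(zip(("entity", "attribute", "value"), triple))
            for triple in store
            if all(p is None or p == t for p, t in zip(pattern, triple))]

store = [
    ("seq1", "bio.seq:id", "seq1"),
    ("seq1", "bio.seq:seq", "ACGT"),
    ("seq2", "bio.seq:id", "seq2"),
]

# What an endpoint might serialize and return for "all bio.seq:id facts":
payload = json.dumps(query(store, (None, "bio.seq:id", None)))
```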
Right now, I'm sketching out what the RDF relationships might look like. I think this will go a long way toward giving us a better handle on the work involved.
I'm generally on board, though I saw something in a diagram yesterday that woke me up tonight! (Though I didn't really process it at the time.)
Are you proposing taking this data model all the way down to edges on trees? Trees have their own format already, and loads of software to represent them and manipulate them. Taking the data model down to this level will mean that we'll have to rewrite all that code, for no purpose.
The diagram in question:
This is by no means a final draft, so there's plenty up for discussion here.
For the record though, as with Fasta vs seq-set, the model I have in mind doesn't preclude using Newick files. This is just the "ingest" structure, if you will. So folks can continue to spit out and use newick as they wish. But if someone wants to query the data as linked RDF, this is what the relational model would look like.
The reason I made edges separate is so we can directly annotate them with information (à la #2, #159, etc.), rather than having to kludge around labeling nodes with edge data. Having nodes separate from edges also means we can link nodes and seqs. I think the goal of the RDF structure is to be as general and expressive as possible, to give us maximal flexibility in what we can ingest and spit out. But please let me know if you see a problem with this. Would you rather we have a simple tree.node:children attribute and store edge information on the node?
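To make the two alternatives concrete, here is how the same two-node tree might look under each modeling (the attribute names are invented for the sake of the example):

```python
# (a) Edges as first-class entities, so they can carry their own annotations
# (e.g. mutation counts), and nodes can link to seqs independently.
edge_model = [
    {"db:ident": "node1", "tree.node:seq": {"bio.seq:id": "seq1"}},
    {"db:ident": "node2", "tree.node:seq": {"bio.seq:id": "seq2"}},
    {"tree.edge:parent": {"db:ident": "node1"},
     "tree.edge:child": {"db:ident": "node2"},
     "tree.edge:length": 0.42,
     "tree.edge:n-mutations": 3},
]

# (b) A simple children attribute, with the edge data kludged onto the child.
node_model = [
    {"db:ident": "node1",
     "tree.node:children": [{"db:ident": "node2",
                             "tree.node:parent-edge-length": 0.42}]},
]
```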
The flexibility is a nice advantage.
Here's a question that may clarify things. Let's say that we want to download a Newick string for a tree, which is something users will want. Will you write a function to traverse this graph and build this string? Will it work by doing recursive queries? At the very least, we'll want to be careful to set up the data structures so that this sort of operation can be fast.
And here's something that may argue for your approach. We will want to have heavy and light chain trees, for which certain pairs of nodes are known to be connected.
If this plot doesn't make sense, ask @krdav , who is working on related things with Arman.
That is very cool! And yes, I see where you're going with being able to link nodes in different trees. We can certainly facilitate that. @krdav How do you presently represent (as file data) the connections between the nodes in these trees? Is it just joining on the node's name/id in separate Newick strings?
Either way we structure it (with nodes and edges, or nodes and children), there's an easy-to-construct pull query that can grab the required data efficiently (in one pass), returning nested dictionaries that could be quickly traversed into a Newick string. And while I think it might be nice to be able to dynamically construct these representations, we can also simply retain the filename or input Newick string as an attribute of the tree entity for pre-compiled access. In short, I'm not worried about performance.
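As a sketch of why I'm not worried: assuming a pull query hands back nested dictionaries shaped roughly like the below (the key names are hypothetical), rendering Newick is a single recursive pass:

```python
def to_newick(node):
    """Render a nested-dict tree (as a pull query might return it) into a
    Newick string. Key names here are hypothetical."""
    children = node.get("tree.node:children", [])
    s = ""
    if children:
        s = "(" + ",".join(to_newick(c) for c in children) + ")"
    s += node.get("tree.node:name", "")
    length = node.get("tree.node:length")
    if length is not None:
        s += ":" + str(length)
    return s

tree = {"tree.node:name": "root",
        "tree.node:children": [
            {"tree.node:name": "A", "tree.node:length": 0.1},
            {"tree.node:name": "B", "tree.node:length": 0.2}]}

# to_newick(tree) + ";"  ->  "(A:0.1,B:0.2)root;"
```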
We don't have an elegant way of dealing with this, but for reference you can look at this: pairs.tree.txt
The node name is the native chain (VH in this case) and heavy-light.pair is the corresponding pair.
There may be ongoing extension of the data model, but as of merge c38da0a all of our data is in this RDF/tripl format!
@matsen and I have been talking through an idea about how we might organize data around cft and cftweb (and perhaps other projects) to facilitate interoperability.
Problem statement
Take a csv or fasta file. What does it mean? What other data does it relate to? Often, we have no idea outside of the context of the filename and the directory structure in which we found it. This organization is generally study/tool/researcher specific, and frequently changes over time. Consequently, plugging data from one project into another is often a painful and laborious process fraught with peril, and we spend our days munging and merging data around from one file into another.
What would bioinformatics look like if we could think about "all the data" as a cohesive whole, rather than a menagerie of tenuously related files? I'm inspired in this by my experience with databases, where having a single cohesive view of "all the data" together with a powerful query language gives us greater leverage over the relationships within that data. However, there are problems here:
So the problem is this: How can we think about our data as a cohesive relational whole without enduring the pitfalls of traditional databases? And furthermore, how can we leverage a solution to this problem to make the grunt work of connecting things in bioinformatics pipelines more manageable/automated?
RDF
Erick and I have been considering an adaptation of RDF (Resource Description Framework) toward this end. RDF is a cornerstone of the Semantic Web, the aim of which is to "[allow] data to be shared and reused across application, enterprise, and community boundaries". Fundamentally, RDF organizes data as global assertions of [entity, attribute, value] triples: facts about the entities they describe. As far as bioinformatics is concerned:
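For instance, a handful of hypothetical triples about a single sequence entity (attribute names invented for illustration):

```python
# Each triple is one fact about an entity; together they describe seq123.
triples = [
    ("seq123", "bio.seq:id", "seq123"),
    ("seq123", "bio.seq:seq", "GATTACA"),
    ("seq123", "cft.seq:timepoint", "V10"),
]
```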
All of this is fine and dandy, but adopting a new format and data paradigm doesn't come for free. As such, I've been thinking about how we can boil RDF down to its most essential properties and make it easier to use.
Formatting
The standard formats for RDF data (see rdf-xml and turtle) are likely a hard sell for a lot of bioinformaticians. However, I believe we can describe the essential ingredients of RDF with nothing more than JSON objects and globally namespaced attributes. In Python, this would map 1-1 to nested dictionaries and lists:
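Something along these lines (a hypothetical sketch; the attribute names are illustrative):

```python
# One entity per map; nested maps express relationships between entities.
data = [
    {"bio.seq:id": "seq123",
     "bio.seq:seq": "GATTACA",
     "bio.seq:alphabet": {"bio.seq.alphabet:name": "dna",
                          "bio.seq.alphabet:letters": "ACGT"}},
]
```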
Here, each map represents an entity, and each key/value pair corresponds to an [entity, attribute, value] triple for that entity. Nested maps translate to relationships between entities. To cover RDF generally, we also need to be able to create arbitrary relationships between objects (not just hierarchical, nested maps). Assuming the entity you want to link to has some unique identifier attribute (e.g. bio.seq.alphabet:name), you can reference that entity using a map containing that attr/value pair. This solution is intuitive, simple, flexible, and expressive, and is just enough structure to get us a 1-1 mapping with traditional RDF for when that might come in handy.
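For example (hypothetical, as above), a second sequence could point at the dna alphabet entity by its unique name instead of nesting the full definition:

```python
another_seq = {
    "bio.seq:id": "seq456",
    "bio.seq:seq": "ACCA",
    # Reference by unique identifier attribute rather than re-declaring:
    "bio.seq:alphabet": {"bio.seq.alphabet:name": "dna"},
}
```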
Utilities / tooling
Elegant representations aside, no one is going out tomorrow to translate all of their data over to RDF triples, in any shape or form. And as much as nested directory structures of fasta/csv files create problems, they also enable a lot of powerful interactivity and automation via unix utils etc. So we need to be able to interface with and complement these typical bioinformatic workflows.
To this end, I think we could build a small set of tools for facilitating interoperability and automation:
Conclusion
On the immediate horizon, I'm looking at #166 and trying to figure out how we consume arbitrary partis data, and the above proposal would offer a much clearer path forward. As far as cft is concerned, this is already fairly similar to our present model of metadata.json files pointing to fasta, newick, and svg data. Taking the steps above would only serve to simplify and empower the setup we already have, with fairly minimal investment.