Open hrbrmstr opened 6 years ago
So https://rud.is/books/drill-sergeant-rstats/reading-a-streaming-json-ndjson-data-file-with-drill-r.html is a boilerplate recipe that's a bit more involved but may help, and I can add more recipes for other examples if needed.
@hrbrmstr Sure thing! A bunch of example JSON files here: https://gitlab.carlboettiger.info/cboettig/supertreebase/tree/master/json
These are JSON-LD representations of phylogenetic trees originally published in XML formats in the public scientific repository http://treebase.org, all CC0 / public domain.
ZOMGOSH THOSE ARE PERFECT!
I'm afraid I'm mostly using CSV and TSV files, so I'm very grateful to see chapter 4!
@benmarwick If you have some specific ones that are share-able, I can make topic-specific recipes as well.
Thanks, most recently I've been working with these https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/, and wondering if Drill might make it easier to work with them. As they are, those files are a bit impractical for an example. How about I get a small excerpt from one of those and share it here?
@benmarwick take a look at https://rud.is/books/drill-sergeant-rstats/working-with-custom-delimited-format-files.html and lemme know if that's tracking towards "helpful". Dealing with that last column will require a bit of Java work (to define a UDF, a user-defined function), but I was going to cover that anyway and this is a nice example for it. And, it's not as scary as it sounds (if it does, indeed, sound scary :-). Most Drill UDFs are really simple Java functions based on a template that's easy to modify.
https://rud.is/books/drill-sergeant-rstats/writing-simple-drill-custom-functions-udfs-for-field-transformations.html now has the Drill UDF necessary to make the last column more usable.
@cboettig What are some "typical" operations one would be performing on said phylogenetic tree data? I was able to tease out the "tree" with the query below, but this is one area where I've not handled enough SO questions to be familiar enough with the data to whip up examples (yes, I may answer SO questions both to help folks and to try to get a handle on other disciplines at the same time :-)
SELECT
  version, id,
  b.tree.node AS nodes,
  b.tree.edge AS edges
FROM (
  SELECT
    a.version,
    a.`@id` AS id,
    FLATTEN(a.trees.tree) AS tree
  FROM dfs.supertreebase.`/S100.json` a
  LIMIT 10
) b
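For anyone not used to Drill's FLATTEN, here is a rough plain-Python sketch of what the query above does: one output row per element of the nested `trees.tree` array, with the document-level fields repeated on each row. The sample document is a made-up stand-in (only `version`, `@id`, `trees.tree`, `node` and `edge` come from the query; the node ids and the `source`/`target` keys are hypothetical), so treat it as an illustration rather than the real S100.json layout.

```python
from itertools import islice


def flatten_trees(docs):
    """Mimic the inner SELECT: emit one row per element of doc['trees']['tree']."""
    for doc in docs:
        for tree in doc["trees"]["tree"]:
            yield {"version": doc["version"], "id": doc["@id"], "tree": tree}


def query(docs, limit=10):
    """Mimic the outer SELECT: pull node/edge out of each flattened tree row."""
    # LIMIT sits in the inner query, so it caps the number of flattened rows.
    for row in islice(flatten_trees(docs), limit):
        yield {
            "version": row["version"],
            "id": row["id"],
            "nodes": row["tree"]["node"],
            "edges": row["tree"]["edge"],
        }


# Hypothetical stand-in for one JSON document holding two trees.
docs = [{
    "version": "0.9",
    "@id": "S100",
    "trees": {"tree": [
        {"node": [{"@id": "n1"}, {"@id": "n2"}],
         "edge": [{"source": "n1", "target": "n2"}]},
        {"node": [{"@id": "n3"}], "edge": []},
    ]},
}]

rows = list(query(docs))
```

Each element of `rows` corresponds to one tree, which is the shape the later per-tree operations would start from.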
Great question. Common tasks might be:
identifying all trees which contain a given otu or set of otus (think "species"). Note that the "otu" given on the edge is a reference; one needs to check it against the corresponding otu "label" or, more ideally, an identifier URI for said otu.
computing the evolutionary distance between two otus: identifying trees containing both otus that also include length data on the edges, and summing the lengths of the edges back to the common ancestor. (A variation of this involves identifying "time trees", in which lengths on edges are scaled such that all tips are the same distance from the overall root of the tree.)
more pie-in-the-sky is the notion of constructing supertrees from existing trees. Some details here: https://github.com/cboettig/nexld/issues/3#issuecomment-351079229 (where we are exploring doing this via RDF/SPARQL, but it is non-trivial). This is the example I originally had in mind, which gave me the idea that Drill may be a more performant / practical approach for this.
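To make the first two tasks concrete, here is a minimal Python sketch. It assumes a simplified, hypothetical in-memory layout (an `otus` map from otu id to label, tip nodes carrying an `otu` reference, and edges with `source`, `target` and `length`), not the actual JSON-LD field names, and it assumes each tree is rooted with edges pointing from parent to child.

```python
def tip_nodes_by_label(tree):
    """Resolve otu references to labels: map otu label -> node id."""
    return {tree["otus"][n["otu"]]: n["id"] for n in tree["nodes"] if "otu" in n}


def trees_containing(trees, labels):
    """Task 1: keep only the trees that contain every requested otu label."""
    return [t for t in trees if set(labels) <= set(tip_nodes_by_label(t))]


def distance(tree, label_a, label_b):
    """Task 2: sum edge lengths from both tips back to their common ancestor."""
    tips = tip_nodes_by_label(tree)
    if label_a not in tips or label_b not in tips:
        return None
    # child node -> (parent node, length of the edge joining them)
    parents = {e["target"]: (e["source"], e["length"]) for e in tree["edges"]}

    # Cumulative length from tip a to each of its ancestors.
    node, total = tips[label_a], 0.0
    up_a = {node: 0.0}
    while node in parents:
        node, length = parents[node]
        total += length
        up_a[node] = total

    # Walk up from tip b until reaching an ancestor shared with tip a.
    node, total = tips[label_b], 0.0
    while node not in up_a:
        if node not in parents:
            return None  # no shared ancestor found
        node, length = parents[node]
        total += length
    return up_a[node] + total


# Hypothetical toy tree: root -> x -> (a, b), root -> c
tree = {
    "otus": {"otu1": "Homo sapiens", "otu2": "Pan troglodytes", "otu3": "Mus musculus"},
    "nodes": [
        {"id": "root"}, {"id": "x"},
        {"id": "a", "otu": "otu1"}, {"id": "b", "otu": "otu2"}, {"id": "c", "otu": "otu3"},
    ],
    "edges": [
        {"source": "root", "target": "x", "length": 1.0},
        {"source": "x", "target": "a", "length": 2.0},
        {"source": "x", "target": "b", "length": 3.0},
        {"source": "root", "target": "c", "length": 4.0},
    ],
}

d_close = distance(tree, "Homo sapiens", "Pan troglodytes")  # 2.0 + 3.0
d_far = distance(tree, "Homo sapiens", "Mus musculus")       # 2.0 + 1.0 + 4.0
```

Matching on labels is the weaker option mentioned above; with identifier URIs available, the same lookup could key on those instead.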
@cboettig / @benmarwick y'all wouldn't have some sample JSON I can use, would you? I'm technically not allowed to put "alot" of work-work JSON data out in the wild since it can enable attackers (it saves them the $ of doing recon scans, at least).
It'd also help me direct (what I think will be chapter 7/recipe 6) more specifically for your needs.
No worries if not. I'll either convert some data, ask for work-forgiveness (er, I mean, 'permission'), go on a JSON data hunt, or use some CVE data that isn't confidential but may not be the best example of JSON to help non-cyber folks.