hrbrmstr / drill-sergeant-rstats

📗 A Little Book About Using Apache Drill and R
https://rud.is/books/drill-sergeant-rstats/

data request (if possible) #1

Open hrbrmstr opened 6 years ago

hrbrmstr commented 6 years ago

@cboettig / @benmarwick y'all wldn't have some sample JSON I can use, would you? I'm technically not allowed to put "alot" of work-work JSON data out in the wild since it can enable attackers (it saves them the $ of doing recon scans, at least).

It'd also help me direct (what I think will be chapter 7/recipe 6) more specifically for your needs.

no worries if not. I'll either convert some data, ask for work-forgiveness (er, I mean, 'permission') or go on a JSON data hunt or use some CVE data that isn't confidential but may not be the best example of JSON to help non-cyber folks.

hrbrmstr commented 6 years ago

So https://rud.is/books/drill-sergeant-rstats/reading-a-streaming-json-ndjson-data-file-with-drill-r.html is a boilerplate recipe that's a bit more involved but may help, and I can add more recipes for other examples if needed.

cboettig commented 6 years ago

@hrbrmstr Sure thing! A bunch of example JSON files here: https://gitlab.carlboettiger.info/cboettig/supertreebase/tree/master/json

These are JSON-LD representations of phylogenetic trees originally published in XML formats in the public scientific repository http://treebase.org, all CC0 / public domain.

hrbrmstr commented 6 years ago

ty!

hrbrmstr commented 6 years ago

ZOMGOSH THOSE ARE PERFECT!

benmarwick commented 6 years ago

I'm afraid I'm mostly using CSV and TSV files, so I'm very grateful to see chapter 4!

hrbrmstr commented 6 years ago

@benmarwick If you have some specific ones that are share-able, I can make topic-specific recipes as well.

benmarwick commented 6 years ago

Thanks, most recently I've been working with these https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/, and wondering if Drill might make them easier to work with. As they are, those files are a bit impractical for an example. How about I get a small excerpt from one of those and share it here?

hrbrmstr commented 6 years ago

@benmarwick take a look at https://rud.is/books/drill-sergeant-rstats/working-with-custom-delimited-format-files.html and lemme know if that's tracking towards "helpful". Dealing with that last column will require a bit of Java work (to define a UDF, i.e. a user-defined function), but I was going to cover that anyway and this is a nice example for it. And, it's not as scary as it sounds (if it does, indeed, sound scary :-). Most Drill UDFs are really simple Java functions based on a template that's easy to modify.

hrbrmstr commented 6 years ago

https://rud.is/books/drill-sergeant-rstats/writing-simple-drill-custom-functions-udfs-for-field-transformations.html now has the Drill UDF necessary to make the last column more usable.

hrbrmstr commented 6 years ago

@cboettig What are some "typical" operations one wld be performing on said phylogenetic tree data? I was able to tease out the "tree" but this is one area I've not handled enough SO questions on to be familiar with the data enough to whip up examples (yes, I may answer SO questions both to help folks and to try to get a handle on other disciplines at the same time :-)

SELECT
  version, id,
  b.tree.node AS nodes,
  b.tree.edge AS edges
FROM (
  SELECT
    a.version,
    a.`@id` AS id,
    FLATTEN(a.trees.tree) AS tree
  FROM dfs.supertreebase.`/S100.json` a
  LIMIT 10
) b
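
For anyone following along who hasn't used FLATTEN before, here's a rough sketch in plain Python (not Drill, and not how Drill implements it) of what the query above is doing: each element of the `trees.tree` array becomes its own row, with `version` and `@id` repeated on every row. The sample record is invented for illustration; only the field names (`version`, `@id`, `trees.tree`, `node`, `edge`) come from the query, and the real files will have more in them.

```python
# Illustration only: plain Python mimicking the shape of the Drill query.
# The sample record is made up; only the field names come from the query.

def flatten_trees(doc):
    """Return one row per element of doc["trees"]["tree"],
    repeating version and @id on every row."""
    rows = []
    for tree in doc.get("trees", {}).get("tree", []):
        rows.append({
            "version": doc.get("version"),
            "id": doc.get("@id"),       # a.`@id` AS id
            "nodes": tree.get("node"),  # b.tree.node AS nodes
            "edges": tree.get("edge"),  # b.tree.edge AS edges
        })
    return rows


# Hypothetical record with two trees under trees.tree
sample_doc = {
    "version": "0.1",
    "@id": "S100",
    "trees": {
        "tree": [
            {
                "node": [{"@id": "n1"}, {"@id": "n2"}],
                "edge": [{"source": "n1", "target": "n2"}],
            },
            {"node": [{"@id": "n3"}], "edge": []},
        ]
    },
}

rows = flatten_trees(sample_doc)
for row in rows:
    print(row["id"], len(row["nodes"]), len(row["edges"]))
```

In this sketch a record with no `trees.tree` entries simply yields no rows; I'd check against the Drill docs before assuming the real FLATTEN treats empty or missing arrays the same way.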

image

cboettig commented 6 years ago

Great question. Common tasks might be: