ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Figure out how to automate test suite & allocate storage/devices/etc. #122

Closed flyingzumwalt closed 7 years ago

flyingzumwalt commented 7 years ago

Prerequisite: describe the tests that we're aiming for -- see the clarification in https://github.com/ipfs/archives/issues/102#issuecomment-275222719

Tasks:

Kubuxu commented 7 years ago

Data dump incoming:

  1. We need to check what tests we can run on a given host (either by checking free space or by asking the user at start for the maximum space to use).
  2. I think unique IDs will be very useful for sorting, selecting, and upgrading tests when they change.
  3. We need a unified test output format (NDJSON?), with a unique ID per host and per test version (to be able to correlate results); the test version can be bumped either by the test itself or by a global version. Unique IDs are IMO the easiest way to correlate these (and they are later easy to manage in DBs and so on). The format also has to include test parameters, results, (logs?), and additional partial results. A possible record shape is sketched after this list.
  4. We should probably run some tests repeatedly until the standard deviation is below a threshold, and record that standard deviation.
  5. I think a simple HTTP endpoint is a good bet for collecting test results for now; we can dump them into a flat file and use jq to sort through it, extract data, and so on.
  6. A NoSQL DB could be a good option for storing these results later.
  7. If we process the results in some way (jq filters, graphs; we should probably settle on one tool for that, R deals well with many data points, Octave not really), we should make those processes reproducible so others can rerun them later and/or with a different filter/test.
  8. "Dispatching server": we could have a server that test clients ask for tests to run (given their limitations and hardware).
  9. File system detection: different file systems probably degrade in performance in different ways. This leads to:
  10. System info: performance could be CPU, disk, or RAM/cache bound, so it is probably worth collecting this data along with small benchmarks.

Some of these are probably out of scope; we have to decide what is required and what can be deferred for now.
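On points 3 and 5, here is a rough idea of what a single NDJSON result record could look like, plus a jq filter over a flat file of such records. All field names are hypothetical; this is a sketch of the shape, not a proposal.

```json
{"host_id":"a1b2c3","test_id":"add-10k-small-files","test_version":3,"params":{"file_count":10000,"file_size":4096},"results":{"add_seconds":41.2,"std_dev":0.8}}
```

```sh
# mean add time per test across all hosts, from a flat file of NDJSON records
jq -s 'group_by(.test_id)
       | map({test: .[0].test_id, mean: (map(.results.add_seconds) | add / length)})' results.ndjson
```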

Kubuxu commented 7 years ago

@mejackreed I've looked at the partial metadata myself, but it is quite a complex format. Would it be possible to get the sizes of the files in the dataset as newline-delimited values in bytes, from a fragment of the dataset? It certainly doesn't need to be the whole thing; a few percent would probably be enough for us to observe the distribution and create a test case based on it.
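For example, if there is shell access to a directory holding a sample of the dataset, something along these lines (GNU find; the path is a placeholder) would produce that kind of list:

```sh
# newline-delimited file sizes in bytes for a sample of the dataset
find /path/to/sample -type f -printf '%s\n' > sizes.txt
```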

flyingzumwalt commented 7 years ago

Note: The data.gov data is aggregated from many federal agencies, so the content, its structure and the formats might be widely divergent.

Kubuxu commented 7 years ago

Yeah, it would be good to get a cross section sample, but any sample will be better than none.

mejackreed commented 7 years ago

I'll work on putting some samples together. What metrics would you prefer? Number of files and file size?

Kubuxu commented 7 years ago

I am mainly looking for file sizes, but number of files per 'publication' could be useful too.

flyingzumwalt commented 7 years ago

@gsf @hsanjuan @VictorBjelkholm can any of you help @Kubuxu land this? He needs help getting kubernetes set up.

Kubuxu commented 7 years ago

It is mostly about integrating this feature/test set: https://github.com/ipfs/archives/issues/102#issuecomment-273561565 into https://github.com/ipfs/kubernetes-ipfs. I was able to set up Kubernetes on my machine, but the real problem is implementing the tests.

A few preliminary questions:

  1. Is it possible to simply restart a pod with changed CLI arguments?
  2. What is the simplest way of injecting tools into running pods? Let's say I need to generate a complex directory structure and I already have a tool written for that, but it isn't in the pod (a rough kubectl sketch follows below).
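For reference, here is roughly how both might look with plain kubectl; the deployment/pod names and the tool binary are hypothetical:

```sh
# 1. change the daemon arguments on an existing deployment; Kubernetes will
#    recreate the pods with the new args
kubectl patch deployment ipfs --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args","value":["daemon","--enable-pubsub-experiment"]}]'

# 2. copy a locally built tool into a running pod
kubectl cp ./dirgen ipfs-pod-xyz:/usr/local/bin/dirgen
```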

Kubuxu commented 7 years ago

My idea right now is to write a simple wrapper/RPC server that would manage the ipfs instance inside the pod: it would prep the repo and allow choosing the ipfs version, its command-line arguments, config variables, and so on. It would also include all required tools and/or call them.

This means the wrapper/RPC server has to be created, along with a Docker image containing it.

I am not 100% sure it is the right path forward, but it seems like a good one to me.
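A very rough sketch of what such a wrapper could look like; the HTTP endpoint, request fields, port, and binary layout (/opt/ipfs/&lt;version&gt;/ipfs) are all assumptions for illustration, not a finished design:

```go
// Minimal in-pod wrapper sketch: an HTTP endpoint that preps the repo and
// starts a chosen ipfs binary with the requested arguments.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"path/filepath"
)

type startReq struct {
	Version string   `json:"version"` // e.g. "v0.4.6" (hypothetical layout: /opt/ipfs/<version>/ipfs)
	Args    []string `json:"args"`    // extra daemon arguments
}

func startHandler(w http.ResponseWriter, r *http.Request) {
	var req startReq
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	bin := filepath.Join("/opt/ipfs", req.Version, "ipfs")

	// prep the repo if it does not exist yet
	if _, err := os.Stat(filepath.Join(os.Getenv("IPFS_PATH"), "config")); os.IsNotExist(err) {
		if out, err := exec.Command(bin, "init").CombinedOutput(); err != nil {
			http.Error(w, string(out), http.StatusInternalServerError)
			return
		}
	}

	// start the daemon with the requested arguments
	cmd := exec.Command(bin, append([]string{"daemon"}, req.Args...)...)
	if err := cmd.Start(); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "started %s (pid %d)\n", bin, cmd.Process.Pid)
}

func main() {
	http.HandleFunc("/start", startHandler)
	http.ListenAndServe(":9000", nil)
}
```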

whyrusleeping commented 7 years ago

@Kubuxu that was the direction I was planning on going with https://github.com/ipfs/notes/issues/191

The client would be a program that manages an ipfs node, as you say, though yours would be a server (we make requests against it). In my testbed doc, there's a central server that all the clients connect to, which coordinates the tests and sends out commands.

victorb commented 7 years ago

> figure out how to include other binaries (ie. random files) into kubernetes container (@VictorBjelkholm can help with this)

Take a look at volumes: https://kubernetes.io/docs/user-guide/volumes/. They are kind of the same as Docker volumes, but a bit different.
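For example, a pod spec with an emptyDir volume mounted into the go-ipfs container for test data; the names and paths here are only illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ipfs-test
spec:
  containers:
  - name: ipfs
    image: ipfs/go-ipfs
    volumeMounts:
    - name: testdata
      mountPath: /data/testdata   # scratch space for generated test files
  volumes:
  - name: testdata
    emptyDir: {}
```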

Regarding binaries, what kind of binaries and for what purpose? Normally, you would wrap the binary in a container that you can deploy as usual, but it depends on what you're trying to achieve.

If you need binaries inside the go-ipfs container, just create a new go-ipfs-dev image based on the ipfs/go-ipfs one and include the necessary binaries in the build step.
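Something along these lines, where the copied tool is just a placeholder for whatever test helpers are needed:

```dockerfile
# go-ipfs-dev: the stock image plus the test tooling we need inside the pod
FROM ipfs/go-ipfs

# e.g. a pre-built directory-structure generator (hypothetical binary name)
COPY dirgen /usr/local/bin/dirgen
```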

> figure out how to dynamically generate kubernetes deployments with varying configs and daemon args (and referencing those deployments from tests)

I'm guessing this is so we can test different ~/.ipfs/config files and also pass, for example, experimental flags to the daemon.

I think the simplest route would be the one @jbenet explained in the call. Have one base deployment with the basic stuff for running a go-ipfs daemon.

Have a YML/JSON file listing the different configs/arguments you want to vary. Write a small tool that takes the base deployment plus the configuration files and generates every possible combination.
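As a sketch of the "generate every possible version" part, here is one way the combination step could look; the option values are made up and the output format (a JSON list of argument sets) is only an assumption:

```go
// Emit every combination of the listed options; each combination could then be
// templated into its own deployment based on the base YAML.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// permute builds the Cartesian product of the option groups.
func permute(opts [][]string, prefix []string, out *[][]string) {
	if len(opts) == 0 {
		*out = append(*out, prefix)
		return
	}
	for _, v := range opts[0] {
		next := append(append([]string(nil), prefix...), v) // copy, then extend
		permute(opts[1:], next, out)
	}
}

func main() {
	// example variations to cross: daemon flags x config settings (made up)
	opts := [][]string{
		{"--routing=dht", "--routing=none"},
		{"Datastore.NoSync=true", "Datastore.NoSync=false"},
	}
	var combos [][]string
	permute(opts, nil, &combos)
	json.NewEncoder(os.Stdout).Encode(combos)
	fmt.Fprintf(os.Stderr, "%d deployment variants\n", len(combos))
}
```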

> figure out how to aggregate the reports from the test runs & permutations

I had great success using Prometheus and it was relatively easy to set up. One thing to keep in mind is to log the test run's ID so we can have one report per test run.
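For instance, with the Go Prometheus client the run ID can simply be a label on every metric the test wrapper exports; the metric and label names below are made up:

```go
// Export test metrics labelled with the test run ID so Prometheus can keep
// one report per run.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var addLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "ipfs_test_add_seconds",
		Help: "Time taken by 'ipfs add' during a test run.",
	},
	[]string{"test_run", "test_id"},
)

func main() {
	prometheus.MustRegister(addLatency)

	// in the real wrapper this would be recorded around the actual `ipfs add`
	addLatency.WithLabelValues("run-42", "add-10k-small-files").Observe(41.2)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```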