Produce yaml output files for cft consumption

metasoarous commented 6 years ago

We need a yaml file per dataset, according to spec from #216.

psathyrella commented 6 years ago

adding chalkshot from defunct issue

psathyrella commented 6 years ago

@csmall lmk what you think /fh/fast/matsen_e/processed-data/partis/laura-mb/latest/info.yaml

metasoarous commented 6 years ago

@psathyrella Looks good! Here are my requests/suggestions:

Please add a top-level key, something like dataset-id; it should be unique across any combination of study, build version, and any other setting that might lead to a separate run of partis.
Somewhere in here, add partition-cluster-annotations as appropriate.
I may not actually need the seed sequences themselves, as long as we have all of the seed ids somewhere.
We still need to settle on where to put the sequence-metadata pointer (probably, in general, via a sequence-metadata-filename attr). I could handle whichever is easier for you, but I imagine that specifying it either per sample or per dataset would work just fine.
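
One way to get a dataset-id with the uniqueness property described above is to derive it from the inputs that distinguish one partis run from another. The following is only a sketch: the function name `make_dataset_id`, the `settings` dict, and the `study-version-hash` layout are assumptions made for illustration, not something agreed on in this thread.

```python
import hashlib
import json


def make_dataset_id(study, build_version, settings=None):
    """Compose an id from study, build version, and any extra run settings.

    Hypothetical helper: the naming scheme here is an assumption.
    """
    settings = settings or {}
    if not settings:
        return f"{study}-{build_version}"
    # Serialize with sorted keys so that the same settings always hash the same way,
    # regardless of the order in which they were specified.
    payload = json.dumps(settings, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(payload).hexdigest()[:8]
    return f"{study}-{build_version}-{digest}"
```

Two runs that differ in any setting get different ids, while the same settings given in a different order map to the same id.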

Perhaps something like this would accommodate these issues?

dataset-id: blablah
sequence-metadata-filename: /src/bin/whatevs.csv
samples:
  <sample-id>:
    meta: ...
    partition-file: ...
    cluster-annotation-file: ...
    seeds:
      <seed-id>:
        meta: ?
        partition-file: ...
        cluster-annotation-file: ...
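
To make the nesting concrete, here is a minimal sketch of how a consumer might walk a structure shaped like the one above once it has been loaded into plain Python dicts. The sample ids, seed ids, and relative paths are made up for illustration, and the helper names `seed_ids` and `partition_files` are hypothetical; loading the YAML itself is left out.

```python
# Illustrative data only: ids and paths are invented, not taken from the real info.yaml.
info = {
    "dataset-id": "blablah",
    "sequence-metadata-filename": "/src/bin/whatevs.csv",
    "samples": {
        "sample-a": {
            "meta": {},
            "partition-file": "sample-a/partition.csv",
            "cluster-annotation-file": "sample-a/cluster-annotations.csv",
            "seeds": {
                "seed-1": {
                    "meta": {},
                    "partition-file": "sample-a/seed-1/partition.csv",
                    "cluster-annotation-file": "sample-a/seed-1/cluster-annotations.csv",
                },
            },
        },
        # A sample whose outputs have not been filled in yet.
        "sample-b": {"meta": {}, "seeds": {}},
    },
}


def seed_ids(info):
    """Return {sample-id: sorted list of seed ids} for every sample."""
    return {sid: sorted(sample.get("seeds", {})) for sid, sample in info["samples"].items()}


def partition_files(info):
    """Yield (sample-id, seed-id or None, path) for each partition file that is set."""
    for sid, sample in info["samples"].items():
        if sample.get("partition-file"):
            yield sid, None, sample["partition-file"]
        for seed_id, seed in sample.get("seeds", {}).items():
            if seed.get("partition-file"):
                yield sid, seed_id, seed["partition-file"]
```

With this layout all seed ids are recoverable from the keys under `seeds`, even if the seed sequences themselves are not included.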

psathyrella commented 6 years ago

ok v2 /fh/fast/matsen_e/processed-data/partis/laura-mb/latest/info.yaml

I don't remember where you put the translation files, so I just put a placeholder in there for 'per-sequence-meta-file'.

Also, not sure if it was obvious, but the general design philosophy is: write the initial file with all the metadata from datascripts/meta (hence all the empty dicts), then gradually fill things in with actual paths as we check that the corresponding steps have run successfully.
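
The skeleton-then-fill approach can be sketched roughly as follows. This is not the actual partis code: the function names, the shape of `samples_meta`, and the use of file existence as a stand-in for "ran successfully" are all assumptions made for illustration.

```python
import os


def make_skeleton(dataset_id, samples_meta):
    """Build the initial structure: metadata filled in, outputs left as empty dicts.

    samples_meta is assumed to look like {sample-id: {"meta": {...}, "seeds": [seed-id, ...]}}.
    """
    samples = {}
    for sample_id, entry in samples_meta.items():
        samples[sample_id] = {
            "meta": dict(entry.get("meta", {})),
            # Each seed starts as an empty dict and is filled in later.
            "seeds": {seed_id: {} for seed_id in entry.get("seeds", [])},
        }
    return {"dataset-id": dataset_id, "samples": samples}


def fill_if_present(entry, key, path, exists=os.path.isfile):
    """Record path under key only if the check passes; return True if it was recorded.

    The default check is plain file existence, which is only a proxy for a successful run.
    """
    if exists(path):
        entry[key] = path
        return True
    return False
```

An entry that is still an empty dict then signals that the corresponding output has not been produced or verified yet.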

metasoarous commented 6 years ago

@psathyrella I've got this more or less "working" now. There are still a few more kinks to work out, like #228, but I think we're getting close to a point where it would be nice to have a complete yaml fleshed out with all of the data you've built so far, so I can get all of the datasets built out (there were some schema updates that came out of the yaml refactor, so I'll need to rebuild all the datasets).

psathyrella commented 6 years ago

ok, I should be able to make yamls for everything in the next few days. LMK if that isn't soon enough.

metasoarous commented 6 years ago

That's perfect. Thanks.

matsengrp / cft

Produce yaml output files for cft consumption #221