Generalize cft and cftweb

metasoarous commented 7 years ago

Right now, cft and cftweb depend on a number of assumptions that are specific to our particular setting. On the ugly side of the spectrum, there's still a peppering of regexes for metadata extraction and file selection (though at least some of this has moved over to metadata files). More fundamentally, though, we're only really set up to process data from seed partitioning runs of partis. All of this should be generalized so that someone with partis output has a reasonably easy path to sticking their stuff in and having it run through.

@psathyrella How different might other partis output look? Will it generally have the same directory structure? What does it look like for data run through without seed partitioning?

Do you have any plans to make datascripts public? As that's presently the de facto specification of what data should look like as far as cft and cftweb are concerned, it's worth thinking about its role in, or relationship to, such a generalization endeavor. I'd welcome your thoughts along those lines as well.

psathyrella commented 7 years ago

datascripts is private 'cause Laura wanted it that way... in principle we could separate out the bits that she presumably wants private (the files in meta/ and seeds/ for her data) from the rest of datascripts, but then you'd still have to check out that private info somehow. Are private branches a thing?

The file formats are identical for seed and non-seed. The output directory/file structure is set by datascripts, so it should be consistent, but the complication is that for seed partition you're looping through a bunch of seeds for each data set, while for plain partition you're just running once for the data set.
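
To make that difference concrete, here's a rough sketch of how a consumer like cft might treat both cases uniformly. This is only an illustration: `planned_runs`, the dict keys, and the seed names are hypothetical, not anything datascripts or partis actually defines.

```python
from typing import Dict, List, Optional


def planned_runs(
    dataset: str, seeds: Optional[List[str]] = None
) -> List[Dict[str, Optional[str]]]:
    """Describe the partition runs expected for one data set.

    Hypothetical helper: seed partitioning gives one run per seed,
    while plain partitioning gives a single run with no seed.
    """
    if seeds:
        # Seed partition: loop over every seed for this data set.
        return [{"dataset": dataset, "seed": seed} for seed in seeds]
    # Plain partition: just one run for the whole data set.
    return [{"dataset": dataset, "seed": None}]


# Example with made-up seed names.
seeded = planned_runs("1g", ["seed-a", "seed-b"])
plain = planned_runs("1g")
```

Treating the plain case as a single run with `seed` set to `None` means downstream code only ever has to iterate over a list of runs, whichever mode produced the data.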

I just ran on a small sample of one of the data sets with output here:

/fh/fast/matsen_e/processed-data/partis/kate-qrs/vTMP/partitions/

with command:

./datascripts/run.py partition --study kate-qrs --dsets 1g --n-random-queries 1000 --extra-str=vTMP --check

metasoarous commented 7 years ago

Closing this issue since it has now been converted to epic issue #215.

matsengrp / cft

Generalize cft and cftweb #166