What to do with csv input file relationships

metasoarous commented 5 years ago

@psathyrella I'm wondering what to do with cft regarding specification of a partition-file separate from the cluster-annotation-file. This used to be "supported" by CFT, and the README still more or less suggests that it is. However, some recent changes in the partis/python utils code have made it such that one cannot specify an arbitrary cluster-annotation-file option, and must instead infer this file's location from that of the partition-file.

I don't want to rely on the file naming relationship for CFT to function properly, as it feels messy, and it's annoying to have to explain in the instructions. We could solve this by either:

altering the partis/python function in question to allow the option of specifying a specific annotation csv file, or
saying that CFT only supports yaml output henceforth

I realize that your opinion was that yaml was really what you wanted to emphasize for the future, but I'm wrapped up in this question because @BrandenOlson is trying to run CSV data he got out of partis simulate through CFT, and is not sure what to put where. Is it possible for him to spit out yaml from partis simulate?

psathyrella commented 5 years ago

Yeah, sorry, I just don't think there's any way that anyone would ever need to specify a non-default cluster annotation file name. So I don't think you need to say that it only supports yaml -- it should work just fine with csv -- but I can't think of a reason that the naming convention should be mentioned in your docs. The only way that someone could get non-default naming is if they ran an old version of partis and then went and renamed the cluster annotation file. Which, I mean, it's not impossible but it really doesn't seem like a use case that is worth supporting/explaining. The way I set it up in cft when I switched this, the only file name they should have to deal with is the partition output file. If it's a csv file, yeah, there's this other file floating around next to it, but there's no reason for the user to know about it. Eliminating the extra file was a large motivation for switching to yaml.
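To make the naming relationship concrete, here is a minimal sketch of how an annotation file path could be derived from the partition file path. This is not the actual partis/cft code: the function name `infer_annotation_path` is hypothetical, and the `-cluster-annotations` suffix is an assumption that should be checked against the partis version in use.

```python
import os

# ASSUMPTION: the annotation file sits next to the partition file and differs
# only by this suffix. Verify against your partis version before relying on it.
ASSUMED_ANNOTATION_SUFFIX = "-cluster-annotations"


def infer_annotation_path(partition_path, suffix=ASSUMED_ANNOTATION_SUFFIX):
    """Return the inferred cluster annotation path for a csv partition file.

    Returns None for yaml input, since a yaml output file holds both the
    partitions and the annotations and there is no second file to find.
    """
    base, ext = os.path.splitext(partition_path)
    if ext.lower() in (".yaml", ".yml"):
        return None
    if ext.lower() != ".csv":
        raise ValueError("expected a .csv or .yaml file, got: %s" % partition_path)
    # Insert the suffix between the base name and the extension.
    return base + suffix + ext
```

With this kind of inference the user only ever passes the partition file; moving or renaming just one of the two csv files would break the relationship, which is the case discussed in this thread.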

As to csv simulation files -- those are neither old csv partition files nor old csv cluster annotation files, they're single-sequence annotations with a reco_id column to tell you who's clonally related to who. It would be a couple-line python script to read old simulation csvs and output new yamls, but if possible it's probably easier to rerun the simulation step with a yaml output file -- it just spits out csv or yaml depending on the file suffix you give it.
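As an illustration of how the reco_id column encodes clonal relationships, the following sketch groups single-sequence rows into families using only the standard library. It does not write partis yaml output; that format is not described here, so the conversion step is left out. The function name `group_by_reco_id` is hypothetical, and the sequence id column name `unique_id` is an assumption (only `reco_id` is mentioned above).

```python
import csv
from collections import defaultdict


def group_by_reco_id(csv_lines, id_column="unique_id", family_column="reco_id"):
    """Group sequence ids into clonal families keyed by reco_id.

    csv_lines is any iterable of text lines (an open file works).
    ASSUMPTION: id_column names the per-sequence id column; adjust it to
    match the header of your simulation csv.
    """
    families = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        # Rows sharing a reco_id come from the same rearrangement event.
        families[row[family_column]].append(row[id_column])
    return dict(families)


# Small in-memory example standing in for a simulation csv.
example_lines = [
    "unique_id,reco_id",
    "seq-a,1001",
    "seq-b,1001",
    "seq-c,2002",
]
families = group_by_reco_id(example_lines)
```

Each value in `families` is one clonal family, so the list of values plays the role of a partition.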

metasoarous commented 5 years ago

They would need to specify a different annotation file name if their calls were wrapped up in other processes which move the output files around, which @BrandenOlson's code does IIRC (correct me if I'm wrong Branden).

BrandenOlson commented 5 years ago

Yes, I do wrap these files in some R calls, but I don't mess with the naming conventions of the files that are generated by partis.

matsengrp / cft

What to do with csv input file relationships #259