Closed jeromekelleher closed 4 years ago
The naming looks fine to me, although plink2sgkit
might be slightly preferable.
The naming looks fine to me, although plink2sgkit might be slightly preferable.
Sure, I can see the advantage there. I'm happy to follow the consensus on this one.
Looks great to me @jeromekelleher. I think this will be a good place to build up some of the necessary options or hard-coded behaviors that should live between calls to create the readers (like read_plink
) and .to_zarr
.
Thanks for the quick work on this @jeromekelleher!
Before we land a CLI tool in this repo, I just wanted to step back and think about the CLI interaction we want for sgkit
users.
I tend to prefer tools that have a single point of entry for CLI interactions such as git
or conda
. Is there a way we could have a single sgkit
CLI tool that could have subcommands for each file format?
I tend to prefer tools that have a single point of entry for CLI interactions such as git or conda. Is there a way we could have a single sgkit CLI tool that could have subcommands for each file format?
Yes, good call, let's think this through. For the core sgkit
command line ideally I'd like to see something like
sgkit info <sgk dataset> - high level details about a sgkit zarr dataset. Size, compression etc.
sgkit ls <sgk_dataset> - List the arrays in the dataset (with -l, show more details)
sgkit rechunk <sgk_dataset> - recompute chunk sizes
sgkit import-plink <input data> <output sgk dataset> - Convert input data into sgkit format.
sgkit export-plink <sgk_dataset> <output_file>
sgkit import-vcf <input data> <output sgk dataset> - Convert input data into sgkit format.
sgkit export-vcf <sgk_dataset> <output_file>
So, how could we do sgkit import-x
format? I guess if each of the known file format packages registered themselves with sgkit
(kinda like plugins), and registration requires a import function and an export function. We could even try having a sgkit import
command which tried each of the known formats using a sniffer function defined by the format plugin. The tricky bit, I guess, would be defining the UIs for the different file formats, as we'll need format specific options for each one. I'm sure this could be done, though.
I think I like this approach better, good idea @hammer! It also makes the requirements for one of the file format plugins quite clear, with a well defined interface. Hopefully that keeps the format plugins relatively simple, and concerned only with reading and writing the format in question.
What do others think?
I think I should probably finish the initial version of this CLI just to get something working, and so we have a good idea of what the format specific arguments are going to look like. It'll be easy enough to pull it out afterwards.
We could even try having a
sgkit import
command which tried each of the known formats
Yes! I love IO libraries that infer file formats and their properties for me. For example, inferring the presence of a header and delimiter used in a CSV file.
I can also imagine this interface is where most users will discover they need a separate package to read/write a particular file format, and we could guide them through the installation here.
Yes, I like this plugin idea a lot actually, it seems to be a good compromise between supporting lots of formats but not letting the main sgkit package get all sprawly and horrible.
Any thoughts on the CLI structure here @tomwhite, @eric-czech, @ravwojdyla?
It's a minor comment, but from a plugin perspective, I think it's going to be easier for the commands to be something like sgkit vcf import
/sgkit vcf export
versus sgkit import-vcf
/sgkit export-vcf
, but that's to be determined.
I've filed https://github.com/pystatgen/sgkit/issues/53 to track the design and implementation of a unified CLI that can accommodate plugins.
Any thoughts on the CLI structure here?
I like the subcommand structure, implemented using interfaces and plugins.
OK, thanks. Rather than spend more time here then, I might just make a start on what this would look like in the main repo. There'll be some rough edges, but it's good to make a start. I'll leave this one open for now anyway.
Closing in favour of pystatgen/sgkit#57
Closes #7
This is an initial outline of the conversion CLI.
I haven't written any testing code yet, as I wanted to see if people agreed with the basic organisation first (and ran out of time!).
This adds a
plink2sg
CLI that can be run either viapython -m sgkit_plink plink2sg
or viaplink2sg
, if the package has been installed and the user's PATH set up accordingly. There's a bit of indirection needed to do this, but I think it's worth it as it's always useful to be able to use thepython -m
version. I haven't madepython -m sgkit_plink
callplink2sg
directly because I think we'll inevitably need to have another method,sg2plink
, which does the conversion in the other direction. There's some extra stuff in there that will facilitate testing, also, and having it broken up into little pieces makes things easier to mock out.What do we think of this organisation? Do we like the names?