Closed by jeromekelleher 11 months ago
This is a good narrative. A couple of thoughts:
I know it is a big part of the story of sgkit, but going into the details of each enabling library seems a bit too technical for the high-level summary that we want to draw people in with. Obviously getting into dask, zarr etc. later on is needed, but "JIT-compiled Python working on distributed, rectangular, chunked arrays with metadata works great for genetics" is the main story and "doing that via standard open libraries enables inclusive development and interoperability" is another.
One paragraph that I thought might be missing is an explanation of why existing solutions are insufficient, which would motivate the need for sgkit.
Good points, thanks @benjeffery
I think the comparison with existing methods has to go into the section about the storage strategy - you just can't talk about this stuff without getting into the weeds.
Forgot to say - where you have "% FIXME "unit" is the wrong word", it could be "segment" or "piece"?
I've made a pass at writing the "story summary" here, which I think would be worth others taking a look at (lines 97 to 170 in paper.tex).
Basically, we split the narrative into two bits. First, we tackle the big fundamental questions and show why our approach works well. Then, in the second part, we showcase the functionality of sgkit via some case studies.
I think this would be a nice paper - what do you think, @benjeffery @hammer @timothymillar @tomwhite?