Yep, overall narrative sounds good to me. We should break things down into some sections, I guess, and then sketch out the outlines of each of these sections as separate issues?
This is also determined by the journal and article format choice (#1), as some journals have pretty weird sectioning requirements.
Here's another very rough possible starting point for an outline, similar to Eric's: https://docs.google.com/document/d/1TKSS--28ErsGjdyH6AHXnAcB7ngwCMXm5JK-u0dA_3c/edit?usp=sharing
eLife recommends the following outline:
Pulling some previous discussions that could be useful here:
Here's an outline that @tomwhite, @benjeffery and I came up with last week:
Then within results we have
What do we think? The first section (pydata for genomics) gets directly to the point of discussing sgkit's design principles and data structures, letting the intro set the scene of the software infrastructure around us.
In terms of display items, we would refer to the Scaling and compute (#7) in the pydata for genomics section, plus the . We probably don't need display items for the rest of the paper.
The PopGen, StatGen, QuantGen (and PhyloGen?) sections are a way to allow readers interested in just those areas to skip in and see what sort of things sgkit can do, without having to trudge through API listings. We want to give one (or two) concrete examples showing useful things being done, giving indicative performance figures without getting bogged down in direct performance comparisons. It also gives us a space to quickly discuss the tools that people use and illustrate how fragmented the ecosystem is.
If we roughly agree on this outline I can make some more issues to track the different sections, and sketch out what we want to say in them.
Looks great to me, thanks for moving this forward!
I should have asked: how do we define stat, pop, and quant gen? I generally think of pop gen as variation without phenotype and stat gen as variation with phenotype. I’m not sure where that leaves quant, perhaps as the union of the two? If so, do we need to rename to qgkit?
There probably isn't a good definition, but we can just do something pragmatic based on the user communities. PopGen people are mostly interested in evolutionary biology itself, StatGen mostly in applications to humans and QuantGen mostly in applications to agriculture.
The tools they use are mostly nonoverlapping sets I think.
Introduction
Should this include a mention of trends in Python adoption? And/or why this is an important tailwind to ride given AI progress?
We want to give one (or two) concrete examples showing useful things being done
FWIW on the StatGen piece, I think https://github.com/pystatgen/sgkit-publication/issues/9 is a good template for that. I also think that would probably be a good place to touch on the potential power and relatively nascent state of pathway GWAS (gene-e), GWAS/ExWAS methods in general (e.g. REGENIE), some of the QC ops necessary to get there (HWE, pruning, filtering) and general purpose operations like those for creating LD matrices and kinship coefficients (pc-relate).
@jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it. I'm not sure how this interacts with the Software Development section though -- perhaps you have some thoughts there?
@jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it.
@eric-czech please do go ahead and create an issue to sketch out your thoughts on StatGen. Don't worry too much about how things fit into the overall structure, just get the key points that you think should get in there down in some form, and I'll bring it together into the document.
I'm going to close this as out-of-date now.
@eric-czech has an initial proposal at https://github.com/pystatgen/sgkit-publication/blob/main/content/01.outline.md.