Agree on outline

hammer commented 3 years ago

@eric-czech has an initial proposal at https://github.com/pystatgen/sgkit-publication/blob/main/content/01.outline.md.

jeromekelleher commented 3 years ago

Yep, overall narrative sounds good to me. We should break things down into some sections, I guess, and then sketch out the outlines of each of these sections as separate issues?

This is also determined by the journal and article format choice (#1), as some journals have pretty weird sectioning requirements.

alimanfoo commented 2 years ago

Here's another very rough possible starting point for an outline, similar to Eric's: https://docs.google.com/document/d/1TKSS--28ErsGjdyH6AHXnAcB7ngwCMXm5JK-u0dA_3c/edit?usp=sharing

hammer commented 2 years ago

eLife recommends the following outline:

Introduction
Results
Discussion
Methods
Acknowledgements
References
Figures
Tables

hammer commented 2 years ago

Pulling some previous discussions that could be useful here:

jeromekelleher commented 1 year ago

Here's an outline that @tomwhite, @benjeffery and I came up with last week:

Introduction
- What is Pydata?
- Columnar binary data (zarr)
- Distributed
- numba
- Other related technologies that we're using
Results
Discussion
- Light recap, picking up any meta-points that we haven't made through the rest of the paper.

Then within results we have

PyData for genomics
- sgkit design principles
- Overview of data structures, etc
- Discussion of basic performance characteristics, illustrating that the general strategy scales well in terms of single-threaded compute performance and space utilisation.
Population Genetics (#8)
Statistical Genetics
Quantitative Genetics (#9)
Software Development (Better name needed?)
- Reimplementing REGENIE and gene-e (quick comparison of LoC and rough performance numbers)
- Extending sgkit's zarr on-disk structures in tsinfer.
- Important to stress the point that stuff doesn't need to be in sgkit to benefit from sgkit. You can implement your own methods outside sgkit using the tools and are in no way obliged to contribute stuff into the repo.
Scaling to large datasets
- GPU pairwise distance example (but, could make this an example for a Phylogenetics section also)
- Scaling out with Dask (can refer to Liangde’s thesis/paper?)

What do we think? The first section (pydata for genomics) gets directly to the point of discussing sgkit's design principles and data structures, letting the intro set the scene of the software infrastructure around us.

In terms of display items, we would refer to the Scaling and compute (#7) in the pydata for genomics section, plus the . We probably don't need display items for the rest of the paper.

The PopGen, StatGen, QuantGen (and PhyloGen?) sections are a way to allow readers interested in just those areas to skip in and see what sort of things sgkit can do, without having to trudge through API listings. We want to give one (or two) concrete examples showing useful things being done, giving indicative performance figures without getting bogged down in direct performance comparisons. It also gives us a space to quickly discuss the tools that people use and illustrate how fragmented the ecosystem is.

jeromekelleher commented 1 year ago

If we roughly agree on this outline I can make some more issues to track the different sections, and sketch out what we want to say in them.

hammer commented 1 year ago

Looks great to me, thanks for moving this forward!

hammer commented 1 year ago

I should have asked: how do we define stat, pop, and quant gen? I generally think of pop gen as variation without phenotype and stat gen as variation with phenotype. I’m not sure where that leaves quant, perhaps as the union of the two? If so, do we need to rename to qgkit?

jeromekelleher commented 1 year ago

There probably isn't a good definition, but we can just do something pragmatic based on the user communities. PopGen people are mostly interested in evolutionary biology itself, StatGen people mostly in applications to humans, and QuantGen people mostly in applications to agriculture.

The tools they use are mostly nonoverlapping sets I think.

eric-czech commented 1 year ago

> Introduction

Should this include a mention of trends in python adoption? And/or why this is an important tailwind to ride given AI progress?

> We want to give one (or two) concrete examples showing useful things being done

FWIW on the StatGen piece, I think https://github.com/pystatgen/sgkit-publication/issues/9 is a good template for that. I also think that would probably be a good place to touch on the potential power and relatively nascent state of pathway GWAS (gene-e), GWAS/ExWAS methods in general (e.g. REGENIE), some of the QC ops necessary to get there (HWE, pruning, filtering) and general purpose operations like those for creating LD matrices and kinship coefficients (pc-relate).

@jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it. I'm not sure how this interacts with the Software Development section though -- perhaps you have some thoughts there?

jeromekelleher commented 1 year ago

> @jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it.

@eric-czech please do go ahead and create an issue to sketch out your thoughts on StatGen. Don't worry too much about how things fit into the overall structure, just get the key points that you think should get in there down in some form, and I'll bring it together into the document.

jeromekelleher commented 11 months ago

I'm going to close this as out-of-date now.

sgkit-dev / sgkit-publication

Agree on outline #2