Closed tomwhite closed 4 years ago
If the repos ever merge I can almost see those functions making sense in an io.py
module or perhaps the top level utils.py
module.
I think they're reusable enough (e.g. testing.py
) though that dataset.py
makes sense to me, or perhaps model.py
. I can imagine we may eventually have a datasets.py
module related to fetching public data. It could be good to use something less similar so we don't have dataset.py
and datasets.py
.
I think this kind of division may also make sense soon (cf https://github.com/pystatgen/sgkit/issues/94):
sgkit/
stats/
quantgen/ # quantitative genetics methods
regenie.py
association.py
popgen/ # population genetics methods
hwe.py
diversity.py
fst.py
aggregation.py # summary stats
ld_prune.py
model.py
typing.py
testing.py
...
Consolidating a lot of the popgen methods into larger modules would be a good idea, and the popgen/quantgen distinction could be a useful one as more PRs come in for them.
How "public" do people see the internal file organisation being? I tend to prefer keeping the public API as flat as possible, and regard the internal file structure as a private implementation detail. I think this is the approach we've been taking so far, I just want to make sure I understand what folks are thinking.
I agree that the public API should be as flat as possible. Functions can be in the same (top-level) namespace, but grouped by heading (IO, quantatative genetics, population genetics) on https://pystatgen.github.io/sgkit/api.html. (That reminds me, we need to keep that page in sync with the exported API.)
The only public function that I know of that is not currently at the top-level is testing.simulate_genotype_call_dataset
. We have an example with from sgkit.testing import simulate_genotype_call_dataset
... Should we move it to the top-level?
I agree that the public API should be as flat as possible. Functions can be in the same (top-level) namespace, but grouped by heading (IO, quantatative genetics, population genetics) on https://pystatgen.github.io/sgkit/api.html. (That reminds me, we need to keep that page in sync with the exported API.)
:thumbsup:
The only public function that I know of that is not currently at the top-level is testing.simulate_genotype_call_dataset. We have an example with from sgkit.testing import simulate_genotype_call_dataset... Should we move it to the top-level?
I don't really have an opinion there - either works, since it's not core functionality. Keeping it in sgkit.testing
sort of emphasises that it's just for testing (not real statgen), which is good.
I tend to prefer keeping the public API as flat as possible, and regard the internal file structure as a private implementation detail.
👍
The only public function that I know of that is not currently at the top-level is testing.simulate_genotype_call_dataset
I think this function (or something like it) is key for a lot of documentation. Being able to play with the parameters for the dimensions that get created as well as missingness will be an important part of the Xarray learning process. It does seem a little peculiar for it to come out of a testing module though if we're going to use it widely. Perhaps it should be in a datasets.py
module instead, which also contains code for fetching public or other static test datasets?
Any new thoughts on this one? Now that we've made some important design decisions on the API, I think we should put more energy into documentation, and I think this rename/reorganization would be nice to have done before that documentation is written.
The internal organisation is orthogonal to the public API, so this issue shouldn't hold up docs. Having said that, I think this issue is just a simple rename. In https://github.com/pystatgen/sgkit/pull/236, I have renamed to model.py
, as suggested in this issue - and also echoing scikit-allel.
The API is not just in
api.py
anymore so we should rename it todataset.py
or similar.