sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
233 stars 32 forks source link

Rename api.py #91

Closed tomwhite closed 4 years ago

tomwhite commented 4 years ago

The API is not just in api.py anymore so we should rename it to dataset.py or similar.

eric-czech commented 4 years ago

If the repos ever merge I can almost see those functions making sense in an io.py module or perhaps the top level utils.py module.

I think they're reusable enough (e.g. testing.py) though that dataset.py makes sense to me, or perhaps model.py. I can imagine we may eventually have a datasets.py module related to fetching public data. It could be good to use something less similar so we don't have dataset.py and datasets.py.

I think this kind of division may also make sense soon (cf https://github.com/pystatgen/sgkit/issues/94):

sgkit/
  stats/
    quantgen/          # quantitative genetics methods  
      regenie.py
      association.py
    popgen/             # population genetics methods
      hwe.py
      diversity.py
      fst.py
    aggregation.py  # summary stats
    ld_prune.py
  model.py
  typing.py
  testing.py
  ...

Consolidating a lot of the popgen methods into larger modules would be a good idea, and the popgen/quantgen distinction could be a useful one as more PRs come in for them.

jeromekelleher commented 4 years ago

How "public" do people see the internal file organisation being? I tend to prefer keeping the public API as flat as possible, and regard the internal file structure as a private implementation detail. I think this is the approach we've been taking so far, I just want to make sure I understand what folks are thinking.

tomwhite commented 4 years ago

I agree that the public API should be as flat as possible. Functions can be in the same (top-level) namespace, but grouped by heading (IO, quantatative genetics, population genetics) on https://pystatgen.github.io/sgkit/api.html. (That reminds me, we need to keep that page in sync with the exported API.)

The only public function that I know of that is not currently at the top-level is testing.simulate_genotype_call_dataset. We have an example with from sgkit.testing import simulate_genotype_call_dataset... Should we move it to the top-level?

jeromekelleher commented 4 years ago

I agree that the public API should be as flat as possible. Functions can be in the same (top-level) namespace, but grouped by heading (IO, quantatative genetics, population genetics) on https://pystatgen.github.io/sgkit/api.html. (That reminds me, we need to keep that page in sync with the exported API.)

:thumbsup:

The only public function that I know of that is not currently at the top-level is testing.simulate_genotype_call_dataset. We have an example with from sgkit.testing import simulate_genotype_call_dataset... Should we move it to the top-level?

I don't really have an opinion there - either works, since it's not core functionality. Keeping it in sgkit.testing sort of emphasises that it's just for testing (not real statgen), which is good.

eric-czech commented 4 years ago

I tend to prefer keeping the public API as flat as possible, and regard the internal file structure as a private implementation detail.

👍

The only public function that I know of that is not currently at the top-level is testing.simulate_genotype_call_dataset

I think this function (or something like it) is key for a lot of documentation. Being able to play with the parameters for the dimensions that get created as well as missingness will be an important part of the Xarray learning process. It does seem a little peculiar for it to come out of a testing module though if we're going to use it widely. Perhaps it should be in a datasets.py module instead, which also contains code for fetching public or other static test datasets?

hammer commented 4 years ago

Any new thoughts on this one? Now that we've made some important design decisions on the API, I think we should put more energy into documentation, and I think this rename/reorganization would be nice to have done before that documentation is written.

tomwhite commented 4 years ago

The internal organisation is orthogonal to the public API, so this issue shouldn't hold up docs. Having said that, I think this issue is just a simple rename. In https://github.com/pystatgen/sgkit/pull/236, I have renamed to model.py, as suggested in this issue - and also echoing scikit-allel.