harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference
https://harrispopgen.github.io/mushi/
MIT License
24 stars 6 forks source link

Notation standardization & improvement #43

Closed kamdh closed 4 years ago

kamdh commented 4 years ago

image

  1. Simplify notation in paper. Focus on notation for N(t) and mu(t) are best
  2. Make code ascii compatible for end-user ease. Use explanatory names for variables (step_size rather than eta).
  3. Standardize notation between code and paper.
wsdewitt commented 4 years ago
  1. This library is already quite tidy. Is your concern actually readability and structure of the API, or you just can't help yourself from sharing this meme (for the 3rd time today ;) )? I had thought your concern was simply that users won't know how to write unicode μ, but all users will read μ as mutation rate. If you think the API does need restructuring, now is definitely the time to say so.

  2. I have mixed feelings on replacing η with N. Using N will introduce a proliferation of ploidy factors (2 for humans, some of which we're sure to miss), and loss of generality. Note that for haploids η = N, for diploids, η = 2N, for tetraploids η=4N, ... N is more recognizable as effective pop size, but it's annoying to bookkeep ploidy. Perhaps the way to go is to use N in main text, but pass to η in the appendix for full generality, and more direct comparability to Yun's paper that we borrow from. Likewise, in the code the API could use N and demographic history objects could include a keyword ploidy argument (default=2) in their initialization. Then the implementation can use η. This way the ploidy factor is only needed once.

  3. I think it's good to make the API ascii compatible (kSFS.mu instead of kSFS. μ), while keeping some unicode in the implementation (since end users won't be going there, and any developer trying to understand it must first boot the paper and all its notation into meat RAM).

  4. I think it's fine if the under-the-hood numerics for the discretized problem look more like machine learning notation (x, y, z), but the main text of the paper should use popgen notation.

kamdh commented 4 years ago

So I meant to refer to address a few things in here. I think the use of eta / y and mu / Z is confusing. I think it would be better to use different fonts for the functions and discretized versions. My preference would be bold font for the discrete vectors. Using bold for vectors and capital bold for matrices is always nice. The capitol of mu would be M, in this case, which I'm okay with if that's what you think.

I'll defer to you and Kelley on the N versus eta debate. It seems like the code doesn't need to deal with the ploidy anyway; the user could post-process that. I have always thought it was a little sloppy to use eta for something close to N. It looks like an n, but it's really the Greek e.

By the way, I just glanced at dadi, and they use nu: https://bb.githack.com/gutenkunstlab/dadi/raw/master/doc/api/dadi/Demographics1D.html nu makes a lot more sense than eta for me.

For the code, I think we should remove the unicode since anyone using the package and trying to tab-complete would have to be able to type the unicode. For internal bits of code that end-users will never access, you could keep unicode around. But I would prefer not to have it.

The other thing to consider, which would probably add usability, would be to have the "mu" and "eta" variables accessible as "mutation_rates" and "pop_size" in the code. That would make it easier to use the interface. It wouldn't as neatly match the paper, but like we were discussing, a user doesn't need to keep track of that in their head.

kamdh commented 4 years ago

And did we already fix the regularization parameters to not have any alphas? That was on my to-do list.

wsdewitt commented 4 years ago

yeah, if you mean the elastic net style alphas. Those are gone. alpha/beta now distinguishes eta vs mu reg params

wsdewitt commented 4 years ago

I see what you mean about eta being e not n. Note nu in dadi is defined in units of an ancestral size (which they use to set their time scale), so also has potential to confuse users coming from dadi. I think N for main text and API is the way.

wsdewitt commented 4 years ago

another thing to consider when it comes to fixing imperfect-but-entrenched notation: https://xkcd.com/927/