Presubmission Inquiry for HaploDynamics

Submitting Author: Remy Tuyeras (@remytuyeras)
Package Name: HaploDynamics
One-Line Description of Package: A python library to develop genomic data simulators
Repository Link (if existing): https://github.com/remytuyeras/HaploDynamics

Code of Conduct & Commitment to Maintain Package

- [x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package after should it be accepted.
- [x] I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.

Description

Include a brief paragraph describing what your package does:

The HaploDynamics package provides a collection of functions to generate simulated population-specific genomic data in VCF format. The library includes parameters and functions to control mutation rates, linkage disequilibrium strength and block lengths, and number of individuals. To generate genomic data, the HaploDX framework offers a pipeline of functions that can be used to simulate: (1) the allele frequency spectra of different populations; (2) the Hardy-Weinberg principle for genotypes and haplotypes; (3) linkage disequilibrium across different populations.
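To make step (2) concrete, here is a minimal sketch of sampling phased genotypes under the Hardy-Weinberg principle. It is not taken from HaploDynamics: the function names, the allele frequency of 0.3, and the sample size are all assumptions chosen for illustration, and only the standard library is used.

```python
import random


def sample_genotype(alt_freq, rng):
    """Draw one phased diploid genotype under Hardy-Weinberg equilibrium.

    Each haplotype independently carries the alternate allele with
    probability alt_freq, so the genotype classes occur with probabilities
    (1 - p)^2 for 0|0, 2p(1 - p) for heterozygotes, and p^2 for 1|1.
    """
    return tuple(int(rng.random() < alt_freq) for _ in range(2))


def simulate_site(alt_freq, n_individuals, seed=0):
    """Simulate one variant site for n_individuals (hypothetical helper)."""
    rng = random.Random(seed)
    return [sample_genotype(alt_freq, rng) for _ in range(n_individuals)]


def vcf_genotype_fields(genotypes):
    """Format phased genotypes as they appear in VCF sample columns."""
    return [f"{a}|{b}" for a, b in genotypes]


genotypes = simulate_site(alt_freq=0.3, n_individuals=10_000)

# The observed alternate-allele frequency should be close to the input.
observed_alt_freq = sum(a + b for a, b in genotypes) / (2 * len(genotypes))
print(vcf_genotype_fields(genotypes[:5]))
print(round(observed_alt_freq, 2))
```

Drawing the two haplotypes independently is what makes the genotype frequencies follow the Hardy-Weinberg proportions; modelling linkage disequilibrium, as in step (3), would additionally require correlating draws across neighbouring sites.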

Community Partnerships

We partner with communities to support peer review with an additional layer of checks that satisfy community requirements. If your package fits into an existing community please check below:

- [ ] Pangeo
  - [ ] My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook

Scope

Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- [ ] Data retrieval
- [ ] Data extraction
- [ ] Data processing/munging
- [ ] Data deposition
- [ ] Data validation and testing
- [ ] Data visualization
- [ ] Workflow automation
- [ ] Citation management and bibliometrics
- [ ] Scientific software wrappers
- [ ] Database interoperability

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo
- [x] Unsure/Other (explain below)

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

I am not sure whether the package falls under one of the categories and was wondering if your editorial team could guide me on this.

Who is the target audience and what are the scientific applications of this package?

The decreasing costs of DNA sequencing and cloud computing have led to a surge in the deployment of genomic pipelines on cloud platforms for international collaboration. This development has made it increasingly important to develop privacy-preserving methods that protect both patient privacy and the integrity of research that uses the data. Regulatory and privacy-preserving protocols have made genomic pipeline auditing more stringent and thorough, requiring them to be tested on simulated data to verify the validity of the algorithms that process the data. However, the process of iteratively generating and/or transferring genomic simulations on the cloud can be a major bottleneck for research.

To address this challenge, the HaploDynamics package provides a cloud-native framework for generating fast and lightweight simulations of genomic data that can replicate its complexity. The package also allows researchers to add their own generative mutation models to the framework, making it possible to create customized simulations that meet specific research needs.

Are there other Python packages that accomplish similar things? If so, how does yours differ?

MaCS (written in C++) is a fast coalescent-based method for generating genomic data. However, its use of graph structures to encode segregation and recombination events can increase its execution time.

Both MaCS and HaploDX rely on Bayesian networks to generate variants, which allows them to output rows of genotypes sequentially in a similar fashion. However, HaploDX does not use graph structures, relying instead on additional mathematical considerations. This difference allows HaploDX to outperform MaCS on execution times and memory usage.

Other packages, such as Sim1000G (written in R), also rely on coalescent/pedigree structures, which can increase their execution time and memory usage.

Ref:

https://github.com/adimitromanolakis/sim1000G
https://github.com/gchen98/macs/tree/master

Any other questions or issues we should be aware of:

I am wondering if the HaploDynamics package fits the scope of the pyOpenSci project. If not, I would like to know if you have any recommendations for other packages or projects that I should consider.

Further, while HaploDynamics is fully functional, new features are still under active development. Is this compatible with the pyOpenSci project?

P.S. Have feedback/comments about our review process? Leave a comment here

Welcome to the pyOpenSci community @remytuyeras.

Yes, at first glance, HaploDynamics looks like it is in scope.

What would help us determine this would be complete documentation for the project, including examples that directly use HaploDynamics itself, that are built and rendered as a separate site.

Such documentation would also be required before a full submission.

I do see you have documentation in docs/source and that you point to a blog post on your personal site. As far as I can tell the blog post does not specifically show usage of HaploDynamics itself, and is instead a more general tutorial on simulating pop genomics with Python. It's great that you have written that, but for submission you will need example usages of HaploDynamics in your documentation. I see some example snippets in docs/source but you will also want standalone examples of "how-tos" or "walkthroughs" such as this and this in the pynteny documentation.

What you will need to do is:

- [ ] Set up your docs so they can be rendered as a separate web page by Sphinx or MkDocs
  - [ ] Note that you can mostly avoid rewriting from Markdown to reStructuredText by using MyST, as discussed here
  - [ ] You can probably recycle a large part of your README as a landing page for the docs, and avoid having the two go out of sync, by using a literal include of the README in your docs
  - [ ] API documentation -- I see this page but it looks like you are defining the API by hand? You may find it helpful to instead use autodoc and autosummary for example as done here in the PyGMT API docs
  - [ ] Include full examples focused specifically on usage of HaploDynamics; these should be vignettes that illustrate the intended functionality of the package. They can be expanded versions of the snippets you have now
- [ ] Build and render the docs with an external service such as ReadTheDocs or GitHub Pages, as described in this section of our guide

Please let me know if that is clear.
I'll be happy to have a look and get more feedback from our editorial board on whether HaploDynamics is in scope once you have full documentation that will help us better understand the intended audience, functionality, and usage of the package. It looks like you are most of the way there now! Just set up a build please 🙂 and make sure the rendered pages include all the great info that's in your README

Also if you need help with any of this please feel free to ask questions here: https://pyopensci.discourse.group/

Hi again @remytuyeras, I forgot to say: I noticed in your README you instruct users to install scipy. When I search your repo I don't find any usages of it inside HaploDynamics though.

It's not strictly required for submission, but you probably want to replace your setup.py with a pyproject.toml file that explicitly declares your dependencies in the metadata section as discussed here: https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html. If you do use scipy, then you'll want to declare it as a dependency there. This would also allow you to remove the section in the README instructing users to install scipy, since it would be automatically installed as a dependency of HaploDynamics.

Please also check out our guide on build tools here: https://www.pyopensci.org/python-package-guide/package-structure-code/python-package-build-tools.html

Last but not least, I would suggest careful use of namespaces to indicate to your users how they should use your API, and so that their use of your library will be very clear within their own scripts. Please see the following: https://benhoyt.com/writings/python-api-design/#module-and-package-structure and also http://blog.nicholdav.info/four-tips-structuring-research-python/

Accordingly, instead of your current imports, I would import the core classes inside haplodynamics.__init__.py so that a user can do the following:

import haplodynamics
# Start your simulation
model = haplodynamics.Model("tutorial")
# Initialize the genomic landscape
model.initiate_landscape(reference=1.245)
# Design your own genomic landscape with any allele frequency model
model.extend_landscape(*(haplodynamics.Model.standard_schema(20) for _ in range(6)))

Hi @NickleDave,

Thank you for your thorough feedback, this is very much appreciated!

I will focus on setting up the documentation and will get back to pyOpenSci when this is done.

I would like to address some of the points made in the second message by providing some complementary information.

- The SciPy package is used in the HaploDynamics/HaploDX/haploDX.py module to compute Pearson correlation scores. This is done by importing the scipy.stats submodule as stat and then calling the function stat.pearsonr() inside the function LD_corr_matrix().
- I intentionally distinguished the modules HaploDX and Framework. HaploDX is an introductory module that is not optimized, but it is easier to start with. It stems from the tutorial published on my personal webpage. Anyone interested in learning the framework can experiment with HaploDynamics.HaploDX while following the tutorial. The module Framework is specifically optimized for research and is intended for researchers who need the most performance. This is why I wanted different import instructions for the two modules. However, both modules have the same underlying principles.

I also have a question regarding the difference between pyproject.toml and setup.py:

In the README, my intention was to let users know that they need to install scipy themselves only if they point the PYTHONPATH variable to the package manually. Otherwise, I included SciPy in setup.py using install_requires=['scipy'] and thought that this inclusion would automatically install the dependency when using pip. After your suggestion, I am wondering: is this a misconception, or is there any other advantage to using pyproject.toml in the context of pyOpenSci?

Please let me know if you have any feedback given the additional information above. Any tip or advice is appreciated. I'm always looking for ways to improve the package design.

Hi again @remytuyeras

> I will focus on setting up the documentation and will get back to pyOpenSci when this is done.

Great. Thank you for working on that. Just so we know where we're at, I added the "pending-maintainer-response" label to this issue.

> The SciPy package is used in the HaploDynamics/HaploDX/haploDX.py module

> Otherwise, I included SciPy in setup.py using install_requires=['scipy'] and thought that this inclusion would automatically install the dependency when using pip

I'm sorry, it's my fault for not reading the setup.py more closely. I tested that I could pip install HaploDynamics and I did see that scipy was installed and I could successfully import HaploDynamics.

> I intentionally distinguished the modules HaploDX and Framework.

This is a really great design choice, and I think it's very helpful to provide easy-to-read pure Python implementations as well as more efficient implementations that e.g. are vectorized with NumPy.

I also see now that you do import these modules at the package level inside HaploDynamics.__init__.py -- again, my fault for not reading more carefully.

That said, I would strongly suggest you use lowercase for both the package and module names, to avoid confusing people reading code that uses HaploDynamics, who will mistake them for classes because you have used the convention for naming classes. Please see the PEP8 conventions for package and module names and class names. Of course we know that "a foolish consistency is the hobgoblin of small minds" but here we want to be consistent so people can easily read all the code that's going to be using your library! You don't want to make all those people mad at you 🙂
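As a small illustration of the convention, the sketch below checks names against two patterns that paraphrase PEP 8: all-lowercase for packages and modules, CapWords for classes. The regular expressions and helper names are my own simplification, and the lowercase spellings are hypothetical renames rather than names the package currently uses.

```python
import re

# Simplified patterns paraphrasing PEP 8 (an approximation, not the PEP's text):
# packages and modules use all-lowercase names, classes use CapWords.
MODULE_NAME = re.compile(r"[a-z][a-z0-9_]*")
CLASS_NAME = re.compile(r"[A-Z][A-Za-z0-9]*")


def looks_like_module(name):
    """True if the name follows the lowercase package/module convention."""
    return MODULE_NAME.fullmatch(name) is not None


def looks_like_class(name):
    """True if the name follows the CapWords class convention."""
    return CLASS_NAME.fullmatch(name) is not None


# Current names next to hypothetical lowercase renames.
for name in ["HaploDynamics", "haplodynamics", "Framework", "framework", "Model"]:
    print(name, looks_like_module(name), looks_like_class(name))
```

A reader who sees HaploDynamics.Framework.Model in a script cannot tell from the names alone which parts are modules and which is the class, whereas haplodynamics.framework.Model makes that split visible.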

> is there any other advantage to using pyproject.toml

There are many advantages.
The main one is that it lets you declare what tools to use to build your package, in static metadata instead of code. It also lets you declare additional metadata associated with the package, that other tools can take advantage of. Most other modern languages use static files for this purpose instead of code. Avoiding the use of setup.py also prevents an entire class of security vulnerabilities, since we are no longer executing code in setup.py when building a package for installation. The only reason you may need a setup.py is if your package is not pure Python and has a more complex build. In your case you only have two dependencies, numpy and scipy, for which wheels are readily available, so you do not need a setup.py file.

Does that help?

Hi @NickleDave ,

Yes, it helps a lot. I will work on the documentation and integrate your feedback above.

Thanks!

Best, Remy

@remytuyeras I am going to go ahead and close this since we have told you that, yes, HaploDynamics is in scope.

Please do reference this issue when you make the full submission (there's a place to do so in the template you'll fill out). Thank you!
