pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

Presubmission Inquiry for HaploDynamics #131

Closed remytuyeras closed 8 months ago

remytuyeras commented 9 months ago

Submitting Author: Remy Tuyeras (@remytuyeras)
Package Name: HaploDynamics
One-Line Description of Package: A Python library to develop genomic data simulators
Repository Link (if existing): https://github.com/remytuyeras/HaploDynamics


Code of Conduct & Commitment to Maintain Package

Description

The HaploDynamics package provides a collection of functions to generate simulated population-specific genomic data in VCF format. The library includes parameters and functions to control mutation rates, linkage disequilibrium strength and block lengths, and number of individuals. To generate genomic data, the HaploDX framework offers a pipeline of functions that can be used to simulate: (1) the allele frequency spectra of different populations; (2) the Hardy-Weinberg principle for genotypes and haplotypes; (3) linkage disequilibrium across different populations.
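
For readers unfamiliar with item (2), the Hardy-Weinberg principle gives the expected genotype frequencies at a biallelic locus from the allele frequency alone: p², 2pq and q², with q = 1 - p. The sketch below is purely illustrative and does not use the HaploDynamics API; the function name `hardy_weinberg` is an assumption made for this example.

```python
def hardy_weinberg(p: float) -> tuple[float, float, float]:
    """Return expected genotype frequencies (AA, Aa, aa) for allele-A frequency p.

    Hypothetical helper for illustration only; not part of HaploDynamics.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p  # frequency of the alternative allele
    return p * p, 2.0 * p * q, q * q


# Expected genotype frequencies when allele A has frequency 0.3
hom_ref, het, hom_alt = hardy_weinberg(0.3)
print(hom_ref, het, hom_alt)
```

The three values always sum to 1, which is a quick sanity check for any simulator that claims to respect the principle.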

Community Partnerships

We partner with communities to support peer review with an additional layer of checks that satisfy community requirements. If your package fits into an existing community please check below:

Scope

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo
- [x] Unsure/Other (explain below)

I am not sure whether the package falls under one of these categories and was wondering if your editorial team could guide me on this.

The decreasing costs of DNA sequencing and cloud computing have led to a surge in the deployment of genomic pipelines on cloud platforms for international collaboration. This development has made it increasingly important to develop privacy-preserving methods that protect both patient privacy and the integrity of research that uses the data. Regulatory and privacy-preserving protocols have made genomic pipeline auditing more stringent and thorough, requiring them to be tested on simulated data to verify the validity of the algorithms that process the data. However, the process of iteratively generating and/or transferring genomic simulations on the cloud can be a major bottleneck for research.

To address this challenge, the HaploDynamics package provides a cloud-native framework for generating fast and lightweight simulations of genomic data that can replicate its complexity. The package also allows researchers to add their own generative mutation models to the framework, making it possible to create customized simulations that meet specific research needs.

MaCS (coded in C++) is a fast coalescence-based method for generating genomic data. However, its use of graph structures to encode segregation and recombination events can slow down its execution time.

Both MaCS and HaploDX rely on Bayesian networks to generate variants, which allows them to output rows of genotypes sequentially in a similar fashion. However, HaploDX does not use graph structures, relying instead on additional mathematical considerations. This difference allows HaploDX to outperform MaCS on execution times and memory usage.

Other packages, such as Sim1000G (coded in R), also rely on coalescence/pedigree structures, which can impair their execution times (and memory usage).


I am wondering if the HaploDynamics package fits the scope of the pyOpenSci project. If not, I would like to know if you have any recommendations for other packages or projects that I should consider.

Further, while HaploDynamics is fully functional, it is also under active development of new functionalities. Is this something that would be compatible with the pyOpenSci project?


NickleDave commented 9 months ago

Welcome to the pyOpenSci community @remytuyeras.

Yes, at first glance, HaploDynamics looks like it is in scope.

What would help us determine this would be complete documentation for the project, including examples that directly use HaploDynamics itself, that are built and rendered as a separate site.

Such documentation would also be required before a full submission.

I do see you have documentation in docs/source and that you point to a blog post on your personal site. As far as I can tell, the blog post does not specifically show usage of HaploDynamics itself, and is instead a more general tutorial on simulating pop genomics with Python. It's great that you have written that, but for submission you will need example usages of HaploDynamics in your documentation. I see some example snippets in docs/source, but you will also want standalone examples of "how-tos" or "walkthroughs" such as this and this in the pynteny documentation.

What you will need to do is:

Please let me know if that is clear.
I'll be happy to have a look and get more feedback from our editorial board on whether HaploDynamics is in scope once you have full documentation that will help us better understand the intended audience, functionality, and usage of the package. It looks like you are most of the way there now! Just set up a build please 🙂 and make sure the rendered pages include all the great info that's in your README

Also if you need help with any of this please feel free to ask questions here: https://pyopensci.discourse.group/

NickleDave commented 9 months ago

Hi again @remytuyeras, I forgot to say: I noticed in your README you instruct users to install scipy. When I search your repo I don't find any usages of it inside HaploDynamics though.

It's not strictly required for submission, but you probably want to replace your setup.py with a pyproject.toml file that explicitly declares your dependencies in the metadata section, as discussed here: https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html. If you do use scipy, then you'll want to declare it as a dependency there. This would also allow you to remove the section in the README instructing users to install scipy, since it would be automatically installed as a dependency of HaploDynamics.

Please also check out our guide on build tools here: https://www.pyopensci.org/python-package-guide/package-structure-code/python-package-build-tools.html

Last but not least, I would suggest careful use of namespaces to indicate to your users how they should use your API, and so that their use of your library will be very clear within their own scripts. Please see the following: https://benhoyt.com/writings/python-api-design/#module-and-package-structure and also http://blog.nicholdav.info/four-tips-structuring-research-python/

Accordingly, instead of your current imports, I would import the core classes inside haplodynamics.__init__.py so that a user can do the following:

import haplodynamics

# Start your simulation
model = haplodynamics.Model("tutorial")
# Initialize the genomic landscape
model.initiate_landscape(reference=1.245)
# Design your own genomic landscape with any allele frequency model
model.extend_landscape(*(haplodynamics.Model.standard_schema(20) for _ in range(6)))

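
The package-level import suggested here relies on re-exporting a class from `__init__.py`. The following self-contained sketch builds a throwaway package in a temporary directory to show the mechanism; `demo_pkg`, `core.py`, and this toy `Model` class are hypothetical names and do not reflect the real HaploDynamics layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; every name below is hypothetical.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()

# The class is defined in a submodule...
(pkg / "core.py").write_text(
    "class Model:\n"
    "    def __init__(self, name):\n"
    "        self.name = name\n"
)
# ...and __init__.py re-exports it at the package level.
(pkg / "__init__.py").write_text(
    "from .core import Model\n"
    "__all__ = ['Model']\n"
)

# Make the temporary directory importable, then import the package.
sys.path.insert(0, str(root))
demo_pkg = importlib.import_module("demo_pkg")

# Users only need the top-level namespace to reach the class.
model = demo_pkg.Model("tutorial")
print(model.name)
```

With this layout, user scripts never have to know which submodule defines `Model`, so the internal structure can change without breaking their imports.
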
remytuyeras commented 9 months ago

Hi @NickleDave,

Thank you for your thorough feedback, this is very much appreciated!

I will focus on setting up a documentation and will get back to pyOpenSci when this is done.

I would like to address some of the points made in the second message by providing some complementary information.

I also have a question regarding the difference between pyproject.toml and setup.py:

Please let me know if you have any feedback given the additional information above. Any tip or advice is appreciated. I'm always looking for ways to improve the package design.

NickleDave commented 9 months ago

Hi again @remytuyeras

> I will focus on setting up a documentation and will get back to pyOpenSci when this is done.

Great. Thank you for working on that. Just so we know where we're at, I added the "pending-maintainer-response" label to this issue.

> The SciPy package is used in the HaploDynamics/HaploDX/haploDX.py module. Otherwise, I included SciPy in setup.py using install_requires=['scipy'] and thought that this inclusion would automatically install the dependency when using pip.

I'm sorry, it's my fault for not reading the setup.py more closely. I tested that I could pip install HaploDynamics and I did see that scipy was installed and I could successfully import HaploDynamics.

> I intentionally distinguished the modules HaploDX and Framework.

This is a really great design choice, and I think it's very helpful to provide easy-to-read pure Python implementations as well as more efficient implementations that e.g. are vectorized with NumPy.

I also see now that you do import these modules at the package level inside HaploDynamics.__init__.py -- again, my fault for not reading more carefully.

That said, I would strongly suggest you use lowercase for both the package and module names. Otherwise people reading code that uses HaploDynamics may mistake them for classes, because the current names follow the class-naming convention. Please see the PEP 8 conventions for package and module names and class names. Of course we know that "a foolish consistency is the hobgoblin of small minds" but here we want to be consistent so people can easily read all the code that's going to be using your library! You don't want to make all those people mad at you 🙂
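
To make the naming point concrete, here is a minimal sketch that checks whether a name reads as a PEP 8 style module name (all lowercase, optional underscores) or as a CapWords class name. The helper names and regular expressions are assumptions for illustration, and `haplodx` is only a hypothetical lowercase spelling.

```python
import re

# Simplified patterns for the two PEP 8 naming styles discussed above.
MODULE_NAME = re.compile(r"[a-z][a-z0-9_]*")   # e.g. haplodynamics
CLASS_NAME = re.compile(r"[A-Z][A-Za-z0-9]*")  # e.g. Model (CapWords)


def reads_as_module(name: str) -> bool:
    """True if the whole name is lowercase in the style of a module name."""
    return MODULE_NAME.fullmatch(name) is not None


def reads_as_class(name: str) -> bool:
    """True if the whole name starts uppercase in the style of a class name."""
    return CLASS_NAME.fullmatch(name) is not None


for name in ("HaploDynamics", "HaploDX", "haplodynamics", "haplodx"):
    print(name, reads_as_module(name), reads_as_class(name))
```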

> is there any other advantage to using pyproject.toml

There are many advantages.
The main one is that it lets you declare, in static metadata instead of code, which tools to use to build your package. It also lets you declare additional metadata associated with the package that other tools can take advantage of. Most other modern languages use static files for this purpose instead of code. Avoiding setup.py also prevents an entire class of security vulnerabilities, since no code in setup.py is executed when building the package for installation. The only reason you may need a setup.py is if your package is not pure Python and has a more complex build. In your case you only have two dependencies, numpy and scipy, for which wheels are readily available, so you do not need a setup.py file.

Does that help?

remytuyeras commented 9 months ago

Hi @NickleDave ,

Yes, it helps a lot. I will work on the documentation and integrate your feedback above.

Thanks!

Best, Remy

NickleDave commented 8 months ago

@remytuyeras I am going to go ahead and close this since we have told you that, yes, HaploDynamics is in scope.

Please do reference this issue when you make the full submission (there's a place to do so in the template you'll fill out). Thank you!