[RFC]: Implement a broader range of statistical distributions

AgPriyanshu18 commented 3 months ago

Full name

Priyanshu Agarwal

University status

Yes

University name

Indian Institute of Information Technology Jabalpur

University program

Bachelors of Technology in Computer Science and Engineering

Expected graduation

01-08-2025

Short biography

I am Priyanshu Agarwal, a pre-final year Computer Science and Engineering student at the Indian Institute of Information Technology Jabalpur. My background is complemented by hands-on experience in development with languages and frameworks like JavaScript and Node.js, mobile development in Kotlin for Android, and core programming languages C/C++ and Java. Internships, freelance projects, and active contributions to college projects further strengthen this experience. My passion for continuous learning led me to explore new frameworks and technologies within these domains, solidifying my practical understanding.

Beyond web and mobile development, I hold a strong interest in data structures, algorithms, and object-oriented programming (OOP), evident from my coursework (Advanced Probability, Data Science, Data Structures & Algorithms, Theory of Computation) and participation in coding platforms like LeetCode and Codeforces. These platforms have been valuable tools to hone my problem-solving skills and enhance my proficiency in these critical areas. Additionally, I recently began contributing to stdlib, immersing myself in its extensive codebase and actively engaging in development tasks. This experience has provided invaluable insights into a larger codebase structure and collaborative development practices, further enriching my skillset.

Timezone

Indian Standard Time (IST), GMT+5:30

Contact details

Email: agpriyanshu18@gmail.com / 21bcs167@iiitdmj.ac.in
Phone number: 6378228784
LinkedIn: https://www.linkedin.com/in/priyanshu-agarwal-484151220/
GitHub: https://github.com/AgPriyanshu18

Platform

Linux

Editor

My development environment is Ubuntu, with Visual Studio Code (VS Code) as my primary editor. VS Code offers extensive built-in JavaScript support along with tooling for unit testing and formatting, and its Git integration makes collaboration straightforward. While some might prefer macOS for web development, Ubuntu provides a robust environment for this project, allowing me to be productive and contribute effectively from the start.

Programming experience

I have been actively expanding my programming skillset since high school, focusing on JavaScript, C, C++, Java, Kotlin, and Node.js. This has resulted in a strong understanding of these languages and their associated libraries and frameworks. I am also a competitive programming enthusiast and have solved over 300 problems in total across several platforms. To demonstrate my abilities, I'd like to highlight my relevant projects:

Sevayu: This project focused on digitalizing hospital services via a subscription model. Sevayu offered features like online appointment booking, detailed medicine information, and personalized functionalities. Here, I delved into data science applications within JavaScript. To personalize functionalities, I implemented recommendation algorithms based on user medical history and appointment data.
Pocket Manager: A Kotlin/XML/Firebase expense tracker that simplifies accurate expense recording. Users gain insights through interactive charts and budgeting tools powered by algorithms that analyze spending data to reveal trends.

These projects showcase my proficiency in working with web APIs, real-time data, and data visualization within a JavaScript environment, all of which are relevant skills for this data analysis library project.

JavaScript experience

I've been actively honing my JavaScript skills. This journey has equipped me with a comprehensive understanding of the language's syntax, features, and best practices. I've worked on many projects, from building interactive web applications to crafting server-side functionality with Node.js, as in Sevayu. What truly excites me about JavaScript is its unique blend of flexibility and power. First-class functions allow me to treat functions like variables, promoting code reusability and a more functional programming style. Dynamic typing, while requiring careful attention, streamlines development and rapid prototyping.

Node.js experience

My passion for backend development led me to delve into Node.js. This powerful platform, combined with the popular Express framework, has become my go-to for building web applications. I've honed my skills by crafting backends for projects like Sevayu. These real-world experiences solidified my understanding of Node.js while equipping me to solve complex problems effectively. To further strengthen my backend expertise, I've worked with various databases, including MySQL and MongoDB, to tailor data storage solutions for each project. Additionally, I've gained proficiency in writing middleware for authentication and routing, ensuring streamlined user access and website navigation.

C/Fortran experience

My programming journey began with C/C++ in high school, where I discovered the power of low-level languages. C's compactness made it well suited to direct hardware control in projects like Arduino programming, and its fast execution speed made it my language of choice for competitive programming. While my experience with Fortran isn't as extensive as with JavaScript or C/C++, I've gained a solid understanding of its strengths in scientific computing. Fortran's optimized support for numerical operations and arrays makes it a natural choice for tasks like linear algebra and simulations.

Interest in stdlib

The Stdlib library offers a robust collection of utility functions and modules that streamline development efforts within the Node.js ecosystem. It encompasses a wide range of functionality, including mathematical and statistical computations and file system operations. This comprehensive toolkit empowers developers of varying skill sets to efficiently address diverse project requirements. Furthermore, Stdlib prioritizes excellence, performance optimization, and adherence to established industry standards. This commitment fosters trust and widespread adoption within the Node.js community. All of this has made me interested in the organization and motivated me to contribute.

Version control

Yes

Contributions to stdlib

Merged:
feat: Added utils/none-in-by #1416
feat: add array/base/count-same-value-zero #1384

Open:
refactors blas/ext/base/sfill to follow current projects convention #1809
refactors blas/ext/base/snansumpw to follow current project conventions #1711

Goals

The main goal of this project is to implement a broader range of statistical distributions that are present in the SciPy library directly within the Stdlib library. Currently, Stdlib users who need to work with a wider variety of statistical distributions often rely on external libraries like SciPy. This integration project aims to eliminate that dependency by bringing the power of SciPy's stats module right into Stdlib. By implementing all the distributions found in SciPy's stats module, Stdlib users will have access to a significantly richer toolkit for statistical analysis. They'll be able to calculate key distribution properties like PDFs, CDFs, and quantiles, all without leaving the Stdlib environment. Additionally, the ability to generate random variates based on these distributions will further enhance Stdlib's capabilities for data simulation tasks. This expanded functionality will streamline the workflow for developers working on data analysis projects in JavaScript, allowing them to focus on their core analysis tasks without worrying about managing external libraries.

Approach

To accomplish this project, I have categorized the major continuous and discrete distributions by how complex they are to implement and by whether they need additional utility functions.

First category: distributions whose implementations are straightforward and whose required dependencies are already present in stdlib/math/base/special. I have tried to order them by the complexity of the distribution formula: Boltzmann, Bradford, half-normal, log-normal, ARGUS, Planck, Dagum, Gibrat, Rademacher, inverse Weibull, log-logistic, anglit, log-gamma, Gompertz, folded Cauchy, half-Cauchy, half-logistic.

Second category: distributions that are somewhat more complex and require new utility and math functions before they can be implemented. Some may also build on first-category distributions: Crystal Ball, Burr (Type III), double Weibull, double Laplace, Maxwell, Zipf, von Mises, Wald, non-central chi-square, Rice, studentized range, Skellam, fatigue life.
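To make the properties named in the goals concrete, here is a minimal sketch of a PDF, CDF, quantile function, and random variate generator for one first-category distribution, the log-logistic. It uses only plain JavaScript `Math` functions. The function names and the `rand` parameter are hypothetical and chosen for illustration; they are not existing stdlib APIs, and a real package would follow the project's conventions and handle numerical edge cases (such as overflow for very large `x`) more carefully.

```javascript
'use strict';

// Illustrative sketch only: the names below are hypothetical, not stdlib APIs.
// Log-logistic distribution with scale `alpha > 0` and shape `beta > 0`.

// Returns true when the parameters are outside the domain (also catches NaN).
function invalidParams( alpha, beta ) {
	return !( alpha > 0.0 && beta > 0.0 );
}

// PDF: (beta/alpha) * z^(beta-1) / (1 + z^beta)^2, where z = x/alpha.
function logLogisticPDF( x, alpha, beta ) {
	var z;
	if ( Number.isNaN( x ) || invalidParams( alpha, beta ) ) {
		return NaN;
	}
	if ( x < 0.0 || x === Infinity ) {
		return 0.0;
	}
	z = x / alpha;
	return ( beta / alpha ) * Math.pow( z, beta - 1.0 ) /
		Math.pow( 1.0 + Math.pow( z, beta ), 2.0 );
}

// CDF: 1 / (1 + (x/alpha)^(-beta)) for x > 0, and 0 otherwise.
function logLogisticCDF( x, alpha, beta ) {
	if ( Number.isNaN( x ) || invalidParams( alpha, beta ) ) {
		return NaN;
	}
	if ( x <= 0.0 ) {
		return 0.0;
	}
	return 1.0 / ( 1.0 + Math.pow( x / alpha, -beta ) );
}

// Quantile (inverse CDF): alpha * (p / (1-p))^(1/beta) for p in [0,1].
function logLogisticQuantile( p, alpha, beta ) {
	if ( !( p >= 0.0 && p <= 1.0 ) || invalidParams( alpha, beta ) ) {
		return NaN;
	}
	return alpha * Math.pow( p / ( 1.0 - p ), 1.0 / beta );
}

// Random variate via inverse transform sampling: draw u ~ Uniform(0,1) and
// return quantile(u). `rand` is an optional uniform generator (an assumption
// made here so that the sketch can be seeded or tested deterministically).
function logLogisticRandom( alpha, beta, rand ) {
	var u = ( typeof rand === 'function' ) ? rand() : Math.random();
	return logLogisticQuantile( u, alpha, beta );
}

// Usage: the median of the distribution equals the scale parameter.
console.log( logLogisticCDF( 2.0, 2.0, 3.0 ) ); // 0.5
console.log( logLogisticRandom( 2.0, 3.0 ) );   // a random positive number
```

Because the quantile function has a closed form, the generator needs nothing beyond a uniform source, which is one reason distributions like this fit the first category.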

Why this project?

I'm highly motivated by the opportunity to contribute to a prominent open-source organization like Stdlib. The chance to collaborate with a community developing such an important JavaScript library is a great learning opportunity and great for my resume too. I have had a knack for mathematics and data science since my high school days. I like to learn about the special functions and distributions used to derive results from data, and getting the chance to write those functions in a library for a prominent language like JavaScript is too good an opportunity to miss. Two reasons excite me most:

Empowering the JavaScript community: Currently, JavaScript developers often rely on external libraries like SciPy for comprehensive statistical analysis. This project has the potential to transform Stdlib into a powerhouse for statistical computing within JavaScript. By integrating the wide array of distributions from SciPy's stats module, we can equip the JavaScript data science community with a powerful toolkit directly within their preferred environment.

Expanding my JavaScript and data science skillset: Participating in this project is an opportunity to deepen my understanding of both JavaScript and data science concepts. Implementing these statistical distributions will enhance my programming skills and solidify my grasp of various statistical methodologies.

Overall, the chance to contribute to a prominent community project like this, while expanding my knowledge of both JavaScript and data science, is incredibly motivating. I'm eager to dive in and be a part of making Stdlib an even more powerful tool for web-based numerical computing.

Qualifications

I have good experience in JavaScript from working on my projects and internships, and I also know Python from developing projects for AI courses. I am well versed in object-oriented programming and in C/C++, as I have been practicing data structures and algorithms since the start of college. My academic background has equipped me with a strong foundation for working on this project. Completing a data science course provided a solid understanding of statistical concepts and their applications. Additionally, coursework in Theory of Computation (TOC) and advanced probability has solidified my technical foundation for the algorithmic challenges involved in integrating SciPy's distributions. My contributions to stdlib have familiarized me with the codebase and prepared me to work further on it and enhance its usability. With my programming experience and the knowledge I have acquired in my academic journey, I find myself a good fit for the project, and I am confident I can complete it.

Prior art

After researching the project, I have the following observations:

We need to implement all univariate distributions present in the SciPy library, both continuous and discrete, together with their properties such as probability density functions (PDFs) and cumulative distribution functions (CDFs).
Many discrete and continuous distributions are already implemented in stdlib, along with APIs for random variate generation.

These references are enough to properly understand what must be done in this project and how to proceed.

Commitment

1 May - 26 May -> Bonding Period
27 May - 7 July -> 40 hours/week (40 × 6 = 240 hours)
8 July - 17 August -> 20 hours/week (20 × 6 = 120 hours)
Total = 240 + 120 = 360 hours

I don’t plan to take vacations.

Schedule

Assuming a 12-week schedule,

Community Bonding Period: During this period I will implement distributions that are straightforward and already have the necessary utilities and mathematical functions available. I will start with the Boltzmann, Bradford, half-normal, and log-normal distributions, and take on others if possible. I will also implement the API for drawing random variates and consult with the mentors. I would also like to share my views on which properties of each distribution can be implemented, and on better ways to write documentation and tests using the SciPy library.
Week 1 and Week 2: I will continue with the first-category distributions whose dependencies are already implemented: ARGUS, Planck, Dagum, Gibrat, inverse Weibull, log-logistic, and folded Cauchy. All of these distributions have open issues that are so far unaddressed. I will also implement APIs for drawing random variates. Then, I will evaluate the prepared packages with mentors and discuss any issues. If these are completed successfully, I will try my best to take on more distributions.
Week 3: During this week, I will fix the bugs or issues found in the previous implementations and refine them so that they serve as a base for all further implementations. I will also start working on further distributions.
Week 4 and Week 5: In this period, I will work on the first-category distributions whose implementation is less straightforward. I will start by properly understanding each distribution and its properties, and then implement it following the previously refined packages.
Week 6: (midterm) My goal will be to complete most of the first-category distributions, prepare for the midterm evaluation, and clear any backlog or resolve any outstanding issues or bugs.
Week 7 and Week 8: After gaining experience from the first-category distributions, I will start with the first distributions of the second category: Crystal Ball, Burr (Type III), double Weibull, double Laplace, Maxwell, and Zipf. Here, distributions like Maxwell need gammaincc as a utility function which is not present, so I will add these mathematical functions as well. I will also incorporate APIs for drawing random variates.
Week 9 and Week 10: During these weeks, I plan to finish the work on distributions with complete documentation and testing. I will complete the remaining second-category distributions and address any issues that arise during review of my work.
Week 11 and Week 12: In this phase, I will complete any remaining backlog, thoroughly test the complete functionality against every scenario I can identify, and resolve any bugs. I will also add any additional distribution-related APIs agreed upon in discussions with the mentor.
Final Week: I will complete the project, wrap everything up, submit my work, and take suggestions from mentors.
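The Weeks 7 and 8 item above notes that Maxwell depends on an incomplete gamma function. As a rough illustration of why: the Maxwell CDF with scale `a` can be written as the regularized lower incomplete gamma function P(3/2, x²/(2a²)), and the survival function is its upper counterpart, which is the quantity SciPy's `gammaincc` computes. The sketch below evaluates P(3/2, t) with a textbook power series specialized to s = 3/2 so that no general gamma function is needed. All names are hypothetical, and this is not how stdlib or SciPy implement it; a production version would rely on a properly tested special function.

```javascript
'use strict';

// Illustrative sketch only: names are hypothetical, not stdlib APIs.

var GAMMA_3_2 = Math.sqrt( Math.PI ) / 2.0; // Gamma(3/2)

// Regularized lower incomplete gamma function P(3/2, t) via the power series
//   P(s, t) = t^s * exp(-t) / Gamma(s) * sum_{k>=0} t^k / ( s*(s+1)*...*(s+k) ).
function lowerGamma32( t ) {
	var term;
	var sum;
	var s;
	var k;
	if ( t <= 0.0 ) {
		return 0.0;
	}
	if ( t > 50.0 ) {
		// The upper tail is far below double-precision resolution here.
		return 1.0;
	}
	s = 1.5;
	term = 1.0 / s;
	sum = term;
	for ( k = 1; k < 500; k++ ) {
		term *= t / ( s + k );
		sum += term;
		if ( term < sum * 1.0e-16 ) {
			break;
		}
	}
	return Math.min( 1.0, sum * Math.pow( t, s ) * Math.exp( -t ) / GAMMA_3_2 );
}

// Maxwell CDF with scale parameter a > 0.
function maxwellCDF( x, a ) {
	if ( Number.isNaN( x ) || !( a > 0.0 ) ) {
		return NaN;
	}
	if ( x <= 0.0 ) {
		return 0.0;
	}
	return lowerGamma32( ( x * x ) / ( 2.0 * a * a ) );
}

// Survival function computed as 1 - CDF. This loses precision deep in the
// tail, which is why a dedicated upper incomplete gamma routine is preferable.
function maxwellSF( x, a ) {
	return 1.0 - maxwellCDF( x, a );
}

console.log( maxwellCDF( 1.0, 1.0 ) ); // approximately 0.19875
```

The closed form erf(x/(a√2)) − √(2/π)·(x/a)·exp(−x²/(2a²)) gives the same value at x = 1, a = 1, which is a convenient cross-check when writing tests.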

Notes:

The community bonding period is a 3-week period built into GSoC to help you get to know the project community and participate in project discussion. This is an opportunity for you to set up your local development environment, learn how the project's source control works, refine your project plan, read any necessary documentation, and otherwise prepare to execute on your project proposal.
Usually, even week 1 deliverables include some code.
By week 6, you need enough done at this point for your mentor to evaluate your progress and pass you. Usually, you want to be a bit more than halfway done.
By week 11, you may want to "code freeze" and focus on completing any tests and/or documentation.
During the final week, you'll be submitting your project.

Related issues

No

Checklist

[X] I have read and understood the Code of Conduct.
[X] I have read and understood the application materials found in this repository.
[X] I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
[X] I have read and understood the patch requirement which is necessary for my application to be considered for acceptance.
[X] The issue name begins with [RFC]: and succinctly describes your proposal.
[X] I understand that, in order to apply to be a GSoC contributor, I must submit my final application to https://summerofcode.withgoogle.com/ before the submission deadline.

kgryte commented 3 months ago

@AgPriyanshu18 Thank you for opening this RFC. One suggestion I have is that you be more specific in which distributions you plan to add and when. As a start, you can investigate those mentioned on the stdlib issue tracker: https://github.com/stdlib-js/stdlib/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+distribution+label%3AStatistics.

I'd also suggest exploring a bit more what SciPy has to offer, reading the source code therein, and determining which distributions will be most straightforward to implement. Importantly, you'll want to start off your work working on those distributions for which we have all the requisite functionality (e.g., special math functions, utilities, etc). Otherwise, if you start off by working on a distribution for which we don't already have the prereqs, you'll quickly be blocked and won't be able to make much progress.

As such, I strongly suggest doing a bit more R&D so that your timeline is as informed as possible.

kgryte commented 3 months ago

In fact, to ensure that this project is aligned with your interests, I suggest, if possible, actually trying to find a suitable distribution to add to stdlib now and trying to do so. It will be to your benefit to have a good understanding as to what you'd be signing up for in pursuing this project.

AgPriyanshu18 commented 3 months ago

Thank you @kgryte for your review, and I am sorry for such a late reply. I have completed R&D on all the continuous and discrete distributions present in SciPy, including the distributions for which issues are open. I am very thankful for your suggestion to read the source code of each distribution as well. It took some time, but it has given me a better understanding of how the distributions are implemented in SciPy, for example by reusing one distribution's code to derive the properties of another.

I have arranged the distributions in two categories based on their implementation complexity and their need for other utility functions, and I have listed them under the project goals. I have also included them in my timeline, as you asked. After going through the source code of all the distributions mentioned above, I now have a fair idea of the mathematical and utility functions needed.

AgPriyanshu18 commented 3 months ago

@kgryte I would like to take the issue for the Boltzmann distribution to work on.
