stdlib-js / google-summer-of-code

Google Summer of Code resources.
https://github.com/stdlib-js/stdlib

[RFC]: Implement a broader range of statistical distributions #49

Closed AgPriyanshu18 closed 2 months ago

AgPriyanshu18 commented 3 months ago

Full name

Priyanshu Agarwal

University status

Yes

University name

Indian Institute of Information Technology Jabalpur

University program

Bachelors of Technology in Computer Science and Engineering

Expected graduation

01-08-2025

Short biography

I am Priyanshu Agarwal, a pre-final year Computer Science and Engineering student at the Indian Institute of Information Technology Jabalpur. My background is complemented by hands-on experience in development with languages and frameworks like JavaScript and Node.js, mobile development in Kotlin for Android, and core programming languages C/C++ and Java. Internships, freelance projects, and active contributions to college projects further strengthen this experience. My passion for continuous learning led me to explore new frameworks and technologies within these domains, solidifying my practical understanding.

Beyond web and mobile development, I hold a strong interest in data structures, algorithms, and object-oriented programming (OOP), as reflected in my coursework (Advanced Probability, Data Science, Data Structures & Algorithms, Theory of Computation) and my participation on coding platforms like LeetCode and Codeforces. These platforms have been valuable tools for honing my problem-solving skills. Additionally, I recently began contributing to stdlib, working through its extensive codebase and taking on development tasks. This experience has given me insight into how a larger codebase is structured and into collaborative development practices.

Timezone

Indian Standard Time (IST), GMT+5:30

Contact details

Email: agpriyanshu18@gmail.com / 21bcs167@iiitdmj.ac.in
Phone number: 6378228784
LinkedIn: https://www.linkedin.com/in/priyanshu-agarwal-484151220/
GitHub: https://github.com/AgPriyanshu18

Platform

Linux

Editor

My development environment is Ubuntu, with Visual Studio Code (VS Code) as my primary editor. VS Code offers extensive JavaScript support, including tooling for unit testing and formatting, and its Git integration makes collaboration straightforward. While some might prefer macOS for web development, Ubuntu provides a robust environment for this project, allowing me to be productive and contribute effectively from the start.

Programming experience

I have been expanding my programming skill set since high school, focusing on JavaScript, C, C++, Java, Kotlin, and Node.js. This has given me a strong understanding of these languages and their associated libraries and frameworks. I am also a competitive programming enthusiast and have solved over 300 problems in total across several platforms. To demonstrate my abilities, I'd like to highlight two relevant projects:

  1. Sevayu: This project focused on digitalizing hospital services via a subscription model. Sevayu offered features like online appointment booking, detailed medicine information, and personalized functionalities. Here, I delved into data science applications within JavaScript. To personalize functionalities, I implemented recommendation algorithms based on user medical history and appointment data.

  2. Pocket Manager: A Kotlin/XML/Firebase expense tracker that makes recording expenses simple and accurate. Users gain insights through interactive charts and budgeting tools powered by algorithms that analyze spending data to reveal trends.

These projects showcase my proficiency in working with web APIs, real-time data, and data visualization within a JavaScript environment, all of which are relevant skills for this data analysis library project.

JavaScript experience

I've been actively honing my JavaScript skills, which has given me a solid understanding of the language's syntax, features, and best practices. I've worked on many projects, from building interactive web applications to writing server-side functionality with Node.js, as in Sevayu. What excites me most about JavaScript is its blend of flexibility and power. First-class functions allow me to treat functions like variables, promoting code reusability and a more functional programming style. Dynamic typing, while requiring careful attention, streamlines development and rapid prototyping.

Node.js experience

My passion for backend development led me to Node.js. This platform, combined with the popular Express framework, has become my go-to for building web applications. I've honed my skills by building backends for projects like Sevayu. These real-world experiences solidified my understanding of Node.js and prepared me to solve complex problems effectively. I've also worked with several databases, including MySQL and MongoDB, to tailor data storage to each project. Additionally, I've gained proficiency in writing middleware for authentication and routing.

C/Fortran experience

My programming journey began with C/C++ in high school, where I discovered the power of low-level languages. C's compactness thrived in direct hardware control, perfectly suited for projects like Arduino programming. Its lightning-fast execution speed also made it my weapon of choice in competitive programming. While my experience with Fortran isn't as extensive as with JavaScript or C/C++, I've gained a solid understanding of its strengths in the realm of scientific computing. Fortran's optimized support for numerical operations and arrays makes it a natural choice for tasks like linear algebra and simulations.

Interest in stdlib

The stdlib library offers a robust collection of utility functions and modules that streamline development within the Node.js ecosystem. It covers a wide range of functionality, including mathematical and statistical computations and file system operations. This comprehensive toolkit lets developers of varying skill levels address diverse project requirements efficiently. Furthermore, stdlib prioritizes quality, performance optimization, and adherence to established industry standards. This commitment fosters trust and adoption within the Node.js community. All of this has made me interested in the organization and motivated me to contribute.

Version control

Yes

Contributions to stdlib

Merged

  1. feat: Added utils/none-in-by #1416
  2. feat: add array/base/count-same-value-zero #1384

Open

  1. refactors blas/ext/base/sfill to follow current projects convention #1809
  2. refactors blas/ext/base/snansumpw to follow current project conventions #1711

Goals

The main goal of this project is to implement, directly within stdlib, a broader range of the statistical distributions that are present in the SciPy library. Currently, stdlib users who need a wider variety of statistical distributions often have to turn to external tools such as SciPy. This project aims to remove that dependency by bringing the functionality of SciPy's stats module into stdlib.

By implementing the distributions found in SciPy's stats module, stdlib users will have access to a significantly richer toolkit for statistical analysis. They will be able to calculate key distribution properties such as PDFs, CDFs, and quantiles without leaving the stdlib environment. The ability to generate random variates from these distributions will also enhance stdlib's capabilities for data simulation. This expanded functionality will streamline the workflow of developers working on data analysis projects in JavaScript, allowing them to focus on their core analysis tasks without managing external libraries.
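As a rough illustration of the four kinds of functions named above (PDF, CDF, quantile, and random variate generation), here is a minimal sketch for the standard half-logistic distribution in plain JavaScript. The function names and layout are hypothetical and do not reflect stdlib's actual package structure or API; the code relies only on built-in `Math` functions.

```javascript
'use strict';

// Standard half-logistic distribution on [0, +infinity).
// All names here are hypothetical and for illustration only.

// Probability density function: f(x) = 2*exp(-x) / (1 + exp(-x))^2
function pdf( x ) {
	if ( x !== x ) {
		return NaN;
	}
	if ( x < 0 ) {
		return 0;
	}
	var e = Math.exp( -x );
	return ( 2 * e ) / ( ( 1 + e ) * ( 1 + e ) );
}

// Cumulative distribution function: F(x) = tanh(x/2)
function cdf( x ) {
	if ( x !== x ) {
		return NaN;
	}
	if ( x <= 0 ) {
		return 0;
	}
	return Math.tanh( x / 2 );
}

// Quantile function (inverse CDF): Q(p) = 2*atanh(p)
function quantile( p ) {
	if ( p !== p || p < 0 || p > 1 ) {
		return NaN;
	}
	return 2 * Math.atanh( p );
}

// Random variate via inverse transform sampling.
// `rand` is an optional uniform generator on [0,1); defaults to Math.random.
function random( rand ) {
	var u = ( rand || Math.random )();
	return quantile( u );
}
```

Passing the uniform generator as an argument keeps the sampler testable with a fixed source; a real implementation would presumably plug into stdlib's own seeded PRNGs instead.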

Approach

To accomplish this project, I have categorized the major continuous and discrete distributions by how complex they are to implement and by whether they need additional utility functions.

First category: distributions whose implementations are straightforward and whose required dependencies are all present in stdlib/math/base/special. I have tried to order them by the complexity of the distribution formula: Boltzmann, Bradford, Half-normal, Lognormal, ARGUS, Planck, Dagum, Gibrat, Rademacher, Inverse Weibull, Log-logistic, Anglit, Log-gamma, Gompertz, Folded Cauchy, Half-Cauchy, Half-logistic.

Second category: distributions that are somewhat more complex and require some utility and math functions to be implemented first. They may also build on first-category distributions: Crystal Ball, Burr (Type III), Double Weibull, Double Laplace, Maxwell, Zipf, von Mises, Wald, Non-central Chi-square, Rice, Studentized Range, Skellam, Fatigue Life.
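To give a sense of what a first-category distribution involves, below is a minimal sketch of the Boltzmann (truncated discrete exponential) PMF and CDF, assuming the parameterization SciPy documents for `boltzmann`: pmf(k) = (1 - exp(-λ)) exp(-λk) / (1 - exp(-λN)) for k = 0, ..., N-1. The function names and argument order are hypothetical, and the built-in `Math.exp` and `Math.expm1` stand in for the corresponding stdlib special functions.

```javascript
'use strict';

// Boltzmann (truncated discrete exponential) distribution.
// Hypothetical sketch; names and argument order are assumptions.

// Returns true if the parameters are valid: lambda > 0 and N a positive integer.
function validParams( lambda, N ) {
	return ( lambda > 0 && Number.isInteger( N ) && N > 0 );
}

// Probability mass function for support k = 0, 1, ..., N-1.
function pmf( k, lambda, N ) {
	if ( k !== k || !validParams( lambda, N ) ) {
		return NaN;
	}
	if ( !Number.isInteger( k ) || k < 0 || k >= N ) {
		return 0;
	}
	// expm1(-a)/expm1(-b) equals (1-exp(-a))/(1-exp(-b)) and is more
	// accurate when lambda is small.
	return Math.expm1( -lambda ) * Math.exp( -lambda * k ) / Math.expm1( -lambda * N );
}

// Cumulative distribution function: P(X <= k).
function cdf( k, lambda, N ) {
	if ( k !== k || !validParams( lambda, N ) ) {
		return NaN;
	}
	if ( k < 0 ) {
		return 0;
	}
	if ( k >= N - 1 ) {
		return 1;
	}
	return Math.expm1( -lambda * ( Math.floor( k ) + 1 ) ) / Math.expm1( -lambda * N );
}
```

Both functions need nothing beyond `exp` and `expm1`, which is the sense in which such a distribution has all of its prerequisites available already.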

Why this project?

I'm highly motivated by the opportunity to contribute to a prominent open-source organization like stdlib. Collaborating with a community that develops such an important JavaScript library is a great learning opportunity and valuable for my resume. I have had a knack for mathematics and data science since high school. I enjoy learning about the special functions and distributions used to derive results from data, and getting the chance to write those functions for a library in a prominent language like JavaScript is an opportunity I don't want to miss. Two reasons excite me most:

Empowering the JavaScript community: Currently, JavaScript developers often have to turn to external libraries like SciPy for comprehensive statistical analysis. This project has the potential to make stdlib a powerhouse for statistical computing within JavaScript. By integrating the wide array of distributions from SciPy's stats module, we can equip the JavaScript data science community with a powerful toolkit directly within their preferred environment.

Expanding my JavaScript and data science skill set: Working on the implementation of these statistical distributions will enhance my programming skills and solidify my grasp of various statistical methodologies. The project lets me contribute to a valuable library while strengthening my own skills.

Overall, the chance to contribute to a prominent community project while expanding my knowledge of both JavaScript and data science is incredibly motivating. I'm eager to dive in and help make stdlib an even more powerful tool for web-based numerical computing.

Qualifications

I have good experience in JavaScript from my projects and internships, and I also know Python from developing projects for AI courses. I am well versed in object-oriented programming and in C/C++, as I have been practicing data structures and algorithms since the start of college. My academic background has given me a strong foundation for this project. Completing a data science course provided a solid understanding of statistical concepts and their applications. Additionally, coursework in Theory of Computation (TOC) and Advanced Probability has prepared me for the algorithmic challenges involved in integrating SciPy's distributions. My contributions to stdlib have made me familiar with the codebase and its conventions. With my programming experience and the knowledge I have acquired in my academic journey, I believe I am a good fit for this project and can complete it.

Prior art

After researching the project, I have the following observations.

  1. We need to implement the univariate distributions present in the SciPy library, both continuous and discrete, including their probability density functions (PDFs) and cumulative distribution functions (CDFs).
  2. Many discrete and continuous distributions are already implemented in stdlib, along with APIs for random variate generation.

These references are enough to understand what must be done in this project and how I should proceed.

Commitment

1 May - 26 May -> Bonding Period
27 May - 7 July -> 40 hours/week (40 × 6 = 240 hours)
8 July - 17 August -> 20 hours/week (20 × 6 = 120 hours)
Total = 240 + 120 = 360 hours

I don’t plan to take vacations.

Schedule

Assuming a 12-week schedule,

Notes:

Related issues

No

Checklist

kgryte commented 3 months ago

@AgPriyanshu18 Thank you for opening this RFC. One suggestion I have is that you be more specific in which distributions you plan to add and when. As a start, you can investigate those mentioned on the stdlib issue tracker: https://github.com/stdlib-js/stdlib/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+distribution+label%3AStatistics.

I'd also suggest exploring a bit more what SciPy has to offer, reading the source code therein, and determining which distributions will be most straightforward to implement. Importantly, you'll want to start off your work working on those distributions for which we have all the requisite functionality (e.g., special math functions, utilities, etc). Otherwise, if you start off by working on a distribution for which we don't already have the prereqs, you'll quickly be blocked and won't be able to make much progress.

As such, I strongly suggest doing a bit more R&D so that your timeline is as informed as possible.

kgryte commented 3 months ago

In fact, to ensure that this project is aligned with your interests, I suggest, if possible, actually trying to find a suitable distribution to add to stdlib now and trying to do so. It will be to your benefit to have a good understanding as to what you'd be signing up for in pursuing this project.

AgPriyanshu18 commented 3 months ago

Thank you @kgryte for your review, and I am sorry for such a late reply. I have completed R&D on all the continuous and discrete distributions present in SciPy, including the distributions for which issues are open. I am very thankful for your suggestion to read the source code of each distribution as well. It took some time, but it has given me a better understanding of how the distributions are implemented in SciPy, for example by reusing the code of one distribution to derive the functional properties of another.
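The kind of reuse mentioned above can be sketched as follows: a lognormal PDF written in terms of a normal PDF, and a Gibrat PDF written as the lognormal special case with μ = 0 and σ = 1. This is an illustrative sketch only, not SciPy's or stdlib's actual code, and all function names are hypothetical.

```javascript
'use strict';

// Hypothetical helper names; illustrates building one distribution on another.

// Normal PDF with mean `mu` and standard deviation `sigma` (sigma > 0).
function normalPDF( x, mu, sigma ) {
	var z = ( x - mu ) / sigma;
	return Math.exp( -0.5 * z * z ) / ( sigma * Math.sqrt( 2 * Math.PI ) );
}

// Lognormal PDF: if ln(X) is normal(mu, sigma), then f(x) = normalPDF(ln x) / x.
function lognormalPDF( x, mu, sigma ) {
	if ( x !== x ) {
		return NaN;
	}
	if ( x <= 0 ) {
		return 0;
	}
	return normalPDF( Math.log( x ), mu, sigma ) / x;
}

// Gibrat PDF: the lognormal distribution with mu = 0 and sigma = 1.
function gibratPDF( x ) {
	return lognormalPDF( x, 0, 1 );
}
```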

I have arranged the distributions into two categories based on their complexity of implementation and their need for other utility functions, and I have listed them under the project goals. I have also included them in my timeline, as you asked. After going through the source code of all the distributions mentioned above, I have a fair idea of the mathematical and utility functions needed.

AgPriyanshu18 commented 3 months ago

@kgryte I would like to work on the issue for the Boltzmann distribution.