sarahheayoon / Topological-Data-Analysis

Ken Cooke Summer Research (Pomona College Math Department)

Topological-Data-Analysis

2022 Ken Cooke Summer Research Fellowship

Advisor: Professor Vin de Silva

Motivation behind the research:

The primary goal of this proposed research is to finish work initiated in my senior thesis this year, “Topological Modeling of Higher-Dimensional Complex Data.” Topology is a tool for understanding higher-dimensional vector spaces using topological invariants such as homology and the Euler characteristic. My motivation for the thesis was to understand a different approach to modeling complex data. The classes that influenced the narrative of my thesis were MATH 154 ‘Computational Statistics’ and MATH 158 ‘Statistical Linear Modeling,’ which I took in my senior year. I was interested in the ways computational statistics intersects with deterministic and stochastic mathematical modeling. With the thesis (MATH 191), I was able to understand the fundamentals of topology and how it can be used to model and understand higher-dimensional shapes. Originally, the last chapter of the thesis was dedicated to real-world applications. I was curious how topological modeling of higher-dimensional data can be useful in understanding complex data shapes, and in which areas of data analysis these topological methods might be better, or worse, than traditional machine learning methods at visualizing data and predicting an outcome.

This repository includes three sections.

Research Summary & Takeaways:

With the Ken Cooke Fellowship, I was able to continue researching topology and explore its applications in time-series and spatial-series modelling. Working closely with Professor Vin, one of the highlights of the research was creating a Python module called the ‘Min-Max algorithm,’ which is an integral part of the cohomology Python package. Cohomology is a method of assigning algebraic invariants, built from vector spaces of chains and cochains, to a given object in order to understand its topology. Ultimately, with cohomology, one may parse signals from noisy time-series and spatial-series data by integrating cochains over a given path and calculating the resulting difference. During the research, I was able to deepen my understanding of topology, and I was fortunate enough to get in touch with Prof. Vin’s colleague and her PhD students, Anna and Yueqi, who created a Python library for homology. Extending their homology package, I was able to create a cohomology Python module.
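To make the idea of integrating a cochain over a path more concrete, here is a minimal sketch in plain Python. It is not the Min-Max algorithm or any part of the cohomology package described above; the function names `coboundary` and `integrate`, the toy graph, and the vertex values are all hypothetical and chosen only for illustration. It assumes a 1-cochain is stored as a number on each oriented edge of a graph, and it shows the standard fact that when the cochain is the coboundary of a function on the vertices, its integral along a path depends only on the difference between the endpoint values.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # an oriented edge (tail, head); hypothetical representation


def coboundary(f: Dict[str, float], edges: List[Edge]) -> Dict[Edge, float]:
    """Return the 1-cochain df, where df(u, v) = f(v) - f(u) on each oriented edge."""
    return {(u, v): f[v] - f[u] for (u, v) in edges}


def integrate(cochain: Dict[Edge, float], path: List[str]) -> float:
    """Sum a 1-cochain along a path of vertices.

    An edge traversed against its stored orientation contributes
    with the opposite sign.
    """
    total = 0.0
    for u, v in zip(path, path[1:]):
        if (u, v) in cochain:
            total += cochain[(u, v)]
        elif (v, u) in cochain:
            total -= cochain[(v, u)]
        else:
            raise ValueError(f"no edge between {u} and {v}")
    return total


# Toy data (made up for illustration): a function on four vertices.
f = {"a": 1.0, "b": 4.0, "c": 2.0, "d": 7.0}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
df = coboundary(f, edges)

along_path = integrate(df, ["a", "b", "c", "d"])
around_loop = integrate(df, ["a", "b", "c", "d", "a"])

print(along_path, f["d"] - f["a"])  # the two values agree
print(around_loop)                  # a coboundary integrates to zero around a loop
```

A cochain that is not a coboundary can integrate to a nonzero value around a closed loop, and that kind of discrepancy is what cohomology is designed to detect.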

Besides gaining hands-on experience with applications of topological modelling, another important takeaway from the research was the chance to hone technical skills such as machine learning methods in R and Python. The research was essentially split into two parts. On one hand, I was reading papers about topology. I also took time to read chapters of “Practical Time Series Analysis: Prediction with Statistics and Machine Learning” to understand how time-series data are traditionally handled and what types of questions these data pose. On the other hand, I had coding exercises from the textbook and assignments from Prof. Vin. I spent time watching YouTube lectures and reading GeeksforGeeks and Stack Overflow to understand time-series modelling and coding. As an aspiring data scientist, I am grateful that I was able to hone my coding skills during the research. I was able to translate my knowledge of R and MATLAB to Python with packages like sklearn, pandas, and numpy. Yet, to my surprise, my biggest takeaway from the research is that, in order to become a better statistician and data scientist, asking good questions and coming up with action plans to solve these problems is as important as, if not more important than, the ability to code.

Last but not least, I would like to thank the Cooke Fellowship committee, my advisor Prof. Vin, as well as the family and the donor for giving me this opportunity to learn. I am so grateful that I was able to continue with my thesis even after graduation.