mikeizbicki / cmc-csci046

CMC's Data Structures and Algorithms Course Materials
55 stars 155 forks source link

Research Projects #467

Open mikeizbicki opened 1 year ago

mikeizbicki commented 1 year ago

CMC has a summer research program that pays students to work on research projects with faculty over the summer. I've got a handful of projects that CS46 students would be able to contribute to, and I'd love to work with some of you over the summer on these projects.

If you have any questions about these projects, or want to talk about other research possibilities, please ask!

Project 1: Improve Unicode support for the North Korean language.

Modern computers use a system called Unicode for representing non-English text on computers. Unfortunately, Unicode currently only supports the Korean language as used in South Korea and has limited support for the North Korean dialect. For example, there are certain characters used in North Korean documents that cannot currently be represented on Western computers. This limitation makes it harder for North Koreans to connect to the internet, and for North Korean diplomats to exchange documents with foreign diplomats. The purpose of this project is to add some of these missing features into the Unicode standard. This will enable standard tools like Microsoft Office and Python to work with North Korean text.

You can find more background about this project in Section 1 of the PUST Computer Science and International Standards document.

Project 2: Archival of North Korean webpages.

North Korea uses its webpages for communication with the outside world. For example, the Korean Central News Agency (KCNA) is North Korea's primary venue for distributing official government policies, and they post these policies online at http://kcna.kp. Western political analysts rely on access to this website in order to understand the DPRK's official policy positions. Unfortunately, these webpages have technical problems that prevent Google from searching them or archiving them, and analysts therefore cannot easily research historic North Korean policy. The purpose of this project is to develop custom tools that work around the problems in the North Korean webpages to allow archiving and search.

You can find more background about this project in Section 3 of the PUST Computer Science and International Standards document or this blog post about fixing the KCNA webpage.

Project 3: Teach ChatGPT to read/write Latin/Spanish/German/etc.

ChatGPT is a program that understands English language text. It was created by training a machine learning model on essentially all English language text ever written. A major challenge in extending this technique to non-English languages is that there is much less text in these languages, and good language models require lots of training data. One way to work around this limitation is to try to take advantage of ChatGPT's existing understanding of English. For example, we have good textbooks that can teach a native English speaker how to read and write in a foreign language, and we could feed these textbooks into ChatGPT to teach it to read/write in these other languages. I'm particularly interested in teaching ChatGPT to understand Latin, but I'm also open to working in other languages that students would be interested in.

mikeizbicki commented 1 year ago

Some answers to common questions:

  1. If you're interested in working on these problems outside of the official SRP channel, then talk to me in person and maybe we can arrange something.

  2. In the past, the SRP has only been for CMC students. I've requested that non-CMC students be allowed to participate, but we won't know until later. The program hasn't yet been officially announced for this year.

tnyamuronda commented 1 year ago

@mikeizbicki when do the research projects usually start, immediately after the end of the semester or ...?