CSCI143: Big Data

About the Instructor


Name	Mike Izbicki (call me Mike)
Email	mizbicki@cmc.edu
Office	Adams 216
Webpage	izbicki.me
Research	Machine Learning (see izbicki.me/research.html for some past projects)

Fun facts:

grew up in San Clemente (~1hr south of Claremont, on the beach)
7 years in the Navy
1. nuclear submarine officer, personally converted >10g of uranium into pure energy
2. worked at National Security Agency (NSA)
3. left Navy as a conscientious objector
PhD/postdoc at UC Riverside
taught in DPRK (i.e. North Korea)

About the Course

What is big data?

Depends entirely on the person who is talking.

Most non-computer scientists (muggles) think "too big for Excel"
1. $>1000$ rows
2. $>10$ MB
Facebook considers "tens of petabytes" to be a "SMALL data problem"
One of the biggest problems in industry is that people apply tools built for "Facebook big data" to "muggle big data". A major goal of this course is to teach you why this is bad and how to avoid it.
For us, "big data" means:
1. too big to fit in memory
2. distributed computing helps a lot
3. datasets in the 10GB-10TB range
4. all the interesting/applied parts of upper division computer science compressed into a single course

Who should take this course?

This course is designed for data science majors, not computer science majors. I'm happy to have CS majors in this course (and I think you'll find this course fun), but know that:

you probably have not fully met the prereqs for this course
some material in this course will duplicate material in your other CS courses
1. this is especially true of CSCI133 Databases
2. the course number CSCI143 comes from the fact that all CMC upper division CS courses start with CSCI14, and the 3 is for databases

Prerequisites:

Discrete math: CSCI055 or MATH055
1. Basic probability / counting
2. Basic graph theory
Foundations of data science: CSCI 036, ECON 122, or ECON 160
1. Basic machine learning
2. Basic SQL (also covered in CSCI040 Computing for the Web; not covered in any computer science class except CSCI133 Databases, which you should not take if you take this course)
3. Regular expressions (for CS majors, typically covered in a theory of computing or compilers class)
Data structures: CSCI046 or CSCI70 (Mudd) or CSCI62 (Pomona)
1. All courses cover:
  1. Big-oh notation
  2. Balanced binary search trees
2. CSCI046 covers:
  1. Basic Unix shell commands
  2. Advanced git
  3. Vim text editor
  4. Analyzing multi-gigabyte Twitter datasets
3. The data structures prerequisite CSCI040 covers:
  1. Markdown
  2. HTML / CSS
  3. Basic SQL
  4. Programming web servers with the flask library
  5. Web scraping with the requests and bs4 libraries

Relation to other CS courses:

One purpose of this course is to provide DS majors with an overview of CS concepts. Therefore, there is a lot of material in this course that is covered in other upper division CS courses required for CS majors.

Overlapping concepts
1. CSCI105 Computer Systems (10% overlap)
  1. types of storage: tape vs HDD vs SSD vs NVMe vs RAM
  2. RAID
  3. parallel vs distributed architectures
2. CSCI135 Operating Systems (10% overlap)
  1. permissions systems
  2. processes vs threads
  3. virtual machines vs containers
3. CSCI125 Networking (10% overlap)
  1. private vs public networks
  2. IP addresses
  3. TCP ports
  4. virtual networks
4. CSCI121 Software Development (10% overlap)
  1. version control systems (i.e. git)
  2. test-driven development / continuous integration
  3. microservices vs monolithic architectures
  4. 12-factor applications
5. CSCI133 Databases (50% overlap)
  1. SQL
  2. ACID/MVCC/transactions
  3. indexing techniques
6. A lot of the concepts we'll be covering "should" be covered in other CS courses, but they often aren't, because CS professors tend to be more theory-minded than practice-minded. In that sense, this course is similar to the Missing Semester of Your CS Education course taught at MIT.
Concepts we don't cover from CSCI133 Databases
1. relational algebra
2. technical implementation details / C programming
3. relationship between the database and operating system
Big data concepts from a CS perspective that we will not talk about:
1. Frameworks for distributed computation (e.g. Apache Hadoop, Apache Spark)
2. Distributed filesystems (e.g. HDFS, IPFS); we will, however, talk about S3
3. Geo-distributed databases

Textbook:

Big data is a rapidly changing field, and all currently printed textbooks are both incomplete and already out of date. Therefore, we won't be using a textbook. Instead, we will be using online documentation. The main references we will use are given below, but I will provide more specific links each week.

Grades

Assignments:

Weekly labs (worth 2**1 points)
Weekly quizzes (worth 2**2 or 2**3 or 2**4 points)
Weekly homeworks (worth 2**3 or 2**4 or 2**5 points)
2 exams (worth 2**6 points each)
1. Non-graduating students will complete a final project due during finals week.
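The point values above are written with Python's exponent operator (`**`). As a quick illustration, the snippet below evaluates them; the dictionary name `point_values` is only an illustrative choice, and the snippet says nothing about how many labs, quizzes, or homeworks there will be.

```python
# Point values from the assignment list, written in Python's exponent
# notation (** means "raised to the power of").
# `point_values` is a hypothetical name used only for this illustration.
point_values = {
    "lab": [2**1],
    "quiz": [2**2, 2**3, 2**4],
    "homework": [2**3, 2**4, 2**5],
    "exam": [2**6],
}

for kind, values in point_values.items():
    print(kind, values)
```

So a lab is worth 2 points, a quiz 4 to 16, a homework 8 to 32, and each exam 64.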

All assignments are explicitly designed to help you get a good job after graduation. They will help build your GitHub "portfolio" and give you cool things to talk about during interviews. These assignments are all very practical, and not "leetcode" or "mathy".

You will receive extra credit for pull requests to this repo or any submodule.

Late Work Policy:

You lose 2**(i-1) points on every assignment, where i is the number of days late.

Do not expect partial credit for incomplete assignments. It is much better to submit a correct assignment late than an incorrect one on time.
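To see how quickly the late penalty grows, here is a minimal sketch of the formula `2**(i-1)`. The function names are hypothetical, and two details are assumptions not stated in the policy: on-time work (`i = 0`) loses nothing, and a score cannot drop below zero.

```python
def late_penalty(days_late: int) -> int:
    """Points lost for an assignment submitted `days_late` days late."""
    if days_late <= 0:
        # Assumption: on-time work loses nothing (the formula is only
        # stated for late submissions).
        return 0
    return 2 ** (days_late - 1)


def score_after_penalty(points_earned: int, days_late: int) -> int:
    """Score that remains once the late penalty is subtracted."""
    # Assumption: a score cannot drop below zero.
    return max(0, points_earned - late_penalty(days_late))


# A fully correct homework worth 2**4 = 16 points:
for days in range(6):
    print(days, late_penalty(days), score_after_penalty(2**4, days))
```

Under these assumptions, a correct 16-point homework still earns 12 points when it is three days late, but nothing once it is five days late.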

Grade Schedule:

Your final grade will be computed according to the following standard table, with the caveats described below.

If your grade satisfies	then you earn
95 ≤ grade	A
90 ≤ grade < 95	A-
87 ≤ grade < 90	B+
83 ≤ grade < 87	B
80 ≤ grade < 83	B-
77 ≤ grade < 80	C+
73 ≤ grade < 77	C
70 ≤ grade < 73	C-
67 ≤ grade < 70	D+
63 ≤ grade < 67	D
60 ≤ grade < 63	D-
grade < 60	F
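The table can be read as a simple lookup on the lower bound of each row. The sketch below is one way to express it; `GRADE_CUTOFFS` and `letter_grade` are hypothetical names, the grade is assumed to be a percentage, and the function applies no rounding and ignores the caveats described in the next section.

```python
# Lower bounds copied from the grade table, highest first.
GRADE_CUTOFFS = [
    (95, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(grade: float) -> str:
    """Return the letter for a percentage grade, before any caveats."""
    for cutoff, letter in GRADE_CUTOFFS:
        if grade >= cutoff:
            return letter
    # Anything below the lowest cutoff (60) is an F.
    return "F"


print(letter_grade(91.5))
```

Because every row is a half-open interval, a grade exactly on a boundary (e.g. 90) earns the higher letter.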

Caveats:

There are 2 "caveat tasks" in this course. These tasks should be easy, and everyone will get full credit on the task just for completing the task. If you don't complete one of the tasks, however, your grade (from the table above) will be docked 10%. (For example, an A- grade would become a B- grade.) You have the entire semester (until I submit grades) to complete these tasks.

You can find the details about the caveat tasks at:

Academic Integrity

Technology Policy:

You MAY use any AI tool without restriction.
You MUST complete all programming assignments on the lambda server.
You MUST use either vim or emacs for all text editing.

In particular, you MAY NOT use the GitHub text editor, VSCode, IDLE, or PyCharm for any reason.
You MAY NOT share your lambda server credentials with anyone else.

Collaboration Policy:

There are no restrictions on what you can post to GitHub Issues. In particular, you are highly encouraged to post detailed questions/answers/comments with lots of code.
You are highly encouraged to collaborate with students
1. in class/lab,
2. in the QCL,
3. and in office hours.
I trust you all to be reasonable and to ensure that collaboration benefits your learning and is not merely copying work.
You MAY NOT collaborate with students in any other context.
You MAY NOT look at another student's code on GitHub.

All projects are developed as open source projects, and so the code is published openly online. The benefits of this model include: (1) you actually learn how to develop/contribute to open source projects; (2) future employers see you have GitHub activity. Please do not abuse this privilege.

Accommodations

I've tried to design the course to be as accessible as possible for people with disabilities. (We'll talk a bit about how to design accessible software in class too!) If you need any further accommodations, please ask.

I want you to succeed and I'll make every effort to ensure that you can.