sterrettJD / gpLM-reading-group


some curriculum suggestions #3

Open · zmaas opened this issue 2 weeks ago

zmaas commented 2 weeks ago

Hey John! Here's the curriculum that I've worked on in the past. It's a bit less focused on language models as a sole topic, and more on modern ML from a broad perspective.

zmaas commented 2 weeks ago

Also likely useful for more advanced topics is this curriculum from Waterloo (grad level CS seminar) on recent language model advances: https://cs.uwaterloo.ca/~wenhuche/teaching/cs886/

sterrettJD commented 2 weeks ago

Thanks Zach! I'm looking through these resources and trying to determine what's necessary and what would be overkill, given the audience.

I'm also looking through some of the resources this course has been assembling: https://github.com/Multiomics-Analytics-Group/course_protein_language_modeling

sterrettJD commented 2 weeks ago

In a similar vein, @casey-martin says:

Relevant topics:

  • context length (Mamba, Hyena, State Space Models)
  • multimodality (combined DNA/prot/RNA, seq-to-struct)
  • reinforcement learning (experimental and computational feedback)

How much background knowledge will the group have? Do they know what attention is? MLM vs autoregressive vs diffusion? Tokenizers?

I think we should probably assume that the group does not know what attention is, nor do they know about the different kinds of models or tokenizers. The primary audience is computational biologists - we'll have a good number of people from EBIO and MCDB at CU, but they aren't going to be people who have already used language models. Once we get into the applications, I imagine we could have more CS-type folks joining.
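
To give a sense of the level I'm picturing for those intro sessions, something like the toy numpy sketch of scaled dot-product attention below (just an illustration I put together, not code from any of the linked curricula):

```python
# Toy scaled dot-product attention in numpy -- a sketch of the core operation,
# not any particular model's implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d). Returns an array of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise token-token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # each output is a weighted mix of value vectors

# Toy usage: 4 "tokens" with embedding dimension 8, attending to themselves
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)
```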

I think we should probably have two seminars devoted to the basics (it would be best to have someone from an NLP research group at CU come talk), then jump into some genomic models and start discussing drawbacks (e.g., why context length matters) by layering in some of these concepts.
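
As a rough back-of-the-envelope for the context-length point (my numbers, just to motivate why Mamba/Hyena/state-space models keep coming up): vanilla attention builds a seq_len x seq_len score matrix, so memory grows quadratically with sequence length.

```python
# Back-of-the-envelope: memory for one float32 attention score matrix
# (one head, one layer) at genome-scale context lengths.
for seq_len in (1_000, 10_000, 100_000):
    entries = seq_len * seq_len          # seq_len x seq_len attention scores
    gb = entries * 4 / 1e9               # 4 bytes per float32 entry
    print(f"{seq_len:>7} tokens -> {gb:,.3f} GB")
# ~0.004 GB at 1k tokens, ~0.4 GB at 10k, ~40 GB at 100k -- which is why
# long-context genomic models lean on sub-quadratic architectures.
```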

What do you two think?