Open jpivarski opened 1 year ago
Hi, Jim,
I just finished touching base with all of the WATCHEP trainees to get a list of possible topics. I'll dump them here and let everyone comment, upvote, etc.
Other topics that may or may not be in your wheelhouse:
Hello Jim, all,
I'm a grad student in CMS, in TAC-HEP, and would be interested in Dask.
I have not used Awkward Arrays, having focused on RDataFrame and C++ for a physics analysis, but I'm very curious about best practices for generating many (~hundreds of) histograms that are slightly different, like for systematics in an analysis. But since not everyone is in collider physics, maybe this is too niche.
Hi Jim,
I'm a grad student in cosmology in the TAC-HEP program. I'd be interested in accelerating Python through any of the methods above, as well as array-oriented programming.
If this is in your area, I'd also be interested in learning about Git more thoroughly.
Thanks for the suggestions so far!
git

I only know a handful of git commands[^1], fewer than the HSF/Software Carpentry tutorial on git covers, but it has been enough to be productive. I would have more to say on issue/PR/release workflows (the GitHub or GitLab features), and maybe this could be combined with a sample project that also demonstrates automated tests? I'd have to think of what that sample project could look like, and make sure that it doesn't take too much time.

RDataFrame's Vary is a nice solution to this problem. Maybe we should have some content on the Awkward Array ←→ RDataFrame interface?

All in all, the above is more than I could cover in 6 hours, but we can keep brainstorming before we have to set priorities.
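The pattern behind this request (many slightly different histograms derived from the same columns) can be sketched without ROOT at all. Below is a minimal NumPy illustration, not RDataFrame's actual `Vary` API: the variation names, scale factors, and binning are all made up for the example.

```python
import numpy as np

# Hypothetical example data: one "column" of transverse momenta.
rng = np.random.default_rng(42)
pt = rng.exponential(scale=30.0, size=100_000)

# Systematic variations expressed as scale factors on the same column.
# (Illustrative names and values; a real analysis takes these from calibrations.)
variations = {
    "nominal": 1.00,
    "jes_up": 1.05,    # jet energy scale +5%
    "jes_down": 0.95,  # jet energy scale -5%
}

bins = np.linspace(0, 200, 51)

# One histogram per variation: the varied column is recomputed from the
# nominal one, so adding a new variation is a single dictionary entry.
histograms = {
    name: np.histogram(pt * factor, bins=bins)[0]
    for name, factor in variations.items()
}

print({name: int(counts.sum()) for name, counts in histograms.items()})
```

The point of the pattern is that the per-variation bookkeeping collapses to a mapping over a dictionary, which is essentially what `Vary` automates inside an RDataFrame event loop.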
[^1]: Just enough to git around.
Hi Jim,
I'm a grad student in ATLAS (in WATCHEP) and I think the main thing I'd be interested in is some more advanced ways to speed up python (Dask and Numba tutorials sound great to me). I've worked with awkward before but I would also support doing some more things related to array programming. Ideally I'd want to take away some things to speed up my data processing pipelines. Thanks for putting this together!
Hi Jim!
I am a graduate student working on CMS. I have an analysis I am working on, and I work on the Elastic Analysis Facility (EAF) at Fermilab. I have moderate python experience, I usually just look stuff up whenever I need it. I have experience with C++, Coffea, git, and ROOT (although it's quite rusty). I have a little experience with docker, helm, and kubernetes (very little).
I am interested in learning more about the following:
I'd also be interested in learning about ADLs or machine learning. I have also taken a software course that went over object oriented python, CI/CD, and tests but I wouldn't mind going over some CI/CD or tests topics again.
Okay, another +1 for Dask; I should definitely involve Dask and dask-awkward in a presentation of columnar analysis.
I felt confident to talk about machine learning in 2015, but much less so today.
The trouble with a topic like "advanced Python" is that I don't know how to tie it together into a coherent story. Henry does a good job of it with Level Up Your Python, which I highly recommend. I'll keep thinking about it, though.
Thanks for the suggestions!
Hi Jim, I am a CMS graduate student in the TAC-HEP program. I have a fairly reasonable amount of experience with C++ and python. Of the topics you listed I would be most interested in learning about array/columnar techniques, dask and numba.
Hi Jim, Continuing a theme: I'm also a CMS grad student in TAC-HEP, and would be very interested in Dask/Numba and array/columnar techniques. I'll also throw Kubernetes and Condor out there, but we already have a pretty sizable list of interesting-sounding topics going. Thank you!
Thanks for the suggestion! I'll add a +1 to Dask/Numba (it's looking like that's going to be the core of what I'll talk about).
I'm not really qualified to teach Kubernetes and Condor, though. (Sorry! But it was worth suggesting.)
Howdy Jim! I'm one of the Matts participating in WATCHEP, but not in collider physics! I'm working on some of the software infrastructure and analysis pipelines within LSST DESC.
Out of the topics listed above, a discussion of how to bring GPU parallelization into our development would be helpful: maybe best practices or design patterns within Python that best allow for GPU-parallelized code (if any exist!).
I think this would go into the category of interfacing Python with C++ (i.e., having my Python call some lower-level GPU code to perform numerically intense integrations). Or maybe this falls into the Dask category discussed above?
Thanks and looking forward to the school!
Thanks!
I now know that this traineeship summer school will be July 24‒28. I'll be teaching on July 24 (Monday) and will be helping out with a coding jam that extends over the whole week. (I'll have limited availability after Monday, since I'll be convening another workshop at the same time.)
From the above, a central theme kept coming up: columnar analysis, with vertical scale-out via Numba and horizontal scale-out via Dask, so I'll focus on that. I'll be presenting material on columnar analysis in general at CoDaS-HEP, which many of you will be at, so I can use the two 1.5-hour blocks I'll have on July 24 for vertical and horizontal scale-out.
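For anyone unfamiliar with the "columnar" part of that theme, here is a minimal sketch of the idea using only NumPy (no Numba or Dask assumed installed): the same computation written once row-by-row in pure Python and once as whole-column expressions. The column names and sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "columns": px, py for a million particles.
px = rng.normal(size=1_000_000)
py = rng.normal(size=1_000_000)

# Row-wise (slow in pure Python): one interpreted step per particle.
def pt_rowwise(px, py):
    return [(x * x + y * y) ** 0.5 for x, y in zip(px, py)]

# Columnar (fast): one NumPy expression over the whole columns at once.
def pt_columnar(px, py):
    return np.sqrt(px**2 + py**2)

pt = pt_columnar(px, py)

# Same answer either way (checked on a small slice to keep the loop cheap).
assert np.allclose(pt[:1000], pt_rowwise(px[:1000], py[:1000]))
print(pt.shape)
```

Vertical scale-out (Numba) compiles the row-wise style to machine speed; horizontal scale-out (Dask) partitions the columns across workers, but the columnar expression itself stays the same.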
Second to the above, unit tests/CI and git practices also came up fairly often. I think we should integrate those into the coding jam and I'll be talking with the main developer of those exercises tomorrow.
@mattkwiecien, GPU parallelization would be an interesting topic, but it would involve more set-up (getting shared resources with GPUs on the day of the training—which is not insurmountable) and hasn't been a central theme of the requests. Maybe I can do some of the Numba examples with CUDA (Numba has a CUDA backend), but the main thing I should point you toward is CuPy, if you haven't already heard of it, and its ability to JIT-compile kernels in particular. That gives you a nice Python-C++ interface that's directly focused on GPUs.
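One practical consequence of CuPy mirroring NumPy's API is that array-oriented code can often be written once and run on either device. Here is a minimal sketch, assuming only that documented NumPy/CuPy compatibility; the GPU lines are left as comments since a GPU is not assumed here, and the function itself is an invented example.

```python
import numpy as np

def zscore(xs):
    # Uses only arithmetic and methods that NumPy and CuPy share, so the
    # same function runs on CPU (numpy.ndarray) or GPU (cupy.ndarray).
    return (xs - xs.mean()) / xs.std()

cpu = zscore(np.arange(10.0))
print(cpu)

# On a machine with CuPy and a GPU (not assumed here), the same function
# would execute entirely on the device:
#
#   import cupy as cp
#   gpu = zscore(cp.arange(10.0))  # all work happens on the GPU
#   back = cp.asnumpy(gpu)         # copy the result to the host
```

For the hand-written kernels mentioned above, CuPy's `ElementwiseKernel` and `RawKernel` are the JIT-compilation entry points: you supply CUDA C source as a string and call it on CuPy arrays from Python.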
I've been asked to present a tutorial to TAC-HEP and WATCHEP trainees, but I'd like to get a sense of your level and topics of interest, first. Let me know in this issue thread what you want to learn about and how much you know already.
To get a sense of what my tutorials are usually like, see my previous ones.
The date is still to be determined, the duration is anywhere between 2 and 6 hours, and it is likely to be over Zoom (if not somehow connected with CoDaS-HEP, but we're all already busy then).
If you want to try something outside my normal set of topics, I'm definitely open to it, but we'll iterate here, in this issue thread, to converge on something realistic.