minnervva / torchdetscan

This is a tool for finding non-deterministic functions in your PyTorch code.
https://github.com/minnervva/torchdetscan
MIT License

SummitPLUS allocation #14

Open markcoletti opened 10 months ago

markcoletti commented 10 months ago

We may need a SummitPLUS allocation to run determinism tests on a different hardware platform.

markcoletti commented 10 months ago

@gitrepoidoscar, @elwasif, and @asedova, the deadline for requesting a SummitPLUS allocation is the 30th, presuming we want one.

asedova commented 10 months ago

Do we want one? I don't anticipate us needing 100K node hours for this project, but you never know. I just wrote one for Slava and am about to write another, and they are not that easy; there are quite a few sections.

If we were to request ~80K node hours, what would we do with them? Only the data generation for DeePMD really uses a lot of compute (and any iterative active learning).

But maybe better safe than sorry.

markcoletti commented 10 months ago

> But maybe better safe than sorry.

That's what I was thinking. Worst case, if we decide not to do this, I'm probably going to get a SummitPLUS allocation for a different project, and we can use some of the time from that. Regardless, as you stated, it can't hurt to have our own!

asedova commented 10 months ago

Problem is, we need an extensive justification for leadership time, which I don't know if we have.

markcoletti commented 10 months ago

I sat next to Ashley Barker last week at the OLCF Users Meeting, and she encouraged me to pass on to others to put in requests. I got the impression they were going to green-light whatever they get and let it be a battle royal, or something. In any case, it can't hurt to try. The worst they can do is bounce it, in which case we have alternatives.

asedova commented 7 months ago

We had decided to use the one related to Oscar's EXPRESS project and allow some MINNERVVA work on it. Oscar's EXPRESS application was rejected, so I guess we need to work on getting a DD allocation on Frontier instead.