mlcommons / power-dev

Dev repo for power measurement for the MLPerf™ benchmarks
https://mlcommons.org/en/groups/best-practices-power
Apache License 2.0
16 stars 24 forks source link

Fix for maxAmps setting #284

Closed arjunsuresh closed 1 year ago

github-actions[bot] commented 1 year ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

dmiskovic-NV commented 1 year ago

can i suggest a change? if we pass current as argument, perhaps it would make sense to not blindly apply it, but rather run it through 3.33x (to address #280 )?

arjunsuresh commented 1 year ago

Hi @dmiskovic-NV, since this option is currently for experimental purpose and for advanced users I guess we don't need to do the scaling. Ideally at 220V, one should set 500mA as the current range for devices below 50W and 1A as the current range for devices between 50-100W.

dmiskovic-NV commented 1 year ago

That is exact problem. If you have 50W device, setting current range to 500mA will not be able to catch all the peak current values (due to crest factor) and once person runs ranging+testing, range will be different and they might end up with surprisingly higher power. So i just figured we should be mindful of that and save people from unnecessary debugging of non-existent problems :)

arjunsuresh commented 1 year ago

@dmiskovic-NV That's true. But doing a 3.3 to 500mA will make the measurements uncertain if the power is around 20W (found during my tests) . So, I think its best to leave this to the expertise of the person using it.

dmiskovic-NV commented 1 year ago

perhaps it might make sense then to add it to the warning? (to same warning that the option is for internal testing only)

arjunsuresh commented 1 year ago

Please feel free to add if you have an appropriate one :)

psyhtest commented 1 year ago

But doing a 3.3 to 500mA will make the measurements uncertain if the power is around 20W (found during my tests)

I also see a similar thing. Under the new behaviour, the range gets set to 1A for a 23-24W device (jumping around 200 mA), and I see uncertainty errors even during the testing run. Hope they are not going to invalidate runs?

arjunsuresh commented 1 year ago

@psyhtest Unfortunately the checker in current state will invalidate that result. Probably instead of 3.3X, we should do 2X or just multiply by 1/power factor.

dmiskovic-NV commented 1 year ago

i've seen systems fail with x2 and test PTD :/ have you folks noticed power regressions btw? whatever we decide here will be temporary until we get official PTD that is able to flag overloads. multiplication by 1/power factor is interesting idea, but i don't think it will work since THD (which is cause or indicator of high crest factor) and power factor are not ideally related

arjunsuresh commented 1 year ago

Then may be using crest factor 6, is a way to go for low power devices for this round?

dmiskovic-NV commented 1 year ago

let's put it this way. using 2x range (virtual crest factor of 6) will give more realistic power than using original range. however, until we get official PTD that does all the checks, we will only be sure we have power that is "measured more accurately than before", but we will not know by how much and if it is measured accurately

s-idgunji commented 1 year ago

Can we review this in our next meeting. I don't think we got enough attention to this issue in our Power WG that just wrapped up today. @araghun - FYI ... I think we need to revisit the crest factor based scaling.

arjunsuresh commented 1 year ago

@dmiskovic-NV but we need a way for submitters to submit power results for this round. Either we should change the 3.3x multiplication to 2x or we should tell submitters to use crest factor 6 for low power devices. Like @psyhtest reported, if power submissions done via ranging + testing runs have uncertain samples (due to higher current range) during the testing phase, submission checker won't accept them.

s-idgunji commented 1 year ago

Or we revert back to 2.1 like approach then figure out when PTD official release comes around. We also have to find a way to message the change.

arjunsuresh commented 1 year ago

@s-idgunji Actually code freeze is already over and from this Friday official results are going to be accepted for inference round. So any change we should do as soon as possible. The below issue is also critical as it is not allowing submissions with the latest PTD. https://github.com/mlcommons/power-dev/issues/285

s-idgunji commented 1 year ago

Understood, we can have a one off meeting this week. I would want to have a consistent approach that is ok with all submitters. This bit that is more specific to your runs is not necessarily good for all submitters. So it's better to revert to 2.1 approach if we are out of time.

Probably instead of 3.3X, we should do 2X or just multiply by 1/power factor.

arjunsuresh commented 1 year ago

I'm not sure if I'll be able to join the meeting but any solution which let the valid submissions pass the submission checker is good to us. Some of these issues we could have reported earlier but we didn't know that "--more-power-checks" is a mandatory option to the submission checker and so these issues weren't seen until we turned that option on.

s-idgunji commented 1 year ago

@arjunsuresh - If we need to reach a solution on short notice due to freeze last Fri, then we need to move quick quickly. Hence we were checking for a meeting tomorrow or so.

arjunsuresh commented 1 year ago

@s-idgunji that's fine. Just to be clear this PR is not related to the issue we are discussing and is not affecting the submissions as its affecting only the experimental feature of avoiding the manual mode. If this can be merged we can focus on the main issues for the submission.

dmiskovic-NV commented 1 year ago

@arjunsuresh Just looked at the PR code, it practically undoes #281 . If those 2 lines are changed, _desirableCurrentRange is not used, am i not right?

arjunsuresh commented 1 year ago

@dmiskovic-NV my bad. Yes, you are correct and I was thinking this code block is specific to the manual range setting scenario. Though unintentional if we want to revert #281 this PR can be merged.

arjunsuresh commented 1 year ago

@dmiskovic-NV should I fix this PR or wait for the WG decision?

s-idgunji commented 1 year ago

It may be ok to revert #281 and then consider a robust fix after along with the messaging. Let's get consensus in the PR

Sachin


From: Arjun Suresh @.> Sent: Tuesday, February 7, 2023 6:40 PM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

@dmiskovic-NVhttps://github.com/dmiskovic-NV my bad. Yes, you are correct and I was thinking this code block is specific to the manual range setting scenario. Though unintentional if we want to revert #281https://github.com/mlcommons/power-dev/pull/281 this PR can be merged.

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1421897411, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVIZBO7CHBXXYEP44Q3WWMBRPANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.***>

arjunsuresh commented 1 year ago

Thank you @s-idgunji. #281 is not affecting our submissions and so we are fine having it, adjusting it or even reverting.

dmiskovic-NV commented 1 year ago

didn't you say you are seeing high uncertainty at 20W?

arjunsuresh commented 1 year ago

@dmiskovic-NV That's during our tests but not a submission system. The only thing blocking our power submission is PTD 1.10.0 not being supported in the checker.

dmiskovic-NV commented 1 year ago

After running some numbers from Yokogawa's manual, I have to agree with @psyhtest. When running low loads at high voltage, inaccuracy has tendency to be on a higher side when using higher range to avoid overload. I think we should merge this (so we undo 281) and we'll think of better patch for v3.1

s-idgunji commented 1 year ago

Sounds good. Thanks Dejan. Can you help do that ? When we need to get to commit to main , let us know. I can help.

Sachin

From: dmiskovic-NV @.> Sent: Tuesday, February 7, 2023 10:05 PM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

After running some numbers from Yokogawa's manual, I have to agree with @psyhtesthttps://github.com/psyhtest. When running low loads at high voltage, inaccuracy has tendency to be on a higher side when using higher range to avoid overload. I think we should merge this (so we undo 281) and we'll think of better patch for v3.1

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1422061527, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVMKBEL6SEFGB5SUODDWWMZR5ANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

s-idgunji commented 1 year ago

@dmiskovic-NV , @arjunsuresh , @psyhtest - Can I review and merge this so we go back to the approach from #281 and then consider that fix in v3.1 with sufficient time ?

dmiskovic-NV commented 1 year ago

I think that is the best course of action. once v3.0 settles, let's plan for big work on v3.1 with sufficient time for testing all edge cases

arjunsuresh commented 1 year ago

@dmiskovic-NV Actually this PR also removes the effect of 10% scaling which I believe is the old behaviour. We need to add this too right?

s-idgunji commented 1 year ago

Is it possible to roll in a consistent change to avoid thrashing? Do we want to have a quick meeting given the urgency?

From: Arjun Suresh @.> Sent: Wednesday, February 8, 2023 10:20 AM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

@dmiskovic-NVhttps://github.com/dmiskovic-NV Actually this PR also removes the effect of 10% scalinghttps://github.com/octoml/power-dev/blob/master/ptd_client_server/lib/server.py#L973 which I believe is the old behaviour. We need to add this too right?

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1423052311, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVP7IFDDGEY7FFH44ALWWPPVNANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

dmiskovic-NV commented 1 year ago

It goes exactly back to old behavior. 10% was implied in meter

Thanks, Dejan


From: s-idgunji @.> Sent: Wednesday, February 8, 2023 10:21:40 AM To: mlcommons/power-dev @.> Cc: Dejan Miskovic @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

Is it possible to roll in a consistent change to avoid thrashing? Do we want to have a quick meeting given the urgency?

From: Arjun Suresh @.> Sent: Wednesday, February 8, 2023 10:20 AM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

@dmiskovic-NVhttps://github.com/dmiskovic-NV Actually this PR also removes the effect of 10% scalinghttps://github.com/octoml/power-dev/blob/master/ptd_client_server/lib/server.py#L973 which I believe is the old behaviour. We need to add this too right?

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1423052311, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVP7IFDDGEY7FFH44ALWWPPVNANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

— Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmlcommons%2Fpower-dev%2Fpull%2F284%23issuecomment-1423054072&data=05%7C01%7Cdmiskovic%40nvidia.com%7Cfa028c3225944f7df77408db0a015318%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638114773024188044%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=SoYW01HtKcV95VAM88KO1%2BBKWU156jiUViPPyzJF%2FSY%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FASV77IF4SUKNA2H6YUDJA5TWWPP3JANCNFSM6AAAAAAUT46FZ4&data=05%7C01%7Cdmiskovic%40nvidia.com%7Cfa028c3225944f7df77408db0a015318%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638114773024188044%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=6kKp%2FLbF3Ls38w5KXOz6me8ISbzcTPf0VlPU6Xxiqxo%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

arjunsuresh commented 1 year ago

Thanks @dmiskovic-NV. I see that the 1.1 scaling was also new. I have now reworked my PR to be consistent with your changes. But it is still using 1.1 scaling for > 75W and 1.5 scaling for < 75W which I believe is safe for all current ranges (probably not accurate enough but still better than scaling of 1). If everyone wants I can change this to 1 (restore the old behaviour) but since the issue is now public submitters might be tempted to abuse it.

s-idgunji commented 1 year ago

I do not understand this point about this “ If everyone wants I can change this to 1 (restore the old behaviour) but since the issue is now public submitters might be tempted to abuse it.” We want to make sure this is consistent across users. Are you saying anyone can just change this? Will we not have it validated through logs and submission checkers?

From: Arjun Suresh @.> Sent: Wednesday, February 8, 2023 10:48 AM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

Thanks @dmiskovic-NVhttps://github.com/dmiskovic-NV. I see that the 1.1 scaling was also new. I have now reworked my PR to be consistent with your changes. But it is still using 1.1 scaling for > 75W and 1.5 scaling for < 75W which I believe is safe for all current ranges (probably not accurate enough but still better than scaling of 1). If everyone wants I can change this to 1 (restore the old behaviour) but since the issue is now public submitters might be tempted to abuse it.

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1423083370, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVOLGP4E52MSURUWDKLWWPS4ZANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

arjunsuresh commented 1 year ago

No Sachin. What I meant is if we restore the old behaviour exactly, submitters can then use the lowest crest factor on the device and this can lead to lower power being captured (for devices around 20-30W power range) which will pass through all our checks. So, a small scaling of the maxAmps is good to have.

dmiskovic-NV commented 1 year ago

by deafult, Yokogawa uses crest factor of 3 (which is lowest).

arjunsuresh commented 1 year ago

yes, the problem can happen at crest factor 3 but less likely at crest factor 6.

s-idgunji commented 1 year ago

Hi Arjun- Can you clarify

submitters can then use the lowest crest factor on the device and this can lead to lower power being captured (for devices around 20-30W power range)

My understanding is that the factor is 3.0 and consistent across all submitters. What is the issue? How can one set even lower ?

From: Arjun Suresh @.> Sent: Wednesday, February 8, 2023 11:15 AM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

No Sachin. What I meant is if we restore the old behaviour exactly, submitters can then use the lowest crest factor on the device and this can lead to lower power being captured (for devices around 20-30W power range) which will pass through all our checks. So, a small scaling of the maxAmps is good to have.

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1423114530, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVJJJ6ITKLSVSQ6APH3WWPWEZANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

dmiskovic-NV commented 1 year ago

@arjunsuresh agreed, however, we don't have programatical/systemic way to set crest factor to 6 (it is beyond the scope of current PTD). And all submitters on v2.1 were using out-of-the-box settings on Yokogawas (which is 3)

s-idgunji commented 1 year ago

So , we agree that all submitters will use CF of 3.

From: dmiskovic-NV @.> Sent: Wednesday, February 8, 2023 11:26 AM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

@arjunsureshhttps://github.com/arjunsuresh agreed, however, we don't have programatical/systemic way to set crest factor to 6 (it is beyond the scope of current PTD)

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1423128157, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVL67VAW4QXTN6CSGSDWWPXODANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

arjunsuresh commented 1 year ago

@s-idgunji I don't think we can assume that - for example our power meter is currently at crest factor 6. But my point is not just about the crest factor but when we know the power range where the problem is happening submitters might try to optimize accordingly.

I'm assuming a power factor of 0.5 which is common around 20-30W power usage.

At 220V, and 20W power usage, maxAmps will be lower than 100mA and this can set Yokogawa 310E to 100mA limit. At 220V, and 23W power usage, maxAmps will be above 100mA and this can set Yokogawa 310E to 200mA limit.

As seen from our tests the first case can cause lower power being reported whereas the second case is likely to report accurate power. Also, people using Yokogawa 310E will be having an advantage as it has smaller current ranges and so this problem (reported power being lower) will be more visible on them as compared to say Yokogawa 310.

dmiskovic-NV commented 1 year ago

I'd rather keep 1x, to make us consistent with v2.1

once we get new PTD (based on Greg's unofficial build), let's move on to adjusting scaling Also, unofficial build is still not perfect, i had some issues with multichannel devices with it :/

arjunsuresh commented 1 year ago

Thank you @dmiskovic-NV . I'll make that change. It should be to change both 1.1 and 1.5 in the current PR to 1 right?

dmiskovic-NV commented 1 year ago

yes. resetting it to v2.1 (baseline)

s-idgunji commented 1 year ago

I'm assuming a power factor of 0.5 which is common around 20-30W power usage. We should not assume since this is reported in the logs. However, if you want to discuss this , we can do that next week. I note that the Inference WG co-chair has got meetings for emergency PRs. The Amp limit should come from ranging that is automated and we should not change that.

From: Arjun Suresh @.> Sent: Wednesday, February 8, 2023 11:38 AM To: mlcommons/power-dev @.> Cc: Sachin Idgunji @.>; Mention @.> Subject: Re: [mlcommons/power-dev] Fix for maxAmps setting (PR #284)

@s-idgunjihttps://github.com/s-idgunji I don't think we can assume that - for example our power meter is currently at crest factor 6. But my point is not just about the crest factor but when we know the power range where the problem is happening like below.

I'm assuming a power factor of 0.5 which is common around 20-30W power usage.

At 220V, and 20W power usage, maxAmps will be lower than 100mA and this can set Yokogawa 310E to 100mA limit. At 220V, and 23W power usage, maxAmps will be above 100mA and this can set Yokogawa 310E to 200mA limit.

As seen from our tests the first case can cause lower power being reported whereas the second case is likely to report accurate power. Also, people using Yokogawa 310E will be having an advantage as it has smaller current ranges and so this problem (reported power being lower) will be more visible on them as compared to say Yokogawa 310.

— Reply to this email directly, view it on GitHubhttps://github.com/mlcommons/power-dev/pull/284#issuecomment-1423141096, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANARCVPRJGBSVT2U3JCPZSTWWPY2NANCNFSM6AAAAAAUT46FZ4. You are receiving this because you were mentioned.Message ID: @.**@.>>

arjunsuresh commented 1 year ago

@s-idgunji I'm just telling that some submitters might take advantage of #280 - as this issue is not uniform for all the power ranges. For example, submitter 1 will have power reported as 21W and submitter 2 will have power reported as 25W - almost a 16% difference when the actual difference might be <5%.

For our submissions we are only benchmarking systems and so unless mandated we'll be submitting with crest factor 6 for low power devices.

Since closing the PRs is important I have now made the change as @dmiskovic-NV suggested. If the PR from Pablo is merged first then I can also update the checksums in this PR.

arjunsuresh commented 1 year ago

No longer needed. So closing