Closed georgelyuan closed 3 years ago
+1. An alternative would be to upload a tgz to a write-only blob storage that mlperf folk would merge into the repo and give access to the repo only once the tgz is merged.
Generally, we should watch ROI when increasing burden. Though Guenther's plan sounds like it might actually be easier than the current approach, but would be best to do it under MLC CLA -- and make sure there is acknowledgement of submission as being a contribution under CLA.
On Tue, Sep 22, 2020 at 8:51 AM Guenther Schmuelling < notifications@github.com> wrote:
+1. An alternative would be to upload a tgz to a write-only blob storage that mlperf folk would merge into the repo and give access to the repo only once the tgz is merged.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mlperf/policies/issues/61#issuecomment-696809799, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIIVUHKSES56UQBHS6F4HO3SHDBZFANCNFSM4RU7NSFQ .
Sure, personally I'm not hung up on how we do it, as long as we address the one-sided visibility issue. I'm sure we can come up with something reasonable.
AIs going forward, discuss in training, amend rules, change infrastructure.
Ashwin suggests the following:
Ashwin will investigate whether PRs from a private submitters repo can be sent to the private MLC submitters repo and report back.
We would like feedback from training, HPC, mobile on this.
To clarify step 4, if the PR is being made from a private fork to another private fork, I don't think the forks have to be made public, as long as the chair has access to both forks, but I guess that needs to be verified. Perhaps step 4 should be reworded as:
I kinda like the old approach, when you can submit your results as you generate them in the window between the random seeds being distributed and the submission deadline. It gives me a warm and fuzzy feeling having PRs piling up in the main submission repo. I don't mind in the slightest that other submitters can see them early.
Whatever else we decide to allow, can we keep the old approach please?
It allows companies who have no interest in submitting early access to results.
We already guard against that somewhat, by asking non-submitters to declare themselves before the repo gets properly populated (e.g. 1 week). Of course, a submitter can still declare they intend to submit until the last minute and then not to, or submit and then withdraw. Were the review committee to suspect dishonest behaviour, they could take a vote to disqualify them from submitting to the next submission round.
The proposal is just to allow delaying the real PR by providing a hash in the meantime. If you send the PR as you would have in the past it should be the same imo.
I think it's fine if one wants to directly submit their PR to the submission repo. Note however that this submission repo is accessible by all potential submitters, including those who in the end may not actually submit. One of the goals of this proposal is to prevent such 'no-show' submitters from gaining access to the results before the general public. In order to do that, I think the results chairs would have to merge submissions from the original submission repo and any submitter forks to another fork that is access-restricted to real submitters.
So to clarify what I have in mind:
draft_submission_repo - public repo that submitters can fork to push their results, or directly submit to if they so choose. used for CLA checks, uploading calibration docs. accessible by all potential submitters.
[company]_submission_repo - private forks of the draft submission repo that various submitters will use to make their submission. owned by the submitter, who will give access to the results chairs by submission deadline.
final_submission_repo - private fork of the draft submission repo that will contain merges of all submissions in the draft_submission_repo and
After the final_submission_repo has been populated, the draft_submission_repo can probably be deleted.
I found that Github does NOT allow these operations:
@georgelyuan Looks like the steps need to be:
For the submission repo we never have a public repo. What gets published at the end is a copy into a new repo. This is because we don't want to have the issues filed during review go public and we need to avoid that revoked submissions (that are still in that repo) become public.
Not totally sure what happens between step 1 and step 2 above - how does the submitters repo gets PRed back into our private repo ? I think as long submitters repo is private the PR will not work cross org.
One could make a git patch and upload that to some write only blob storage at the deadline and we'd just apply that patch to the repo in mlcommons.
@guschmue Aha, I did not realize PRs do not work between private forks. Your git patch seems like a feasible solution. What is an example of a write only blob storage space?
aws, azure, gcp have blob storage and you can create write only access (I know for sure on azure but aws and gcp will do the same)
@georgelyuan please work with Guenther to have a workable solution. Please note it is optional for the submitters. The old way of submitting is fine for v1.0
Revised proposal:
By the submission deadline, submitter will submit 4 things to the results chair to show Proof of Work:
As soon as the deadline passes, the submission repository goes private. Access will be removed for everyone EXCEPT the following parties:
For #3 above, verification will be performed by the results chair using a script (NVIDIA can contribute to this):
add_submission.py -u <submission URL> -p <password> -c <submitter company> -t <submission repo directory> -g <submitter github handles> -h <sha1 hash>
With the process outlined, the results chair will push submissions made using the new secure submission process to minimize delays. The competitive intelligence gap raised by the issue now no longer exists since no participant can now view secure submissions before the deadline (submission tarballs are encrypted). Once the deadline has passed, only verified submitters will be able to view each other’s submissions.
In an alternative post-deadline process, the chair can give access to the submitter and have the submitter push the results in a timely manner. At this point, the submitter is committed to submitting results. If they do not for whatever reason, then the results chair will do so in their stead, since by that point the submitter will have access to other submitted submissions. Once the submitter pushes the results, the results chair will still need to verify that the submission matches the decrypted extracted tarball.
WG: submission rules will be updated with this optional process. PR review next meeting.
@guschmue @georgelyuan are working on the script for the submission encryption. Pending script check-in.
submitter can pick any storage but the access has to be public.
@georgelyuan PR reminder :)
Should I just put it in the MLCommons shared drive? Not sure where to put the script
I have a bunch of submission related scripts here: https://github.com/mlcommons/inference/tree/master/tools/submission you could add it there.
Right my only objection is that the script is not inference-specific.. Do you think that's okay?
yeah, no worry - we can come up with something better once training folks ask for it.
@guschmue do you have a mlcommons email? Or should I instruct folks to use your microsoft email?
good question, maybe we should create a new address for this. Let me find somebody to create one.
any linked PR for this? @georgelyuan
pending on me - I'm waiting for Peter to create mail alias for it
@petermattson
@georgelyuan ... you can use submissions@mlcommons.org
D'oh looks like I'm missing the mlcommons CLA or something? I just submitted my form.
Ah I see Guenther had sent me an invitation back in January but it expired :( Can you send it again? Thanks!
https://github.com/mlcommons/inference/pull/846 Hopefully I did this right. It's been a while..
@georgelyuan is there any rule text update related to this? Thanks for the script.
The current submission process is inadequate in two ways: 1) It allows submitters who have not yet submitted to access the submission repo and preview other submitters' results. This gives potential submitters an unfair competitive advantage if they wait until the last minute before deciding to submit. Unfairness is further compounded by the submission deadline not being friendly for all time zones. Companies in Asia typically submit the night before. Also, it is difficult for companies that coordinate with worldwide partners to synchronize submission time, putting them at a further disadvantage . 2) It allows companies who have no interest in submitting early access to results.
Proposal for next round:
@DilipSequeira @petermattson @TheKanter