Address 1.1 review period issues

emizan76 commented 2 years ago

The 1.1 review period was problematic. From the perspective of someone who has worked on the logger, we were too lenient and a lot of submissions that failed compliance (single-file, number of files, RCP) were eventually accepted. Here I am proposing two changes so that such problems are minimized in the future. These changes are orthogonal, so we can have either or both.

(1) Establish an automatic way of submission. This will likely have minimal to impact to our infra. Submitters will be submitting an encrypted tarball, probably in a GCS bucket plus the decryption key. They will be able to do that way ahead of the submission deadline. Submission may be made through a web-page. When the submitter clicks a "submit" button, a backend script will decrypt, untar and run the checkers on the submission package. If there is an error the whole package will be rejected. This mechanism will likely be implemented by a contractor, so we will not need to be actively involved.

(2) Instead of addressing submission failures post-submission, we will be doing this pre-submission. Ideally the logger should be frozen 1 month prior to submission and the logger tutorials will happen 3-4 week prior to submission. The last 2 times they happened 2 weeks before submission, and there were changes added to the logger very close to the deadline. After the tutorials, the submitters will be allowed to make questions on log files / dirs that fail the checkers. To hide the actual submission, we may need to provide them a script that scrambles the logger, e.g. it changes the timestamps or hides info such as number of accelerators. After the deadline we will not provide any help unless forced by the committee. The burden on us will be to have even more work prior to submission and be able to see other submitters submissions well before the deadline.

These proposals need to pass from the training committee, and I have my doubts that submitters will accept revealing their logs pre-submission. But I would like to know what the infra group thinks.

shangw-nvidia commented 2 years ago

I like (1). We can stitch the encryption/decryption engine, package_checker, rcp_checker and result_summarizer in a single client. When the user click "submit", the client will run the checkers and result_summarizer locally; if all checks pass, the client will ask user whether he/she wants to submit the result; if the user click on Yes, the client will send the result to our server (we will have to find a place to host this server) which then validate all checks remotely again; if all checks pass, we can mark this submitter "submitted" and store this submission. The submissions will be stored on the server and will not become public until the deadline is passed.

For (2), I would say many last minute changes are related to RCPs. People like to generate/release RCPs at last second. I would propose to set a separate deadline for RCPs (which can be one month before the submission deadline), and the hyper-parameter borrowing happens pre-submission instead of post-submission. After the RCP deadline but before the submission deadline, the RCPs will be fixed and the submissions will be checked against the existing RCP.

I have an alternative proposal which is to "abolish" the submission deadlines completely. Each submitters will submit results at any time they want (and coordinate with the training and logging WG to address issues along the way). Once a submission is made and accepted, we can publish news/blogs about it (e.g., Google pushed resnet training perf from A to B using methods/techniques W and X, or NVIDIA pushed bert training perf from X to Y using methods/techniques Y and Z, etc.). Most likely, not many submitters will submit at the same time, so this would make the training benchmarks less of a "competition" but more of a "regression" that gradually push the boundary of large-scale training forward.

xyhuang commented 2 years ago

having an automated compliance checking infra would be helpful to the submission process and good for enforcing/promoting good practices. details might need to be discussed as of how this could be implemented. potentially learn from inference group.
freezing log early is doable (with the exception of RCP, maybe). might be up to the training committee how this is enforced.

mlcommons / logging

Address 1.1 review period issues #203