Open taran2210 opened 1 month ago
As discussed in the WG today, it would be great if MLCommons could find a way to share the preprocessed dataset with submitters. @arjunsuresh @pgmpablo157321
@taran2210 Can you please share the md5sum of the obtained day_23_sparse.npy?
@nathanw-mlc I do have the data in google drive. But do we need any extra permission to share the criteo preprocessed dataset with MLCommons members? This is their licensing.
@arjunsuresh md5sum for day_23_sparse.npy: d6c11a8cebb5cabcea6ff67646641f3f
> But do we need any extra permission to share the criteo preprocessed dataset with MLCommons members? This is their licensing.
This is a question for @swasson488
From what I can tell, we can provide our own public download, so long as we give appropriate credit to the original creator of the dataset, provide a link to the license, and indicate if we made any changes to the dataset. However, I don't know the legal details regarding the Non-Commercial requirement.
@nathanw-mlc Based on the license language, I'm comfortable with MLCommons sharing the pre-processed version of the data set with our members. We need to display the following text or include it with the files for download:
-
This is a preprocessed version of the Criteo 1TB Click Logs data set for MLPerf.
The data set is copyrighted by Criteo AI Lab. It was created by Criteo AI Lab and is shared with MLCommons members via the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. A full copy of the license is available here along with the full original data set:
https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf/
Criteo AI Lab makes no warranties, express or implied, about this data set.
-
Tagging @TheKanter for his awareness.
Thank you @swasson488
@nathanw-mlc I have shared the preprocessed dataset with you. Once downloaded please do confirm the md5sum
c46b7e31ec6f2f8768fa60bdfc0f6e40 day_23_sparse_multi_hot.npz
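For a file of this size, the checksum can also be confirmed from Python without loading the whole file into memory. The following is a minimal sketch, not part of any MLCommons tooling: the helper name `md5_of_file`, the chunk size, and the assumption that the file sits in the current directory are all illustrative. The expected hash is the one posted above.

```python
import hashlib
from pathlib import Path

# Expected hash as posted in this thread for day_23_sparse_multi_hot.npz.
EXPECTED_MD5 = "c46b7e31ec6f2f8768fa60bdfc0f6e40"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumption: the downloaded file is in the current working directory.
    target = Path("day_23_sparse_multi_hot.npz")
    if target.exists():
        actual = md5_of_file(target)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{target} not found")
```

Reading in chunks keeps memory use flat regardless of file size, which matters for a download of this scale.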
> I have shared the preprocessed dataset with you. Once downloaded please do confirm the md5sum
Uploading the data directly to the Inference Cloudflare R2 bucket now. Will run md5sum checks when complete.
> We need to display the following text or include it with the files for download
I've included this file in the directory: https://inference.mlcommons-storage.org/dlrm_preprocessed/README.txt
Thank you, Nathan. For some reason, the link is getting flagged by our internal security (requested a white-list exception). But if the file is not too large, can you possibly paste the text here in a comment?
Hey @keithachorn-intel
The text is just the text provided by @swasson488:
This is a preprocessed version of the Criteo 1TB Click Logs data set for MLPerf.
The data set is copyrighted by Criteo AI Lab. It was created by Criteo AI Lab and is shared with MLCommons members via the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. A full copy of the license is available here along with the full original data set:
https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf/
Criteo AI Lab makes no warranties, express or implied, about this data set.
The text will be automatically downloaded along with the data, so users won't have to view it at that link.
Gotcha. I thought the link included the instructions for accessing the processed dataset. We'll await your upload/checksum updates. Thanks!
Hi @nathanw-mlc . I know you're working on this and the size of the dataset will require time. But for our internal planning purposes, do you have an ETA for when the processed dataset will be accessible? Thanks!
To run Rclone on Windows, you can download the executable here. To install Rclone on Linux/macOS/BSD systems, run:
sudo -v ; curl https://rclone.org/install.sh | sudo bash
Once Rclone is installed, run the following command to authenticate with the bucket:
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
You can then navigate in the terminal to your desired download directory and run the following command to download the preprocessed dataset:
rclone copy mlc-inference:mlcommons-inference-wg-public/dlrm_preprocessed ./ -P
Checksum matches; all good to go
Thanks a lot @nathanw-mlc . The total download is ~150GB. The preprocessing required ~6.4TB disk space :)
We have added this in CM too.
cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc -j
INFO:root:* cm run script "get preprocessed dataset criteo _multihot _mlc"
INFO:root: ! load /home/arjun/CM/repos/local/cache/6c50626de3c44bb5/cm-cached-state.json
INFO:root:{
  "return": 0,
  "env": {
    "CM_DATASET_CRITEO_MULTIHOT": "yes",
    "CM_DATASET_PREPROCESSED_CRITEO_FROM_MLC": "yes",
    "CM_DATASET_PREPROCESSED_PATH": "/home/arjun/CM/repos/local/cache/e8cd121efb9946ee/dlrm_preprocessed"
  },
  "new_env": {
    "CM_DATASET_CRITEO_MULTIHOT": "yes",
    "CM_DATASET_PREPROCESSED_CRITEO_FROM_MLC": "yes",
    "CM_DATASET_PREPROCESSED_PATH": "/home/arjun/CM/repos/local/cache/e8cd121efb9946ee/dlrm_preprocessed"
  },
  "state": {},
  "new_state": {},
  "deps": []
}
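A downstream script can pick up the dataset location from the `CM_DATASET_PREPROCESSED_PATH` key shown above. Here is a minimal sketch, assuming the JSON object has already been separated from the `INFO:root:` log prefixes and is available as a string. The helper name `preprocessed_path` is hypothetical and not part of CM, and treating a non-zero `return` value as a failure is an assumption based on the output above.

```python
import json

# Trimmed copy of the JSON printed above. Capturing it from the CM log
# output is deliberately left out of this sketch.
SAMPLE_OUTPUT = """
{
  "return": 0,
  "new_env": {
    "CM_DATASET_CRITEO_MULTIHOT": "yes",
    "CM_DATASET_PREPROCESSED_CRITEO_FROM_MLC": "yes",
    "CM_DATASET_PREPROCESSED_PATH": "/home/arjun/CM/repos/local/cache/e8cd121efb9946ee/dlrm_preprocessed"
  }
}
"""


def preprocessed_path(cm_json_text):
    """Extract the preprocessed dataset directory from CM's JSON output."""
    result = json.loads(cm_json_text)
    # Assumption: "return": 0 signals success, anything else is an error.
    if result.get("return", 1) != 0:
        raise RuntimeError(f"CM reported a failure: {result}")
    return result["new_env"]["CM_DATASET_PREPROCESSED_PATH"]


if __name__ == "__main__":
    print(preprocessed_path(SAMPLE_OUTPUT))
```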
Great, can someone update the DLRMv2 README?
> cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc -j
When running the command above, I get the following error:
$>cm pull repo mlcommons@ck
=========================================================================
Warning: mlcommons@ck was automatically changed to mlcommons@cm4mlops.
If you want to use older mlcommons@ck repository, use branch or checkout.
=========================================================================
=======================================================
Alias: mlcommons@cm4mlops
URL: https://github.com/mlcommons/cm4mlops
Local path: /root/CM/repos/mlcommons@cm4mlops
git pull
Already up to date.
CM alias for this repository: mlcommons@cm4mlops
=======================================================
Reindexing all CM artifacts. Can take some time ...
Took 0.5 sec.
$>cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc -j
INFO:root:* cm run script "get preprocessed dataset criteo _multihot _mlc"
CM error: no scripts were found with above tags and variations
variation tags ['multihot', 'mlc'] are not matching for the found script get-preprocessed-dataset-criteo with variations dict_keys(['1', '50', 'full', 'validation', 'fake', 'multihot']) !
Hi @keithachorn-intel
We no longer keep the CM scripts in the ck repository. pip install cm4mlops is the recommended way to get the CM scripts. Please run:
cm rm repo mlcommons@ck -f
pip install cm4mlops
Following the step at https://github.com/mlcommons/training/blob/3a6a379305e6ef0a1c34461c823e9c99d52c1021/recommendation_v2/torchrec_dlrm/scripts/process_Criteo_1TB_Click_Logs_dataset.sh#L42, the day_23_sparse.npy generated with torchrec==0.3.2 differed from the expected file, resulting in a roc_auc of 61.64%.
expected row_0: array([[ 10540786, 197, 34, ..., 34, 2, 3]
resulting row_0: array([[ 449831406, 456128031, 780871217, ..., 374479166, 809724924, -1218975401],
The subsequent day_23_sparse_multi_hot.npz is therefore also incorrect.
day_23_dense.npy and day_23_labels.npy have the correct md5, as listed here: https://github.com/mlcommons/training/blob/3a6a379305e6ef0a1c34461c823e9c99d52c1021/recommendation_v2/torchrec_dlrm/md5sums_preprocessed_criteo_click_logs_dataset.txt#L70
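To catch this kind of partial mismatch early, every preprocessed file can be checked against a checksum listing in one pass. The sketch below assumes the listing uses the usual `md5sum` layout of `<hash>  <filename>` per line, which is an assumption about the linked file rather than something confirmed in this thread. The function names are illustrative only.

```python
import hashlib
from pathlib import Path


def parse_md5_manifest(text):
    """Parse md5sum-style lines ("<hash>  <filename>") into {filename: hash}."""
    expected = {}
    for line in text.splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2 or parts[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        digest, name = parts
        # md5sum prefixes the name with "*" for files read in binary mode.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory, expected):
    """Return {filename: status}; status is "ok", "mismatch" or "missing"."""
    report = {}
    for name, digest in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            report[name] = "missing"
        elif file_md5(path) == digest:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report
```

With the case described above, such a report would be expected to show `day_23_dense.npy` and `day_23_labels.npy` as `ok` and `day_23_sparse.npy` as `mismatch`, provided the listing contains entries for all three.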