mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

DLRM v2 Preprocessed Multihot Criteo day_23 Dataset Accuracy Drop #1784

Open taran2210 opened 1 month ago

taran2210 commented 1 month ago

Running the preprocessing step at https://github.com/mlcommons/training/blob/3a6a379305e6ef0a1c34461c823e9c99d52c1021/recommendation_v2/torchrec_dlrm/scripts/process_Criteo_1TB_Click_Logs_dataset.sh#L42 with torchrec==0.3.2 generated a day_23_sparse.npy that differs from the expected file, resulting in a roc_auc of 61.64%.

expected row_0: array([[ 10540786, 197, 34, ..., 34, 2, 3]
resulting row_0: array([[ 449831406, 456128031, 780871217, ..., 374479166, 809724924, -1218975401],

The subsequent day_23_sparse_multi_hot.npz is therefore also incorrect

day_23_dense.npy and day_23_labels.npy have the correct md5 sums, matching those listed at https://github.com/mlcommons/training/blob/3a6a379305e6ef0a1c34461c823e9c99d52c1021/recommendation_v2/torchrec_dlrm/md5sums_preprocessed_criteo_click_logs_dataset.txt#L70

attafosu commented 1 month ago

As discussed in the WG today, it would be great if MLCommons could find a way to share the preprocessed dataset with submitters. @arjunsuresh @pgmpablo157321

arjunsuresh commented 1 month ago

@taran2210 Can you please share the md5sum of the obtained day_23_sparse.npy?

@nathanw-mlc I do have the data in Google Drive. But do we need any extra permission to share the criteo preprocessed dataset with MLCommons members? This is their licensing.

taran2210 commented 1 month ago

@arjunsuresh md5sum for day_23_sparse.npy: d6c11a8cebb5cabcea6ff67646641f3f

nathanw-mlc commented 1 month ago

> But do we need any extra permission to share the criteo preprocessed dataset with MLCommons members? This is their licensing.

This is a question for @swasson488

From what I can tell, we can provide our own public download, so long as we give appropriate credit to the original creator of the dataset, provide a link to the license, and indicate if we made any changes to the dataset. However, I don't know the legal details regarding the Non-Commercial requirement.

swasson488 commented 1 month ago

@nathanw-mlc Based on the license language, I'm comfortable with MLCommons sharing the pre-processed version of the data set with our members. We need to display the following text or include it with the files for download:

-

This is a preprocessed version of the Criteo 1TB Click Logs data set for MLPerf.

The data set is copyrighted by Criteo AI Lab. It was created by Criteo AI Lab and is shared with MLCommons members via the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. A full copy of the license is available here along with the full original data set:

https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf/

Criteo AI Lab makes no warranties, express or implied, about this data set.

-

Tagging @TheKanter for his awareness.

arjunsuresh commented 1 month ago

Thank you @swasson488

@nathanw-mlc I have shared the preprocessed dataset with you. Once downloaded please do confirm the md5sum

c46b7e31ec6f2f8768fa60bdfc0f6e40  day_23_sparse_multi_hot.npz

nathanw-mlc commented 1 month ago

> I have shared the preprocessed dataset with you. Once downloaded please do confirm the md5sum

Uploading the data directly to the Inference Cloudflare R2 bucket now. Will run md5sum checks when complete.

> We need to display the following text or include it with the files for download

I've included this file in the directory: https://inference.mlcommons-storage.org/dlrm_preprocessed/README.txt

keithachorn-intel commented 1 month ago

Thank you, Nathan. For some reason, the link is getting flagged by our internal security (requested a white-list exception). But if the file is not too large, can you possibly paste the text here in a comment?

nathanw-mlc commented 1 month ago

Hey @keithachorn-intel

The text is just the text provided by @swasson488:

This is a preprocessed version of the Criteo 1TB Click Logs data set for MLPerf.

The data set is copyrighted by Criteo AI Lab. It was created by Criteo AI Lab and is shared with MLCommons members via the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. A full copy of the license is available here along with the full original data set:

https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf/

Criteo AI Lab makes no warranties, express or implied, about this data set.

The text will be automatically downloaded along with the data, so users won't have to view it at that link.

keithachorn-intel commented 1 month ago

Gotcha. I thought the link included the instructions for accessing the processed dataset. We'll await your upload/checksum updates. Thanks!

keithachorn-intel commented 1 month ago

Hi @nathanw-mlc . I know you're working on this and the size of the dataset will require time. But for our internal planning purposes, do you have an ETA for when the processed dataset will be accessible? Thanks!

nathanw-mlc commented 1 month ago

To run Rclone on Windows, you can download the executable here. To install Rclone on Linux/macOS/BSD systems, run:

sudo -v ; curl https://rclone.org/install.sh | sudo bash

Once Rclone is installed, run the following command to authenticate with the bucket:

rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

You can then navigate in the terminal to your desired download directory and run the following command to download the preprocessed dataset:

rclone copy mlc-inference:mlcommons-inference-wg-public/dlrm_preprocessed ./ -P

nathanw-mlc commented 1 month ago

Checksum matches; all good to go
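For anyone repeating the checksum comparison after their own download, a minimal Python sketch follows. It hashes the file in chunks so that a large archive is never loaded into memory at once. The function names `md5_of_file` and `matches_expected` are illustrative and not part of any MLCommons tooling, and the file location in the usage note is an assumption about where the download landed.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Assuming the file sits in the current directory, `matches_expected("day_23_sparse_multi_hot.npz", "c46b7e31ec6f2f8768fa60bdfc0f6e40")` should return `True` for the copy shared in this thread.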

arjunsuresh commented 1 month ago

Thanks a lot @nathanw-mlc . The total download is ~150GB. The preprocessing required ~6.4TB disk space :)
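Given the ~150GB download size, it may be worth checking free disk space before starting the copy. The sketch below uses only the Python standard library. The `has_room` helper and the 10 GB safety margin are assumptions made for illustration, not figures from this thread.

```python
import shutil

DOWNLOAD_SIZE_GB = 150  # approximate download size reported in this thread


def has_room(path=".", required_gb=DOWNLOAD_SIZE_GB, margin_gb=10):
    """Return True if the filesystem holding `path` has enough free space.

    margin_gb is a hypothetical safety buffer on top of the download size.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb + margin_gb
```

Calling `has_room("/path/to/download/dir")` before running rclone gives an early warning instead of a failed transfer partway through.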

arjunsuresh commented 1 month ago

We have added this to CM too.

cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc  -j
INFO:root:* cm run script "get preprocessed dataset criteo _multihot _mlc"
INFO:root:     ! load /home/arjun/CM/repos/local/cache/6c50626de3c44bb5/cm-cached-state.json
INFO:root:{
  "return": 0,
  "env": {
    "CM_DATASET_CRITEO_MULTIHOT": "yes",
    "CM_DATASET_PREPROCESSED_CRITEO_FROM_MLC": "yes",
    "CM_DATASET_PREPROCESSED_PATH": "/home/arjun/CM/repos/local/cache/e8cd121efb9946ee/dlrm_preprocessed"
  },
  "new_env": {
    "CM_DATASET_CRITEO_MULTIHOT": "yes",
    "CM_DATASET_PREPROCESSED_CRITEO_FROM_MLC": "yes",
    "CM_DATASET_PREPROCESSED_PATH": "/home/arjun/CM/repos/local/cache/e8cd121efb9946ee/dlrm_preprocessed"
  },
  "state": {},
  "new_state": {},
  "deps": []
}
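If the JSON result above is captured by a script, the dataset directory can be read from it roughly as follows. This is a sketch that assumes the JSON has already been separated from the log prefixes and parsed into a dict. The helper name `preprocessed_path` is hypothetical, and the `error` key used in the failure message is an assumption about how CM reports failures.

```python
import json


def preprocessed_path(cm_result):
    """Return the dataset directory from a parsed CM result, or raise on failure."""
    if cm_result.get("return", 1) != 0:
        # The "error" key is an assumption; .get() keeps this safe if it is absent.
        raise RuntimeError(f"CM script failed: {cm_result.get('error', 'unknown error')}")
    return cm_result["new_env"]["CM_DATASET_PREPROCESSED_PATH"]


# Trimmed copy of the output shown in this thread.
sample = json.loads("""
{
  "return": 0,
  "new_env": {
    "CM_DATASET_CRITEO_MULTIHOT": "yes",
    "CM_DATASET_PREPROCESSED_CRITEO_FROM_MLC": "yes",
    "CM_DATASET_PREPROCESSED_PATH": "/home/arjun/CM/repos/local/cache/e8cd121efb9946ee/dlrm_preprocessed"
  }
}
""")

print(preprocessed_path(sample))
```

The cache path differs from machine to machine, so reading it from the result is safer than hard-coding it.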

nathanw-mlc commented 1 month ago

Great. Can someone update the DLRMv2 README?

keithachorn-intel commented 1 month ago

> cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc  -j

When running the command above, I get the following error:

$>cm pull repo mlcommons@ck
=========================================================================
Warning: mlcommons@ck was automatically changed to mlcommons@cm4mlops.
If you want to use older mlcommons@ck repository, use branch or checkout.
=========================================================================
=======================================================
Alias:    mlcommons@cm4mlops
URL:      https://github.com/mlcommons/cm4mlops

Local path: /root/CM/repos/mlcommons@cm4mlops

git pull

Already up to date.

CM alias for this repository: mlcommons@cm4mlops
=======================================================

Reindexing all CM artifacts. Can take some time ...
Took 0.5 sec.
$>cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc -j
INFO:root:* cm run script "get preprocessed dataset criteo _multihot _mlc"

CM error: no scripts were found with above tags and variations

variation tags ['multihot', 'mlc'] are not matching for the found script get-preprocessed-dataset-criteo with variations dict_keys(['1', '50', 'full', 'validation', 'fake', 'multihot'])
!
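The error message can be read as a set difference between the requested variations and those the locally installed script knows about. The sketch below only illustrates that reading and is not CM's actual matching code; `missing_variations` is a hypothetical helper.

```python
def missing_variations(requested, available):
    """Return the requested variation tags that the script does not define."""
    known = set(available)
    return [tag for tag in requested if tag not in known]


# Values copied from the error message above.
requested = ["multihot", "mlc"]
available = ["1", "50", "full", "validation", "fake", "multihot"]

print(missing_variations(requested, available))  # ['mlc']
```

Only `mlc` is missing, which suggests the installed copy of the script predates the variation added for the MLCommons-hosted dataset.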

arjunsuresh commented 1 month ago

Hi @keithachorn-intel

The CM scripts no longer live in the ck repository; pip install cm4mlops is now the recommended way to get them. Please run:

cm rm repo mlcommons@ck -f
pip install cm4mlops