ltechkorea / training_results_v1.1-pre


[dlrm] dataset download, preprocessed data #14

Open dc0953 opened 3 years ago

dc0953 commented 3 years ago

1.2 Clone the reference implementation repository.

git clone https://github.com/facebookresearch/dlrm/
cd dlrm
git checkout mlperf

1.3 Build and run the reference docker image.

docker build -t dlrm_reference .
docker run -it --rm --network=host --ipc=host --shm-size=1g --ulimit memlock=-1 \
           --ulimit stack=67108864 --gpus=all  -v /data:/data dlrm_reference

1.4 Run the training script to obtain the preprocessed data. This process can take up to several days and needs a few TB of fast storage space. As a result, two files named "day_train.bin" and "day_test.bin" will be created.

After creating the preprocessed dataset, the script will start training with the reference implementation. This is clearly visible in the logs, e.g. the script prints "Finished training it" messages. The run can be safely interrupted with Ctrl+C, since we only need this script to produce the preprocessed data, not to complete a full training run.

python dlrm_s_pytorch.py --arch-sparse-feature-size=128 --arch-mlp-bot="13-512-256-128" \
       --arch-mlp-top="1024-1024-512-256-1" --max-ind-range=40000000 --data-generation=dataset \
       --data-set=terabyte --raw-data-file=/data/day --processed-data-file=/data/day --loss-function=bce \
       --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2048 --print-time \
       --test-freq=102400 --test-mini-batch-size=16384 --test-num-workers=16 --memory-map --mlperf-logging \
       --mlperf-auc-threshold=0.8025 --mlperf-bin-loader --mlperf-bin-shuffle \
       --mlperf-coalesce-sparse-grads --use-gpu
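
Once "Finished training it" messages start appearing, the preprocessed binaries should already be on disk, so the run can be interrupted. A minimal check before doing so; the exact output paths are an assumption here, based on the binaries landing in /data next to the --processed-data-file=/data/day prefix used above:

# Hypothetical sanity check: confirm the preprocessed binaries exist.
# Assumes they are written to /data (alongside --processed-data-file=/data/day).
from os import path

for name in ("day_train.bin", "day_test.bin"):
    p = path.join("/data", name)
    print(p, "->", "found" if path.exists(p) else "not created yet")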
dc0953 commented 3 years ago

Preprocessing error

python dlrm_s_pytorch.py --arch-sparse-feature-size=128 --arch-mlp-bot="13-512-256-128" \
       --arch-mlp-top="1024-1024-512-256-1" --max-ind-range=40000000 --data-generation=dataset \
       --data-set=terabyte --raw-data-file=/data/day --processed-data-file=/data/day --loss-function=bce \
       --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2048 --print-time \
       --test-freq=102400 --test-mini-batch-size=16384 --test-num-workers=16 --memory-map --mlperf-logging \
       --mlperf-auc-threshold=0.8025 --mlperf-bin-loader --mlperf-bin-shuffle \
       --use-gpu

Error message:

"ERROR: Criteo Terabyte Dataset path is invalid; please download from https://labs.criteo.com/2013/12/download-terabyte-click-logs")

The extension is supposed to be .gz, but this is the error raised when preprocessing is run on the decompressed files. The relevant check in the reference code:

# WARNING: The raw data consist of day_0.gz, ..., day_23.gz text files.
# Each line in the file is a sample, consisting of 13 continuous and
# 26 categorical features (an extra space indicates that feature is
# missing and will be interpreted as 0).
for i in range(days):
    datafile_i = datafile + "_" + str(i)  # + ".gz"
    if path.exists(str(datafile_i)):
        print("Reading data from path=%s" % (str(datafile_i)))
        # file day_<number>
        total_per_file_count = 0
        with open(str(datafile_i)) as f:
            for _ in f:
                total_per_file_count += 1
        total_per_file.append(total_per_file_count)
        total_count += total_per_file_count
    else:
        sys.exit("ERROR: Criteo Terabyte Dataset path is invalid; please download from https://labs.criteo.com/2013/12/download-terabyte-click-logs")
dc0953 commented 3 years ago

Preprocessing is in progress.