pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Regression in text encoding #107363

Open vicilliar opened 1 year ago

vicilliar commented 1 year ago

🐛 Describe the bug

There is a significant speed degradation in encoding text (using model ViT-B-32/laion2b_s34b_b79k) with multiple threads when upgrading PyTorch versions. I have been encoding both images and text with open CLIP models, and have found that when upgrading from Torch 1.12.1 to 1.13.0, encoding latency increases significantly when using multiple threads. Here is sample data collected with 5 threads:

Text Encoding mean latency comparison (5 threads)

Torch version  |  1.11.0   |  1.12.1  |  1.13.0
Latency (s)    |  0.03681  | 0.03786  | 0.05414

Text Encoding Requests per second comparison (5 threads)

Torch version  |  1.11.0  |  1.12.1  |  1.13.0
RPS            |  123.76  |  119.96  |  86.82

This degradation does not occur when encoding with a single thread or when encoding images. Does anyone have an explanation for why performance degrades after upgrading?
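For context, a condensed sketch of what "encoding text with multiple threads" means here (my own minimal version, not the full benchmark script below; it assumes `open_clip_torch` is installed and a CUDA device is available, and it times `encode_text` the same way the script does, without explicit CUDA synchronization):

    # Minimal sketch: time open_clip text encoding from several Python threads.
    # Condensed from the repro script below; assumes open_clip_torch and CUDA.
    import threading, time
    import torch
    from open_clip import create_model_and_transforms, get_tokenizer

    model, _, _ = create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k", device="cuda")
    tokenizer = get_tokenizer("ViT-B-32")
    latencies = []

    def worker(n):
        tokens = tokenizer("hello world").to("cuda")
        for _ in range(n):
            start = time.time()
            with torch.no_grad(), torch.cuda.amp.autocast():
                model.encode_text(tokens)
            latencies.append(time.time() - start)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker, args=(100,)) for _ in range(5)]
    wall = time.time()
    [t.start() for t in threads]
    [t.join() for t in threads]
    wall = time.time() - wall
    print(f"mean latency: {sum(latencies) / len(latencies):.5f} s, RPS: {len(latencies) / wall:.1f}")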

To recreate:

  1. Start machine with torch 1.12.1 installed

    pip3 install --no-cache-dir torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113 --upgrade
  2. Start another machine with torch 1.13.0 installed

    pip3 install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117 --upgrade
  3. Install other requirements on both machines

    pip install open_clip_torch==2.18.0 validators cython matplotlib
    pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
  4. Run the following script, which encodes entries from the COCO dataset, on both machines:

    
    from PIL import Image
    from open_clip import create_model_and_transforms, get_tokenizer
    import torch
    from typing import List, Tuple
    import numpy as np
    import os
    import time
    import validators
    import requests
    from pycocotools.coco import COCO
    import random
    import zipfile
    import urllib.request
    from tqdm import tqdm
    import threading
    import argparse

print("torch version:", torch.version) print("CUDA version:", torch.version.cuda)

Test 1

TARGET_MODEL = "ViT-B-32" TARGET_PRETRAINED = "laion2b_s34b_b79k"

DEVICE = "cuda"

Configurable parameters

parser = argparse.ArgumentParser(description='Pytorch Performance Test RPS Script') parser.add_argument('--Threads', help='The number of threads', default = 5, type = int) parser.add_argument('--Requests', help='The number of requests', default = 100, type = int) args = parser.parse_args() NUM_THREADS = args.Threads NUM_REQUESTS = args.Requests

model, _, transform = create_model_and_transforms(model_name=TARGET_MODEL, pretrained=TARGET_PRETRAINED, device=DEVICE) tokenizer = get_tokenizer(TARGET_MODEL)

def inference_time_on_image(image_path: str) -> float: image = load_image_from_path(image_path) processed_image = transform(image).unsqueeze(0).to(DEVICE) start = time.time() with torch.no_grad(): if DEVICE.startswith("cuda"): with torch.cuda.amp.autocast(): image_features = model.encode_image(processed_image) else: image_features = model.encode_image(processed_image) elapsed_time = time.time() - start return elapsed_time

def load_image_from_path(image_path: str): """Loads an image into PIL from a string path that is either local or a url Args: image_path (str): Local or remote path to image. Returns: ImageType: In-memory PIL image. """ if os.path.isfile(image_path): img = Image.open(image_path) elif validators.url(image_path): with requests.get(image_path, stream=True) as resp: img = Image.open(resp.raw) return img

def inference_time_on_text(text: str) -> float: processed_text = tokenizer(text).to(DEVICE) start = time.time() with torch.no_grad(): if DEVICE.startswith("cuda"): with torch.cuda.amp.autocast(): text_features = model.encode_text(processed_text) else: text_features = model.encode_text(processed_text) elapsed_time = time.time() - start return elapsed_time

class RequestThread(threading.Thread): def init(self, queries, type):

Should be given list of queries (generated beforehand)

    super().__init__()
    self.queries = queries
    self.latencies = []
    self.type = type

def run(self):
    for q in self.queries:
        try:
            if self.type == "text":
                self.latencies.append(inference_time_on_text(q))
            elif self.type == "image":
                self.latencies.append(inference_time_on_image(q))
            else:
                raise Exception(f"Invalid request type: {self.type}")
        except Exception as e:
            print(f"Error ({e})")

def warm_upcalls(): = inference_time_ontext("hello world") = inference_time_on_image( "https://raw.githubusercontent.com/marqo-ai/marqo/mainline/examples/ImageSearchGuide/data/image1.jpg")

def data_processed(time_list: List[float]) -> Tuple: sample_size = len(time_list)

time_list_copy = np.copy(time_list)
mean = np.mean(time_list_copy)
p50 = np.percentile(time_list_copy, 50)
p90 = np.percentile(time_list_copy, 90)
p99 = np.percentile(time_list_copy, 99)
return sample_size, mean, p50, p90, p99

def download_util( url: str, cache_dir: str = "./", ): buffer_size = 8192 if not cache_dir: cache_dir = os.path.expanduser(ModelCache.clip_cache_path) os.makedirs(cache_dir, exist_ok=True) filename = os.path.basename(url)

download_target = os.path.join(cache_dir, filename)

if os.path.isfile(download_target):
    print(f"File already exists at {download_target}. Skipping download.")
    return download_target

print(f"About to start downloading annotations from url: {url}")
with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
    with tqdm(total=int(source.headers.get("Content-Length")), ncols=80, unit='iB', unit_scale=True) as loop:
        while True:
            buffer = source.read(buffer_size)
            if not buffer:
                break

            output.write(buffer)
            loop.update(len(buffer))
print(f"Finished downloading annotations from url: {url}")
return download_target

#######################################################

LOADING COCO DATASET

#######################################################

tmppath = './tmp/' dataDir='./tmp' dataType='train2014' annFile='{}/annotations/instances{}.json'.format(dataDir,dataType) capsFile = '{}/annotations/captions_{}.json'.format(dataDir,dataType)

annotations_url = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip" annotations_extract_path = f'{tmp_path}annotations' if os.path.exists(annotations_extract_path): print('Annotations already exist on disk. Skipping.') else: annotations_zip = download_util(annotations_url, tmp_path) with zipfile.ZipFile(annotations_zip, 'r') as zf: zf.extractall(tmp_path)

print('Annotations downloaded and extracted')

print("Loading COCO Dataset") coco=COCO(annFile) print("Loading COCO Caps File") coco_caps = COCO(capsFile) print("Loading COCO ANN File") coco_anns = COCO(annFile) img_id_list = coco.getImgIds()

def rps_test_for_type(type: str): if type not in ["text", "image"]: raise Exception("Invalid type")

# Create the threads
threads: List[RequestThread] = list()
for i in range(NUM_THREADS):
    threads.append(RequestThread(queries[type][i], type))

print(f'Starting threads for type {type}')
start = time.time()
[t.start() for t in threads]
[t.join() for t in threads]
end = time.time()
elapsed = end - start

total_requests = sum(len(t.latencies) for t in threads)
latencies = [l for t in threads for l in t.latencies]

sample_size, mean, p50, p90, p99 = data_processed(latencies)
print("-----------------------------------------")
print(f"RESULTS FOR {type}")
print(f"TORCH: {torch.__version__}, CUDA: {torch.version.cuda}")
print(f"MODEL: {TARGET_MODEL}, PRETRAINED: {TARGET_PRETRAINED}, DEVICE: {DEVICE}")
print(f"NUM_REQUESTS: {NUM_REQUESTS}, NUM_THREADS: {NUM_THREADS}")
print("Number of queries inferenced:", sample_size)
print("---RPS---")
print(total_requests / elapsed)
print("---LATENCY---")
print("Total time taken (s):", elapsed)
print("Mean (s):", mean)
print("p50 (s):", p50)
print("p90 (s):", p90)
print("p99 (s):", p99)

def random_query(): """ Get a random ID and get the text and image for that ID """ img_id = random.choice(img_id_list) text = coco_caps.loadAnns(coco_caps.getAnnIds(img_id))[0]["caption"] image = coco_anns.loadImgs(img_id)[0]["coco_url"] return text, image

print('Generating random queries') queries = { "text": [], "image": [] } for i in tqdm(range(NUM_THREADS)): thread_text_queries = [] thread_imagequeries = [] for in tqdm(range(NUM_REQUESTS)): text, image = random_query() thread_text_queries.append(text) thread_image_queries.append(image) print(f"Generated (FOR THREAD {i}): {len(thread_text_queries)} texts and {len(thread_image_queries)} images.") queries["text"].append(thread_text_queries) queries["image"].append(thread_image_queries) print('Done generating queries')

def main(): warm_up_calls()

rps_test_for_type("text")
rps_test_for_type("image")

if name == "main": main()


Observe that the latency is higher on the newer Torch version. You can change the `--Threads` and `--Requests` args to observe the effect of concurrency on latency, as shown below.
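For example (the filename is illustrative; use whatever name you saved the script under):

    # run the benchmark with 10 threads and 200 requests per thread
    python3 rps_test.py --Threads 10 --Requests 200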

### Versions

PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-1022-aws-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 515.43.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 2499.998
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 35.8 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] open-clip-torch==2.18.0
[pip3] torch==1.13.0+cu117
[pip3] torchaudio==0.13.0+cu117
[pip3] torchvision==0.14.0+cu117
[pip3] triton==2.0.0
[conda] Could not collect



cc @ptrblck
ezyang commented 1 year ago

Are you able to easily test pytorch 2 as well?

vicilliar commented 1 year ago

Yes, I will do so and send the results here.

vicilliar commented 1 year ago

I tested the latest nightly build (Torch 2.1.0) using this exact method. Here are the updated packages I used:

    torch==2.1.0.dev20230820+cu121
    torchaudio==2.1.0.dev20230821+cu121
    torchvision==0.16.0.dev20230821+cu121

Generally, my results are the same: PyTorch 2.1 is worse than 1.12.1 in terms of text encoding RPS when using more than 3 threads. Is there an explanation for this?

Comparison between 1.12.1+cu113 and 2.1.0+cu121 (5 threads)

Torch version  |   1.12.1    |   2.1.0
RPS            |   119.96    |  96.2299  
Latency (s)    |   0.03786   |   0.04894

Overall RPS table comparing different versions:

[screenshot: overall RPS table comparing versions]
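A rough profiling sketch, assuming the `model`, `tokenizer`, and `DEVICE` objects from the repro script above (`torch.profiler` is available in both 1.12/1.13 and 2.x), that could help narrow down which ops account for the extra time:

    # Rough sketch: profile encode_text so per-op CPU/CUDA time can be compared
    # between versions. Assumes model, tokenizer and DEVICE from the script above.
    from torch.profiler import profile, ProfilerActivity

    tokens = tokenizer("hello world").to(DEVICE)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad(), torch.cuda.amp.autocast():
            for _ in range(50):
                model.encode_text(tokens)
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))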
ezyang commented 1 year ago

If single thread perf is good, is it possible that we're failing to use multiple threads on the recent version? Similar problems: https://github.com/pytorch/pytorch/issues/99625
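One way to check is to print the threading configuration on both versions; a minimal sketch using only public torch APIs:

    # Sketch: compare the threading configuration across versions
    # (run the same snippet on 1.12.1 and on 2.1.0 / 1.13.0).
    import torch

    print("torch:", torch.__version__)
    print("intra-op threads:", torch.get_num_threads())
    print("inter-op threads:", torch.get_num_interop_threads())
    print(torch.__config__.parallel_info())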