vicilliar opened this issue 1 year ago
Are you able to easily test PyTorch 2 as well?
Yes, I will do so and send the results here.
I tested the latest nightly build (Torch 2.1.0) using this exact method. Here are the updated packages I used:
```
torch==2.1.0.dev20230820+cu121
torchaudio==2.1.0.dev20230821+cu121
torchvision==0.16.0.dev20230821+cu121
```
Generally, my results are the same: PyTorch 2.1 is worse than 1.12.1 in terms of text encoding RPS when using more than 3 threads. Is there an explanation for this?
Comparison between 1.12.1+cu113 and 2.1.0+cu121 (5 threads)
| Torch version | 1.12.1 | 2.1.0 |
|---|---|---|
| RPS | 119.96 | 96.2299 |
| Latency (s) | 0.03786 | 0.04894 |
Overall RPS table comparing different versions: *(table attached as an image)*
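One thing worth ruling out first is a difference in runtime defaults between the two builds. Below is a minimal check (plain PyTorch calls only, nothing specific to the benchmark script in this issue) that can be run on both machines; all of these attributes exist in both 1.12 and 2.1:

```python
import torch

# Print the threading- and backend-related runtime defaults, so the two
# installs can be diffed directly.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())
print("cuDNN enabled:", torch.backends.cudnn.enabled,
      "| benchmark:", torch.backends.cudnn.benchmark)
print("TF32 (matmul):", torch.backends.cuda.matmul.allow_tf32,
      "| TF32 (cuDNN):", torch.backends.cudnn.allow_tf32)
```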
If single thread perf is good, is it possible that we're failing to use multiple threads on the recent version? Similar problems: https://github.com/pytorch/pytorch/issues/99625
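One way to test that hypothesis: time the same total workload run sequentially and split across threads. If the threaded wall time approaches the sequential time, the threads are effectively serialized. A rough sketch, assuming `inference_time_on_text` from the repro script below is in scope (the timing harness itself is mine, not from the script):

```python
import threading
import time

def timed_requests(n_requests: int, results: list, idx: int) -> None:
    # Each worker makes n_requests text-encoding calls and records its total time.
    start = time.time()
    for _ in range(n_requests):
        inference_time_on_text("hello world")  # from the repro script below
    results[idx] = time.time() - start

def compare(n_threads: int = 5, n_requests: int = 100) -> None:
    # Sequential baseline: one thread does all the work.
    seq = [0.0]
    timed_requests(n_threads * n_requests, seq, 0)

    # Threaded run: the same total work split across n_threads.
    results = [0.0] * n_threads
    threads = [threading.Thread(target=timed_requests, args=(n_requests, results, i))
               for i in range(n_threads)]
    wall = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.time() - wall

    # With real overlap, threaded wall time should be well under the sequential time.
    print(f"sequential: {seq[0]:.2f}s, threaded wall: {wall:.2f}s")
```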
🐛 Describe the bug
There is a significant speed degradation in encoding text (using model `ViT-B-32/laion2b_s34b_b79k`) with multiple threads when upgrading PyTorch versions. I have been encoding both images and text with OpenCLIP models, and have found that when upgrading from Torch 1.12.1 to 1.13.0, encoding latency increases significantly when using multiple threads. Here is sample data collected with 5 threads:

*Text Encoding mean latency comparison (5 threads)*

*Text Encoding Requests per second comparison (5 threads)*
This degradation does not occur when encoding with a single thread or when encoding images. Does anyone have an explanation for why performance would degrade after upgrading PyTorch?
To recreate:

1. Start a machine with torch 1.12.1 installed
2. Start another machine with torch 1.13.0 installed
3. Install the other requirements on both machines
4. Run the following script, which encodes entries from the COCO dataset, on both machines:
print("torch version:", torch.version) print("CUDA version:", torch.version.cuda)
Test 1
TARGET_MODEL = "ViT-B-32" TARGET_PRETRAINED = "laion2b_s34b_b79k"
DEVICE = "cuda"
Configurable parameters
parser = argparse.ArgumentParser(description='Pytorch Performance Test RPS Script') parser.add_argument('--Threads', help='The number of threads', default = 5, type = int) parser.add_argument('--Requests', help='The number of requests', default = 100, type = int) args = parser.parse_args() NUM_THREADS = args.Threads NUM_REQUESTS = args.Requests
model, _, transform = create_model_and_transforms(model_name=TARGET_MODEL, pretrained=TARGET_PRETRAINED, device=DEVICE) tokenizer = get_tokenizer(TARGET_MODEL)
def inference_time_on_image(image_path: str) -> float: image = load_image_from_path(image_path) processed_image = transform(image).unsqueeze(0).to(DEVICE) start = time.time() with torch.no_grad(): if DEVICE.startswith("cuda"): with torch.cuda.amp.autocast(): image_features = model.encode_image(processed_image) else: image_features = model.encode_image(processed_image) elapsed_time = time.time() - start return elapsed_time
def load_image_from_path(image_path: str): """Loads an image into PIL from a string path that is either local or a url Args: image_path (str): Local or remote path to image. Returns: ImageType: In-memory PIL image. """ if os.path.isfile(image_path): img = Image.open(image_path) elif validators.url(image_path): with requests.get(image_path, stream=True) as resp: img = Image.open(resp.raw) return img
def inference_time_on_text(text: str) -> float: processed_text = tokenizer(text).to(DEVICE) start = time.time() with torch.no_grad(): if DEVICE.startswith("cuda"): with torch.cuda.amp.autocast(): text_features = model.encode_text(processed_text) else: text_features = model.encode_text(processed_text) elapsed_time = time.time() - start return elapsed_time
class RequestThread(threading.Thread): def init(self, queries, type):
Should be given list of queries (generated beforehand)
def warm_upcalls(): = inference_time_ontext("hello world") = inference_time_on_image( "https://raw.githubusercontent.com/marqo-ai/marqo/mainline/examples/ImageSearchGuide/data/image1.jpg")
def data_processed(time_list: List[float]) -> Tuple: sample_size = len(time_list)
def download_util( url: str, cache_dir: str = "./", ): buffer_size = 8192 if not cache_dir: cache_dir = os.path.expanduser(ModelCache.clip_cache_path) os.makedirs(cache_dir, exist_ok=True) filename = os.path.basename(url)
#######################################################
LOADING COCO DATASET
#######################################################
tmppath = './tmp/' dataDir='./tmp' dataType='train2014' annFile='{}/annotations/instances{}.json'.format(dataDir,dataType) capsFile = '{}/annotations/captions_{}.json'.format(dataDir,dataType)
annotations_url = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip" annotations_extract_path = f'{tmp_path}annotations' if os.path.exists(annotations_extract_path): print('Annotations already exist on disk. Skipping.') else: annotations_zip = download_util(annotations_url, tmp_path) with zipfile.ZipFile(annotations_zip, 'r') as zf: zf.extractall(tmp_path)
print('Annotations downloaded and extracted')
print("Loading COCO Dataset") coco=COCO(annFile) print("Loading COCO Caps File") coco_caps = COCO(capsFile) print("Loading COCO ANN File") coco_anns = COCO(annFile) img_id_list = coco.getImgIds()
def rps_test_for_type(type: str): if type not in ["text", "image"]: raise Exception("Invalid type")
def random_query(): """ Get a random ID and get the text and image for that ID """ img_id = random.choice(img_id_list) text = coco_caps.loadAnns(coco_caps.getAnnIds(img_id))[0]["caption"] image = coco_anns.loadImgs(img_id)[0]["coco_url"] return text, image
print('Generating random queries') queries = { "text": [], "image": [] } for i in tqdm(range(NUM_THREADS)): thread_text_queries = [] thread_imagequeries = [] for in tqdm(range(NUM_REQUESTS)): text, image = random_query() thread_text_queries.append(text) thread_image_queries.append(image) print(f"Generated (FOR THREAD {i}): {len(thread_text_queries)} texts and {len(thread_image_queries)} images.") queries["text"].append(thread_text_queries) queries["image"].append(thread_image_queries) print('Done generating queries')
def main(): warm_up_calls()
if name == "main": main()
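The bodies of `RequestThread`, `rps_test_for_type`, and `main` were truncated above. For completeness, here is a minimal sketch of how those pieces could fit together; this is a reconstruction under my own assumptions (each thread consumes its pre-generated query list, RPS computed as total requests over total wall time), not the author's exact code:

```python
# Minimal reconstruction of the truncated pieces, not the author's exact code.
class RequestThread(threading.Thread):
    def __init__(self, queries, type):
        super().__init__()
        self.queries = queries
        self.type = type
        self.times: List[float] = []

    def run(self):
        # Pick the encode function for this thread's modality.
        infer = inference_time_on_text if self.type == "text" else inference_time_on_image
        for query in self.queries:
            self.times.append(infer(query))


def rps_test_for_type(type: str):
    if type not in ["text", "image"]:
        raise Exception("Invalid type")
    threads = [RequestThread(queries[type][i], type) for i in range(NUM_THREADS)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_time = time.time() - start
    all_times = [t for worker in threads for t in worker.times]
    print(f"{type} RPS: {NUM_THREADS * NUM_REQUESTS / total_time:.2f}, "
          f"mean latency: {sum(all_times) / len(all_times):.5f}s")
```

With those pieces, `main()` would presumably call `rps_test_for_type("text")` and `rps_test_for_type("image")` after `warm_up_calls()`.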
```
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-1022-aws-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 515.43.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 2499.998
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 35.8 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] open-clip-torch==2.18.0
[pip3] torch==1.13.0+cu117
[pip3] torchaudio==0.13.0+cu117
[pip3] torchvision==0.14.0+cu117
[pip3] triton==2.0.0
[conda] Could not collect
```