jianliu-ml / Multimedia-EE

Issues about fetching the VOA image-text pairs dataset #4

Closed MartinYuanNJU closed 1 year ago

MartinYuanNJU commented 1 year ago

Thanks for your wonderful work! I wonder how you obtained all the images of the VOA image-caption dataset you used for pre-training. Did you use https://github.com/limanling/m2e2/blob/master/src/dataflow/numpy/dataset_image_download.py, the script provided by the authors of the VOA image-text pairs dataset, to download them? On my machine the estimated download time is around 900 hours, which is really slow. I asked a friend of mine who is in the U.S. to help download the dataset, and for him the estimate is about 400 hours, which is still slow. I wonder whether you have the whole dataset or a faster way to download these images. Thanks a lot!

Kuangdd01 commented 1 year ago

Hello, maybe you can try downloading the VOA files with multiple threads? Here is the code I used before:

import os
import json
import requests
from tqdm import tqdm
from PIL import Image, ImageFile
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# Some VOA images are truncated; let PIL open them anyway.
ImageFile.LOAD_TRUNCATED_IMAGES = True

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}

def download_image(url_path_tuple):
    # Download one image and re-save it at low JPEG quality to save disk space.
    url_image, path_save = url_path_tuple
    try:
        r = requests.get(url_image, headers=headers, timeout=30)
        r.raise_for_status()
        with open(path_save, "wb") as f:
            f.write(r.content)
        # Re-encode with quality=20 to shrink the files (the quality flag only
        # affects JPEG output; other formats are saved unchanged).
        im = Image.open(path_save)
        im.save(path_save, quality=20)
    except Exception as e:
        print('[WARN] failed to download %s: %s' % (url_image, e))

def download_image_list(meta_json, dir_save):
    # Read the VOA metadata json and build parallel lists of urls and save paths.
    if not os.path.exists(meta_json):
        raise FileNotFoundError('input_metadata_json does not exist: %s' % meta_json)
    os.makedirs(dir_save, exist_ok=True)
    with open(meta_json) as f:
        metadata = json.load(f)
    url_list = []
    save_paths = []
    for doc_id in tqdm(metadata):
        for img_id in metadata[doc_id]:
            url_image = metadata[doc_id][img_id]['url']
            suffix_image = url_image.split('.')[-1]
            image_path_save = os.path.join(dir_save, '%s_%s.%s' % (doc_id, img_id, suffix_image))
            url_list.append(url_image)
            save_paths.append(image_path_save)
    return url_list, save_paths

if __name__ == '__main__':
    input_metadata_json = "./voa_img_dataset.json"
    output_image_dir = "./voa_image"
    url_list, save_paths = download_image_list(input_metadata_json, output_image_dir)

    # Fetch the images concurrently; the work is I/O-bound, so threads help a lot.
    executor = ThreadPoolExecutor(max_workers=30)
    future_tasks = [executor.submit(download_image, url_path_tuple) for url_path_tuple in zip(url_list, save_paths)]
    wait(future_tasks, return_when=ALL_COMPLETED)

    print("All images download complete.")
MartinYuanNJU commented 1 year ago

@Kuangdd01 Thanks a lot! About a week ago I noticed that the original code provided by m2e2 uses only a single thread, so I changed the download script to a multi-threaded version, which boosted the download speed remarkably. I forgot to close this issue at the time, but thanks anyway for your advice; someone else may well run into the same problem in the future, and your suggestion will solve it. Thanks for your help!
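
For anyone adapting the m2e2 script the same way, here is a minimal sketch of that kind of change (fetch_one below is a hypothetical stand-in for the body of the single-threaded download loop in dataset_image_download.py; skipping files that already exist also lets an interrupted run resume):

import os
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_one(url, save_path):
    # Hypothetical per-image step standing in for the original loop body.
    if os.path.exists(save_path):  # already downloaded -> cheap resume
        return
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(save_path, "wb") as f:
            f.write(r.content)
    except Exception as e:
        print('[WARN] failed to download %s: %s' % (url, e))

def fetch_all(pairs, workers=16):
    # pairs: iterable of (url, save_path) tuples; threads suit this I/O-bound job.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(lambda p: fetch_one(*p), pairs))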

Kuangdd01 commented 1 year ago

You are welcome. By the way, could you tell me whether you have finished reproducing the related work WASE and this code repository?

MartinYuanNJU commented 1 year ago

Sorry, I've only just downloaded the dataset and haven't started reproducing these two papers yet.