mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License

Error happened with load data #314

Open StefanIsSmart opened 2 months ago

StefanIsSmart commented 2 months ago

Describe the bug An error occurs while loading the data.

To Reproduce Steps to reproduce the behavior:

from tdc.single_pred import Yields
data = Yields(name = 'Buchwald-Hartwig')
split = data.get_split()

Expected behavior

Get a dataframe.

Screenshots [Screenshot 2024-09-16 10:33:08 AM]

Environment:

Additional context [Screenshot 2024-09-16 10.34.25 AM.png, upload not completed]

flogrammer commented 1 month ago

I'm having the same issue. [image]

Any ideas?

flogrammer commented 1 month ago

The problem seems to be that the downloaded .tab file is empty (zinc.tab in my case).
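
To check whether other downloads are affected, something like the following could list zero-byte files. This is only a sketch: `find_empty_files` is a hypothetical helper, not part of TDC, and `./data` is assumed to be the directory the datasets were saved to (adjust it if a different `path` was passed to the loader).

```python
from pathlib import Path


def find_empty_files(data_dir="./data"):
    """Return all zero-byte files under data_dir, sorted by path.

    Hypothetical helper, not part of TDC. "./data" is an assumed
    download directory.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Print every empty file so it can be inspected or deleted by hand.
    for path in find_empty_files():
        print(path)
```

Deleting an empty file before retrying seems sensible, since a leftover empty file may otherwise be picked up as an existing local copy.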

mxfly14 commented 1 month ago

Hi, I have the same issue (same message and an empty .tab file). When I run it in my terminal, I get this: [image]. Could it be a bad request to https://dataverse.harvard.edu/ ?

Arslan-Masood commented 1 month ago

If you just want the data, you can download it directly from here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/21LKWG

jepdavidson commented 1 month ago

Hi,

I am seeing the same (misleading) "TDC is hosted in Harvard Dataverse and it is currently under maintenance" message. As @flogrammer and @mxfly14 said, this appears to be due to empty files being retrieved.

The underlying cause (in my environment at least) is that the GET request returns a 202 response instead of 200. Here's the code for the dataverse_download function (from tdc.utils.load):

def dataverse_download(url, path, name, types, id=None):
    """dataverse download helper with progress bar

    Args:
        url (str): the url of the dataset
        path (str): the path to save the dataset
        name (str): the dataset name
        types (dict): a dictionary mapping from the dataset name to the file format
    """
    if id is None:
        save_path = os.path.join(path, name + "." + types[name])
    else:
        save_path = os.path.join(path, name + "-" + str(id) + "." + types[name])
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get("content-length", 0))
    block_size = 1024
    progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
    with open(save_path, "wb") as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()

With a 202 status, response.iter_content() yields nothing, so the function ends up writing an empty file. The 202 status can be reproduced like this:

import requests
r = requests.get("https://dataverse.harvard.edu/api/access/datafile/4267146")
print(r.status_code)

202
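
One way to avoid silently writing an empty file would be to check the status code before streaming and to fail loudly if no data arrives. The sketch below is not TDC code: `download_checked`, `EmptyDownloadError`, `max_retries` and `wait_seconds` are hypothetical names, and retrying after a 202 is only an assumption, since nothing in this thread establishes why Dataverse returns 202 or whether a later request would succeed.

```python
import os
import time


class EmptyDownloadError(RuntimeError):
    """Raised when no attempt returned a 200 response with a body."""


def download_checked(url, save_path, get=None, max_retries=3, wait_seconds=2.0):
    """Download url to save_path and return the number of bytes written.

    Hypothetical helper, not part of TDC. `get` defaults to requests.get
    and can be swapped for a stub with the same call shape.
    """
    if get is None:
        import requests  # imported lazily so a stub can be used instead

        get = requests.get

    last_status = None
    for attempt in range(max_retries):
        response = get(url, stream=True)
        last_status = response.status_code
        if last_status == 200:
            bytes_written = 0
            with open(save_path, "wb") as file:
                for chunk in response.iter_content(1024):
                    file.write(chunk)
                    bytes_written += len(chunk)
            if bytes_written > 0:
                return bytes_written
            os.remove(save_path)  # do not leave an empty file behind
        # Assumption: a 202 (or an empty body) might resolve on a later try.
        if attempt < max_retries - 1:
            time.sleep(wait_seconds)
    raise EmptyDownloadError(
        f"no data received from {url} (last status: {last_status})"
    )
```

Called as `download_checked("https://dataverse.harvard.edu/api/access/datafile/4267146", "data.tab")`, this would raise an `EmptyDownloadError` reporting the 202 rather than leaving an empty file that later triggers the "under maintenance" message.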

Strangely, the same behaviour is not observed when running in a Google Colab environment (I haven't figured out why yet). [image]

Kind regards

James