downloading datas is very low... - Githubissues

microsoft / LLaVA-Med

Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities.

downloading datas is very low... #31

Closed zhoucz97 closed 7 months ago

zhoucz97 commented 1 year ago

Downloading the data is very slow for me, especially the final image data. Would you consider hosting the data on Google Cloud Drive?

LaoRann commented 1 year ago

I also think the provided script is too slow to download all the data from the sources. A Google Cloud Drive mirror is necessary to get the image data.

tgisaturday commented 11 months ago

Hi guys, I've made a PR #38 that supports multiprocessing for downloading. Hope this helps.
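
For readers who want the general idea without opening the PR, here is a minimal sketch of parallel downloading. It is not the code from PR #38: it uses a thread pool rather than multiprocessing (downloads are I/O-bound, so threads are usually enough), and `download_all` and the `fetch` callback are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    workers: int = 8,
) -> Tuple[List[str], List[Tuple[str, Exception]]]:
    """Call fetch(url) for every URL concurrently.

    Returns (succeeded, failed), where failed pairs each URL with the
    exception it raised, so failed items can be retried later.
    """
    succeeded: List[str] = []
    failed: List[Tuple[str, Exception]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its URL so results can be attributed.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from fetch
                succeeded.append(url)
            except Exception as err:
                failed.append((url, err))
    return succeeded, failed
```

Any function that downloads a single URL can be passed as `fetch`, and the URLs in the returned `failed` list can be fed back in for another pass.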

yihp commented 10 months ago

@tgisaturday @zhoucz97 @LaoRann @jwyang @tnaumann How did you download it in the end? One more question: do you have the complete dataset file?

Dcas89 commented 10 months ago

Same as above, downloading the data/image files is incredibly difficult. Would be ideal if the images referenced for each stage (e.g. alignment, instruction tuning, eval) are organised into their respective categories.

ankithbala commented 9 months ago

What is the download size of this dataset? Do we need all of PMC 15M for "Stage 1 (Optional): Medical Concept Alignment"?

displaywz commented 9 months ago

As far as I know, Microsoft has not allowed PMC-15M to be made public. We need to download the original files with the script and extract the images for training and fine-tuning.

CinKKKyo commented 7 months ago

I'm trying to download the raw PMC papers and image data the way the project describes, and the total looks very large. There are about 720,000 samples in total, but so far I have only downloaded the tar.gz archives for 120,000 of them, and those alone take up almost 500 GB of disk space. The raw data is so large that I'm not sure I'm doing this right.

displaywz commented 7 months ago

Downloading the raw files this way is not a good idea. I tried to improve the download script by enabling parallel downloads. Even so, it still takes close to a week and totals nearly 3.6 TB of compressed PMC files. The usable data is only a few tens of gigabytes of images, significantly less than the data provided in the Llava-med-preview repository. I therefore suggest considering an alternative approach.
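
As a rough sanity check on these numbers, the sketch below extrapolates the archive size linearly from the partial download reported earlier in this thread (about 500 GB for 120,000 of 720,000 samples) and converts 3.6 TB in one week into a sustained transfer rate. Linear scaling and decimal units (1 TB = 1000 GB) are assumptions, and the function names are made up for illustration.

```python
def extrapolate_size_gb(done_samples: int, done_gb: float, total_samples: int) -> float:
    """Scale the size of a partial download linearly to the full sample count."""
    return done_gb * total_samples / done_samples


def sustained_rate_mb_per_s(total_tb: float, days: float) -> float:
    """Average transfer rate needed to move total_tb in the given number of days."""
    total_mb = total_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = days * 24 * 60 * 60
    return total_mb / seconds


if __name__ == "__main__":
    full_gb = extrapolate_size_gb(120_000, 500, 720_000)
    print(f"estimated full size: {full_gb / 1000:.1f} TB")
    print(f"rate for 3.6 TB in 7 days: {sustained_rate_mb_per_s(3.6, 7):.1f} MB/s")
```

The extrapolation gives about 3 TB, the same order of magnitude as the 3.6 TB reported here, and moving that much in a week works out to roughly 6 MB/s sustained.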

CinKKKyo commented 7 months ago

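@displaywz Do you have a good solution? I am also downloading with multiple threads. In addition, my server's network is slow and the connection is unstable, so some files fail to download. I use wget -c to resume interrupted downloads, but even so, some large files (a few hundred MB to about 1 GB) still cannot be downloaded in one go.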
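
A minimal Python sketch of the same resume-and-retry idea as `wget -c` (`--continue`), assuming the files are served over HTTP(S) by a server that honours `Range` requests. `range_header` and `download_with_resume` are hypothetical helpers, not part of the LLaVA-Med scripts.

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def range_header(dest: str) -> dict:
    """Build a Range header asking only for the bytes not yet on disk."""
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    return {"Range": f"bytes={done}-"} if done else {}


def download_with_resume(url: str, dest: str, retries: int = 5, chunk: int = 1 << 20) -> None:
    """Download url to dest, continuing from the partial file after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        request = urllib.request.Request(url, headers=range_header(dest))
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # 206 means the server honoured the Range header, so append.
                # Any other success status means the full body is being sent
                # again, so overwrite the partial file.
                mode = "ab" if response.status == 206 else "wb"
                with open(dest, mode) as out:
                    while block := response.read(chunk):
                        out.write(block)
            return  # body was read to the end without an exception
        except urllib.error.HTTPError as err:
            if err.code == 416:
                # Requested range starts at or past the end of the file;
                # this sketch assumes that means the file is already complete.
                return
            last_error = err
        except (OSError, http.client.HTTPException) as err:
            # Timeouts, resets and truncated bodies: keep the partial file.
            last_error = err
        time.sleep(min(2 ** attempt, 60))  # back off before the next attempt
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Because the partial file is kept between attempts, each retry only has to fetch the remaining bytes, which matters most for the larger archives on an unstable connection.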

displaywz commented 7 months ago

@CinKKKyo Downloading with the method the repository provides may not be a good idea, because the resulting data may not reproduce the pre-training results the repository reports. The script I used to download the data is on a disk in the lab; I will find it later and share it with you. If it is convenient, you can leave your WeChat ID.

Eldo-rado commented 6 months ago

Hi! 👋
Thank you for your suggestion, @displaywz. I would like to confirm whether it is true that the author did not release the weights for the first stage of training. Also, if it is convenient, could you please send your WeChat ID to my email address, 1289560160@qq.com?

zihui-debug commented 6 months ago

@displaywz Hello, could I contact you about the script for downloading the data? If convenient, please send your WeChat ID to my email, 2192325557@qq.com. Thanks!

linxinda commented 5 months ago

@displaywz Hello, could I contact you about the script for downloading the data? If convenient, please send your WeChat ID to my email, 873614651@qq.com. Thanks!

ZL315120310 commented 4 months ago

@displaywz Hello, could I contact you about the script for downloading the data? If convenient, please send your WeChat ID to my email, 315120310@qq.com. Thanks!