superjamessyx / Generative-Foundation-AI-Assistant-for-Pathology

Data and model release? #1

Open OmkarThawakar opened 1 year ago

OmkarThawakar commented 1 year ago

Dear Authors,

Thanks for the wonderful work. When are you releasing the PathCap and PathInstruct datasets and the model checkpoints?

superjamessyx commented 1 year ago

Hello, we have now released all the tools for data generation (sub-image detection, pathology image selection, pathology caption split & refine llama tool, and image-text extraction tool), as well as the preprocessed data from PubMed (https://huggingface.co/datasets/jamessyx/pubmed_pathology_single_nocommon, https://huggingface.co/datasets/jamessyx/pubmed_pathology_single_common).

We are currently processing pathology data from LAION (which was not included before) to expand both PathCap and PathInstruct, aiming for a total volume of more than 250K. By the end of this month, we will release the complete dataset, which includes the LAION data, along with the CLIP model trained on it. Stay tuned for more updates.
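
For readers who want to try the two PubMed subsets linked above, here is a minimal sketch of loading them with the Hugging Face `datasets` library. The helper names `repo_id_from_url` and `load_pubmed_subsets` are hypothetical and not part of this repository, and the split and column names of the datasets are not stated in this thread, so inspect the returned objects before relying on any particular layout.

```python
from urllib.parse import urlparse

# The two dataset URLs quoted in the comment above.
PUBMED_DATASET_URLS = [
    "https://huggingface.co/datasets/jamessyx/pubmed_pathology_single_nocommon",
    "https://huggingface.co/datasets/jamessyx/pubmed_pathology_single_common",
]


def repo_id_from_url(url: str) -> str:
    """Turn a Hugging Face dataset URL into its 'owner/name' repo ID."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_pubmed_subsets(urls=PUBMED_DATASET_URLS):
    """Download each subset and return a dict keyed by repo ID (hypothetical helper)."""
    # Imported lazily so the URL helper works without the library installed.
    from datasets import load_dataset

    subsets = {}
    for url in urls:
        repo_id = repo_id_from_url(url)
        # Assumption: the default configuration is what you want.
        subsets[repo_id] = load_dataset(repo_id)
    return subsets
```

Calling `load_pubmed_subsets()` needs network access and `pip install datasets`; printing each returned object shows which splits and columns are actually available.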

Meijian commented 1 year ago

@superjamessyx Just wanted to follow up on this issue. Will the 8 pathology models for comprehending images be released on Hugging Face for testing purposes? If they will, is there a timeline for it? Thanks!

zhyhan commented 11 months ago

Thanks for the foundational work. Where can we download the complete dataset?

sebastianffx commented 10 months ago

@superjamessyx Following up on this, will the complete dataset be available soon? If so, what would the usage licence and terms and conditions be? Thanks!

superjamessyx commented 10 months ago

Yes, the data will be made publicly available shortly. There will be three parts to this data:

  1. Approximately 220K processed samples from PubMed will be released directly. I believe that half of this data can be used for commercial purposes, while the other half is strictly for academic use. I will verify and label them accordingly.
  2. About 10K data points from professional pathology websites. Due to copyright restrictions, we'll release the captions and their respective links, enabling you to download the images independently.
  3. The final portion relates to books. We'll provide a list of book titles, and you can employ our tool for further processing.

msobrevillac commented 9 months ago

Hello!

Thanks for the foundational work! I would like to know which PDF-to-HTML tool you used. I tried a sample with pdftohtml and ran your code, but I could not make it work.

Thanks!

superjamessyx commented 9 months ago

Sorry for the confusion. Before converting to HTML, we first use Adobe to convert the PDF into a Word document. Then, we use Pandoc to convert the Word document into HTML. The details are already updated in the corresponding readme. Feel free to ask if you have any other questions later on.
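
The Word-to-HTML step described above can be sketched roughly as follows. This is only an illustration that assumes the `pandoc` command-line tool is installed and that the PDF has already been exported to `.docx`; the function names and the `media` output folder are hypothetical, and the exact options used by the authors are documented in the readme, not here.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path, html_path, media_dir=None):
    """Build the pandoc invocation that converts a .docx file to HTML."""
    cmd = [
        "pandoc", str(docx_path),
        "--from", "docx",
        "--to", "html",
        "--output", str(html_path),
    ]
    if media_dir is not None:
        # Assumption: embedded images should be written out as separate files
        # so that a later image-text extraction step can find them.
        cmd.append(f"--extract-media={media_dir}")
    return cmd


def convert_docx_to_html(docx_path, out_dir):
    """Run pandoc on one Word document and return the path of the HTML file."""
    docx_path = Path(docx_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    html_path = out_dir / (docx_path.stem + ".html")
    cmd = build_pandoc_command(docx_path, html_path, out_dir / "media")
    subprocess.run(cmd, check=True)  # raises CalledProcessError if pandoc fails
    return html_path
```

For example, `convert_docx_to_html("book_chapter.docx", "html_out")` would write `html_out/book_chapter.html`, where `book_chapter.docx` is a placeholder file name.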

ZeyuGaoAi commented 5 months ago

@superjamessyx Following up on this, will the YOLOv7 weights be released on Hugging Face soon? Thanks for your contribution!

superjamessyx commented 5 months ago

Hi, the weights for YOLOv7 are already included in this GitHub repository. Since the weights are relatively small, we placed them directly in the repository for convenience. Feel free to use them!