razorx89 / roco-dataset

Radiology Objects in COntext (ROCO): A Multimodal Image Dataset

Need ur help #18

Closed: 90000988 closed this issue 3 months ago

90000988 commented 6 months ago

Could you please let me know how you downloaded the data from PubMed? And why is this URL not working: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&email=johannes.rueckert@fh-dortmund.de&id= ? Thank you.

saviola777 commented 6 months ago

Hello,

The URL is working; it just needs a PMCID appended after id=, e.g.

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&email=your.email@example.com&id=PMC4608653

It returns an XML response that includes the FTP link to the article archive, which we then use to download the article.

If you are writing your own script to download articles from PMC, please make sure to use your own e-mail address in the link instead of mine.
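For anyone scripting this, here is a minimal Python sketch of the lookup described above. It assumes the XML response lists its download links as `<link format="..." href="..."/>` elements and that the archive link uses the format value `tgz`; the function names and the `my-pmc-fetch` tool name are hypothetical placeholders, not part of the ROCO code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OA_ENDPOINT = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def build_oa_url(pmcid, email, tool="my-pmc-fetch"):
    """Build the lookup URL for one PMCID (use your own e-mail and tool name)."""
    query = urlencode({"tool": tool, "email": email, "id": pmcid})
    return f"{OA_ENDPOINT}?{query}"


def parse_oa_links(xml_data):
    """Map each link format to its href.

    Assumption: links appear as <link format="..." href="..."/> elements.
    """
    root = ET.fromstring(xml_data)
    return {link.get("format"): link.get("href") for link in root.iter("link")}


def fetch_archive_link(pmcid, email):
    """Return the archive link for a PMCID, or None if the response has none."""
    with urlopen(build_oa_url(pmcid, email), timeout=30) as response:
        links = parse_oa_links(response.read())
    # Assumption: the .tar.gz package is listed under the format "tgz".
    return links.get("tgz")
```

Calling `fetch_archive_link("PMC4608653", "your.email@example.com")` performs one request; if you loop over many PMCIDs, add a pause between requests.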

For more information, please refer to the documentation:

90000988 commented 6 months ago

Thank you for your response. Bro, could you do me a favor, please? I am going to do a project in the medical domain. If you don't mind, could you guide me on how to download these medical images, semantic types, CUIs, etc. from PubMed? It is my first project, and I didn't have any idea until you sent those URLs; I really appreciate it. Please, bro, help me download this data. Thank you in advance.

saviola777 commented 6 months ago

I'm not sure what exactly you are trying to do. If you just need the dataset, you can check out ROCOv2: https://zenodo.org/records/8333645.

90000988 commented 6 months ago

Bro, I am going to do a project that is different from this one. I need this data from PubMed; could you please guide me on how to search for and get this dataset from PubMed? 🙏

On Sun, 24 Mar, 2024, 9:26 pm saviola777, @.***> wrote:

I'm not sure what exactly you are trying to do. If you just need the dataset, you can check out ROCOv2 https://zenodo.org/records/8333645.

saviola777 commented 6 months ago

The only things you can get directly from PubMed are the articles, images, and image captions.

Good luck.

90000988 commented 6 months ago

Thanks for the tip!

On Tue, 26 Mar, 2024, 11:05 am saviola777, @.***> wrote:

The only things you can get directly from PubMed are the articles, images, and image captions.

  • first you download the archives from the FTP / AWS
  • then you extract the images
  • then you extract the captions for the images from the NXML file
  • then you need to classify the images to keep only non-compound radiological images https://gitlab.com/saviola/rocov2-code/-/tree/main/roco-2018
  • then you need to clean the captions
  • then you need to extract CUIs from the captions using something like MedCAT https://github.com/CogStack/MedCAT
  • then you need to filter CUIs otherwise you get lots of nonsense

Good luck.
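As an illustration of the caption step in the list above, here is a minimal Python sketch that pulls figure captions out of an NXML file. It assumes the usual JATS layout, where each `<fig>` holds a `<caption>` and a `<graphic>` whose `xlink:href` names the image file; `extract_captions` is a hypothetical helper, not code from the ROCO repository, and the classification, cleaning, and CUI steps are not covered.

```python
import xml.etree.ElementTree as ET

# Fully qualified attribute name for xlink:href in ElementTree notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_captions(nxml_data):
    """Return {graphic href: caption text} for every <fig> in an NXML document.

    Assumption: figures follow the JATS layout <fig><caption/><graphic/></fig>.
    """
    root = ET.fromstring(nxml_data)
    captions = {}
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        caption = fig.find("caption")
        if graphic is None or caption is None:
            continue  # skip figures without an image reference or a caption
        href = graphic.get(XLINK_HREF)
        if href is None:
            continue
        # Flatten inline markup (italic, sup, ...) and collapse whitespace.
        captions[href] = " ".join("".join(caption.itertext()).split())
    return captions
```

Passing the raw bytes of the `.nxml` file (`open(path, "rb").read()`) lets the parser honor the encoding declared in the file. Matching each href to an image file in the archive is left out here, because the href may or may not carry a file extension.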

90000988 commented 6 months ago

Bro, thanks for your reply. Could I ask one more thing? When I tried to download the images from PubMed, the articles were downloaded, but they don't contain any images. Could you guide me, please? Thanks in advance.

saviola777 commented 6 months ago

I guess you are downloading the wrong archives.

$ wget -r ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz
[…]
2024-03-27 11:42:12 (1,20 MB/s) - ‘ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz’ saved [701386]
$ tar xvf ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz
PMC1660580/
PMC1660580/1746-160X-2-40-6.jpg
PMC1660580/1746-160X-2-40-2.gif
PMC1660580/1746-160X-2-40-3.jpg
PMC1660580/1746-160X-2-40-6.gif
PMC1660580/1746-160X-2-40-8.jpg
PMC1660580/1746-160X-2-40-7.jpg
PMC1660580/1746-160X-2-40-8.gif
PMC1660580/1746-160X-2-40-4.gif
PMC1660580/1746-160X-2-40-4.jpg
PMC1660580/1746-160X-2-40-1.jpg
PMC1660580/1746-160X-2-40-3.gif
PMC1660580/1746-160X-2-40-5.gif
PMC1660580/1746-160X-2-40-5.jpg
PMC1660580/1746-160X-2-40-2.jpg
PMC1660580/1746-160X-2-40.pdf
PMC1660580/1746-160X-2-40-7.gif
PMC1660580/1746-160X-2-40.nxml

All images and nxml are there.
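To check a downloaded archive from a script rather than by eye, a small Python sketch like the following lists the image and NXML members of a `.tar.gz` package. `list_article_files` is a hypothetical helper, and keeping only `.jpg` files is an assumption based on the listing above, where every figure appears as both `.jpg` and `.gif`.

```python
import tarfile
from pathlib import PurePosixPath


def list_article_files(archive_path, image_suffixes=(".jpg",)):
    """Return (image member names, NXML member names) found in the archive."""
    images, nxml = [], []
    with tarfile.open(archive_path, "r:gz") as tar:
        for name in tar.getnames():
            suffix = PurePosixPath(name).suffix.lower()
            if suffix in image_suffixes:
                images.append(name)
            elif suffix == ".nxml":
                nxml.append(name)
    return sorted(images), sorted(nxml)
```

If the image list comes back empty, the file you downloaded is probably not the article package (for example, a PDF or an XML-only download rather than the `.tar.gz`).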