ibm-aur-nlp / PubLayNet

raw pdf files #3

Open kailigo opened 5 years ago

kailigo commented 5 years ago

Nice work, and thank you for the effort of making the dataset publicly available.

I am working on document object detection now and would like to use the content of the documents to help boost detection performance. So I need the raw PDF files (from which you generated the images) to extract some content information. Could you please release them as well?

I think one of the major differences between object detection in document images and in natural images is that documents contain auxiliary text information that is absent in natural images. Incorporating this auxiliary information should help achieve better detection results. It should also benefit some NLP+CV tasks. Thanks.

zhxgj commented 5 years ago

@kailigo it is a good point. I do have the PDFs, but need to confirm with our legal team if I can share them because the original discussions were only around images.

kailigo commented 5 years ago

@zhxgj Thanks. Looking forward to hearing your update.

kailigo commented 5 years ago

Hi @zhxgj. It seems that the PDF files are publicly available. I can download them myself if the legal permission takes a long time. Could you help me identify which PDF files you used in your dataset, for example by providing an index or the file names? Thanks very much.

kailigo commented 5 years ago

Hi @zhxgj, I have downloaded the raw PDFs from the PMC website. How can I find the correspondence between the PDF files and the images? Apparently you renamed the images when they were generated from the PDF files. Could you tell me your rules for naming the images? Thanks.

zhxgj commented 5 years ago

Hi @kailigo, it is great that you managed to download the PDFs. The filenames of the images in PubLayNet are formatted as "_.pdf"

dwalton76 commented 5 years ago

@zhxgj any update on making the pdfs available for download? Thanks!

zhxgj commented 5 years ago

@dwalton76 We are working with our legal team to host the data on a different platform that supports the AWS S3 API. The PDF pages are part of the conversation. Once all the legal processes are approved, I think we should have the current data and the PDF pages available on the new platform.

ghost commented 4 years ago

@kailigo Upload a link to your downloaded pdf dataset

theCodinCowboy commented 4 years ago

Did y'all ever release the PDF dataset that corresponds to the images and annotations?

zhxgj commented 4 years ago

Hi @theCodinCowboy , I have the PDFs ready to release. I will follow up with our legal team to double check if I can share them

theCodinCowboy commented 4 years ago

What's the ETA on when legal will have a judgement, @zhxgj? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.

zhxgj commented 4 years ago

> What's the ETA on when legal will have a judgement, @zhxgj? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.

We have submitted a ticket for approval. Normally this is reviewed quickly, but I do not think it will be done this week. My best guess is in two weeks.

theCodinCowboy commented 4 years ago

Hi @zhxgj just following up on this request. Any updates?

YueshangGu commented 4 years ago

Hi @theCodinCowboy, I have downloaded some raw PDFs from PMC, but some papers have been retracted or updated, and I have not found the old versions of those PDFs. Can you find these papers? @zhxgj, could you release all the raw PDFs you used in this dataset? It is difficult to download retracted or updated papers and to locate all the updated papers in this dataset. Thanks~

zhxgj commented 4 years ago

Hi @theCodinCowboy @YueshangGu, our legal team has approved the release of the PDFs. I am chasing up our open data team to get the PDFs online. It shouldn't take long.

YueshangGu commented 4 years ago

@zhxgj Thank you for your help. In addition to the raw PDFs, could you release all the raw XML files you used for this dataset? Thanks~

zhxgj commented 4 years ago

The PDFs of the document pages in PubLayNet have been released.

zhxgj commented 4 years ago

> @zhxgj Thank you for your help. In addition to the raw PDFs, could you release all the raw XML files you used for this dataset? Thanks~

@ajjimeno would you please be able to check with @kmh4321 on this?