ibm-aur-nlp / PubLayNet

Add additional categories #15

Open duklin opened 4 years ago

duklin commented 4 years ago

Is it possible to start training with additional categories such as: heading2, heading3, ..., image description, ...?

zhxgj commented 4 years ago

It is possible to do that. The heading levels and the captions of images/tables are in the XML files, and they can be linked to the PDFs. However, we currently have no plans to do this due to other commitments.
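
As a rough illustration of the point about the XML files, here is a minimal sketch of how heading levels and captions might be pulled out of one article. It assumes the XML follows the JATS layout used by PubMed Central (nested `sec` elements with a `title`, and `fig` / `table-wrap` elements with a `caption`) and that the tags carry no namespace prefix. The function name and the label names (`heading1`, `figure_caption`, ...) are made up for this example and are not part of PubLayNet. Linking the extracted text back to regions on the PDF page is a separate step that is not shown.

```python
import xml.etree.ElementTree as ET


def text_of(elem):
    """Concatenate all text under an element and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def extract_extra_labels(xml_string):
    """Return (label, text) pairs for section titles and captions.

    Assumption: JATS-style tags without namespaces. The heading level
    is taken from how deeply the <sec> element is nested.
    """
    root = ET.fromstring(xml_string)
    items = []

    def walk(elem, depth):
        for child in elem:
            if child.tag == "sec":
                title = child.find("title")
                if title is not None:
                    items.append((f"heading{depth + 1}", text_of(title)))
                walk(child, depth + 1)  # nested sections are one level deeper
            elif child.tag in ("fig", "table-wrap"):
                caption = child.find("caption")
                if caption is not None:
                    kind = "figure_caption" if child.tag == "fig" else "table_caption"
                    items.append((kind, text_of(caption)))
            else:
                walk(child, depth)  # keep looking inside body, p, etc.

    walk(root, 0)
    return items


sample = (
    "<article><body>"
    "<sec><title>Intro</title><p>text</p>"
    "<sec><title>Sub <italic>part</italic></title>"
    "<fig><label>Fig 1</label><caption><p>A plot.</p></caption></fig>"
    "</sec></sec>"
    "<table-wrap><caption><p>Results</p></caption></table-wrap>"
    "</body></article>"
)
labels = extract_extra_labels(sample)
```

The `sample` string is a hand-written toy document, not a real PMC article; real files are larger and may need extra handling (front matter, appendices, supplementary material).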

dijana-sagit commented 4 years ago

Hi @zhxgj, thank you for your research and for providing such a useful resource! I was wondering if you would be allowed to share the original XMLs of the PDF files collected from PubMed, or the file IDs so I can re-collect them myself in order to add extra classes? Regards

zhxgj commented 4 years ago

Hi @dijana-sagit, thanks for your interest. The XML and PDF files can be downloaded directly from the PubMed Central Open Access Subset via FTP. Here is the link: https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
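
For anyone re-collecting the files, a sketch along these lines may help; treat every specific in it as an assumption to check against the page linked above. It assumes that PubLayNet image names start with the PMC ID (for example `PMC1234567_00003.jpg`), that the Open Access Subset ships a CSV file list with `File` and `Accession ID` columns, and that the files are served from `ftp.ncbi.nlm.nih.gov` under `/pub/pmc/`. The function names are hypothetical, and the ID and path in the usage lines are invented placeholders.

```python
import csv
import io
from ftplib import FTP

FTP_HOST = "ftp.ncbi.nlm.nih.gov"  # assumption: verify on the linked FTP page
FTP_ROOT = "/pub/pmc/"             # assumption: verify on the linked FTP page


def pmcid_from_image_name(file_name):
    """'PMC1234567_00003.jpg' -> 'PMC1234567' (assumed naming scheme)."""
    return file_name.split("_", 1)[0]


def index_file_list(csv_text, id_column="Accession ID", path_column="File"):
    """Map each PMC ID to its package path, given the file list as CSV text.

    The column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column]: row[path_column] for row in reader}


def download_package(remote_path, local_path):
    """Fetch one package over anonymous FTP and write it to local_path."""
    with FTP(FTP_HOST) as ftp:
        ftp.login()  # anonymous login
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + FTP_ROOT + remote_path, out.write)


# Offline usage with a made-up two-line file list (placeholder values):
listing = "File,Accession ID\noa_package/aa/bb/PMC1234567.tar.gz,PMC1234567\n"
index = index_file_list(listing)
pmcid = pmcid_from_image_name("PMC1234567_00003.jpg")
package_path = index[pmcid]
```

Calling `download_package(package_path, "PMC1234567.tar.gz")` would then fetch the archive, which should contain the article XML alongside the PDF if the package layout matches these assumptions.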