Question about Original XML

dsdoermann commented 3 years ago

Does anyone know if there is a parser available that was used to take the original PDF files and convert them to the PubMed Open format?

titipata commented 3 years ago

Hi @dsdoermann, as far as I know there is no tool available today that converts PDFs to the original PubMed XML format. If you only want to analyze open access articles, you can download the bulk XML for the OA subsets.

For PDF articles, a tool such as GROBID can take a scientific PDF and return it in XML format. I wrote a Python wrapper for GROBID here: https://github.com/titipata/scipdf_parser, but it is not currently up to date.

dsdoermann commented 3 years ago

Thank you. Do you think they created all the XML by hand for Open Access? I'm interested in just having something (since you already have the XML parser!!! And thank you!) where I can segment the PDF, get the figures, etc.

David Doermann, Professor, SUNY Empire Innovation Professor, Department of Computer Science and Engineering, University at Buffalo

titipata commented 3 years ago

I'm not sure how the open access XML is generated. For the OA subsets, though, you can download the figures corpus directly. I think I have some documentation on how to download the figures corpus for the OA subsets somewhere in the repo; I can link to it soon! You can extract data such as captions directly from the original XML file and link them to the figures.
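To make the caption-to-figure linking concrete, here is a minimal sketch using only the Python standard library. It assumes the OA XML follows the usual JATS layout, where each `<fig>` element holds a `<label>`, a `<caption>`, and a `<graphic>` whose `xlink:href` attribute names the image. The function names `extract_figures` and `clean` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xlink:href attribute on <graphic> elements.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def clean(element):
    """Collapse all text inside an element into one clean string."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_figures(jats_xml):
    """Return one dict per <fig> with its id, label, caption, and graphic reference."""
    root = ET.fromstring(jats_xml)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        figures.append({
            "fig_id": fig.get("id"),
            "label": clean(fig.find("label")),
            "caption": clean(fig.find("caption")),
            # Assumption: this value matches an image file name in the figures download.
            "graphic_ref": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures
```

The `graphic_ref` value is what you would use to look up the matching image file, although how the file names line up in practice should be checked against the downloaded package.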

For PDFs, I wrote a Python wrapper for the pdffigures2 library from Semantic Scholar in scipdf_parser, so you can extract figures directly from the PDF. For paragraphs and captions, to my knowledge, you still need GROBID to parse that information out.

I think Semantic Scholar might have some subsets of parsed PDFs in JSON format openly available in their corpus.

Hope this helps! Maybe someone will share more information on this issue.

titipata / pubmed_parser

Question about Original XML #96