The scripts/code used to match the PDFMiner outputs on documents to the XML representations

abirami005 commented 4 years ago

Do you provide the scripts/code that you developed to match the PDFMiner outputs on the documents to the XML representation of the PDF page itself? Thanks

zhxgj commented 4 years ago

We cannot open source the code at the moment as it is related to our IP protection.

bertsky commented 4 years ago

> We cannot open source the code at the moment as it is related to our IP protection.

Then how about publishing the alignment data themselves in some form?

zhxgj commented 4 years ago

> > We cannot open source the code at the moment as it is related to our IP protection.
>
> Then how about publishing the alignment data themselves in some form?

Em, I did not think of it before. Let me have a check along our legal approval chain.

pollyMath commented 4 years ago

I assume this means that providing only the code for extracting annotations from XML representation is also not possible at the moment?

zhxgj commented 4 years ago

@pollyMath Unfortunately that is what our IP lawyer told us.

bertsky commented 3 years ago

> > > We cannot open source the code at the moment as it is related to our IP protection.
> >
> > Then how about publishing the alignment data themselves in some form?
>
> Em, I did not think of it before. Let me have a check along our legal approval chain.

@zhxgj Did your lawyers reach a verdict regarding the publication of PDF/XML alignment data?

Note: This is relevant to a number of potential applications of this corpus, for which some choices made in the COCO format would be incompatible or suboptimal, e.g.

- definition/granularity of region classes
- not annotating headers and footers
- not including reading order of regions
- not including text lines (contours / baselines)
- not including text content (plain) and text style (formatting)
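
To make the points above concrete, here is a minimal sketch using a hand-written record in the COCO detection layout. It is not actual PubLayNet data: the file name, coordinates, and IDs are made up, the category list is assumed to mirror the five PubLayNet region classes, and the field names in `WANTED` are hypothetical labels for the information listed above, not part of any existing format.

```python
# Hand-written, illustrative record in COCO detection style (values are made up).
sample = {
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    # Assumed to mirror the five PubLayNet region classes.
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,
            "category_id": 1,
            "bbox": [50.0, 80.0, 500.0, 120.0],  # x, y, width, height
            "area": 60000.0,
            "iscrowd": 0,
            "segmentation": [[50.0, 80.0, 550.0, 80.0, 550.0, 200.0, 50.0, 200.0]],
        },
    ],
}

# Hypothetical field names for the information the list above asks for.
WANTED = ("reading_order", "text_lines", "text", "style")


def missing_fields(annotation, wanted=WANTED):
    """Return the wanted keys that a region annotation does not carry."""
    return [key for key in wanted if key not in annotation]


def region_classes(dataset):
    """Return the sorted region class names defined by the dataset."""
    return sorted(category["name"] for category in dataset["categories"])


if __name__ == "__main__":
    print("classes:", region_classes(sample))
    print("missing:", missing_fields(sample["annotations"][0]))
```

A region in this layout carries only a class, a box, and a polygon, so the class granularity is fixed by `categories` and anything such as reading order or text content would have to come from the PDF/XML alignment data.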

ajjimeno commented 3 years ago

Unfortunately not yet. I understand the benefits, but we cannot release it yet. Thanks for your understanding.

On Tue, Jan 12, 2021 at 3:49 AM Robert Sachunsky wrote:

> @zhxgj Did your lawyers reach a verdict regarding the publication of PDF/XML alignment data?

ibm-aur-nlp / PubLayNet, issue #20