petermr / ami3

Integration of cephis and normami code into a single base. Tests will be slimmed down.
Apache License 2.0

Adding PDFs to existing project #10

Open petermr opened 4 years ago

petermr commented 4 years ago

How to add more PDFs to an existing project.

petermr commented 4 years ago

OMAR> Lastly, I tried adding 3 more scholarly articles, but when I ran `ami-pdf -p test` in the omar directory it returned the following:

```
Generic values (AMIPDFTool)
================================
-v to see generic values
oldstyle            true

Specific values (AMIPDFTool)
================================
maxpages            5
svgDirectoryName    svg/
outputSVG           true
imgDirectoryName    pdfimages/
outputPDFImages     true
AMIPDFTool cTree: He_Deep_Residual_Learning_CVPR_2016_paper
cTree: He_Deep_Residual_Learning_CVPR_2016_paper
make skipped AMIPDFTool cTree: Scalable_Nearest_Neighbor_Algorithms
cTree: Scalable_Nearest_Neighbor_Algorithms
make skipped AMIPDFTool cTree: Simultaneous_Detection_and_Segmentation
cTree: Simultaneous_Detection_and_Segmentation
make skipped AMIPDFTool cTree: lichtenburg19a
cTree: lichtenburg19a
make skipped
```

...and it didn't do anything. (I'm guessing it skipped them all; I might have placed them in the wrong way.) I just pushed them onto GitHub, so when you have some time can you check whether I placed them correctly, and tell me if I should post this onto issues?

petermr commented 4 years ago

Note

Please use 3 backticks to quote machine output (this formats it in monospace).

ami-makeproject

This is used to create a project from raw files (PDF, HTML, XML), mainly PDF.
Did you create the fulltext.pdf using ami-makeproject? If so, fine.
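If it helps to check the layout, the result of `ami-makeproject` can be mimicked by hand: each raw PDF becomes its own cTree directory containing `fulltext.pdf` (this matches the `tree` listing later in the thread). A minimal sketch only; the `rawpdfs/` directory and the `example_paper.pdf` demo file are hypothetical stand-ins, and the real tool also handles HTML and XML:

```shell
# Demo input (stand-in for your real PDFs):
mkdir -p rawpdfs
printf 'dummy' > rawpdfs/example_paper.pdf

# Mimic the cTree layout that ami-makeproject creates:
# one directory per article, each holding fulltext.pdf.
mkdir -p test
for pdf in rawpdfs/*.pdf; do
  name=$(basename "$pdf" .pdf)
  mkdir -p "test/$name"
  cp "$pdf" "test/$name/fulltext.pdf"
done
```

If the files under `test/` don't follow this shape, `ami-pdf` has nothing it recognises to work on.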

ami-pdf

This iterates over a project and runs the PDF conversion on each file. My output is

```
pm286macbook:omar pm286$ ami-pdf -p test

Generic values (AMIPDFTool)
================================
-v to see generic values
oldstyle            true

Specific values (AMIPDFTool)
================================
maxpages            5
svgDirectoryName    svg/
outputSVG           true
imgDirectoryName    pdfimages/
outputPDFImages     true
AMIPDFTool cTree: He_Deep_Residual_Learning_CVPR_2016_paper
cTree: He_Deep_Residual_Learning_CVPR_2016_paper
 max pages: 5 0 
pages include: [0, 1, 2, 3, 4]
[1][2]0    [main] WARN  org.apache.pdfbox.pdmodel.font.PDType0Font  - No Unicode mapping for CID+1 (1) in font TRGNUN+MinionPro-Regular
0 [main] WARN org.apache.pdfbox.pdmodel.font.PDType0Font  - No Unicode mapping for CID+1 (1) in font TRGNUN+MinionPro-Regular
??[3][4][5]????????????????1461 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 155551
1461 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 155551
 5 
pages include: [5, 6, 7, 8, 9]
[6][7][8][9]AMIPDFTool cTree: Scalable_Nearest_Neighbor_Algorithms
cTree: Scalable_Nearest_Neighbor_Algorithms
 max pages: 5 0 
pages include: [0, 1, 2, 3, 4]
[1]???[2]????????????[3][4][5][.0]?6229 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 168957
6229 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 168957
6718 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 183889
6718 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 183889
 img  5 
pages include: [5, 6, 7, 8, 9]
[6][7][8]?[.0][9]?[10]9094 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 172079
9094 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 172079
10539 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 116308
10539 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 116308
11379 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 180782
11379 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 180782
 img  10 
pages include: [10, 11, 12, 13, 14]
[11][12][13]?[14][.0][.1]13843 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 114342
13843 [main] DEBUG org.contentmine.graphics.svg.SVGPath  - skipped long DString: 114342
 img  img AMIPDFTool cTree: Simultaneous_Detection_and_Segmentation
cTree: Simultaneous_Detection_and_Segmentation
 max pages: 5 0 
pages include: [0, 1, 2, 3, 4]
[1][2][3][.0][.1][.2][.3][.4][.5][.6][.7][4][5] img  img  img  img  img  img  img  img  5 
pages include: [5, 6, 7, 8, 9]
[6][7][.0][.1][.2][8][.0][.1][.2][.3][.4][.5][.6][.7][.8][.9][.10][.11][.12][.13][.14][.15][.16][.17][9][10] img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  10 
pages include: [10, 11, 12, 13, 14]
[11][12][13][14][.0][.1][.2][.3][.4][.5][.6][.7][.8][.9][.10][.11][.12][.13][.14][.15][.16][.17][15]23112 [main] DEBUG org.contentmine.graphics.svg.path.SVGPathParser  - longParse: 5; d 83161
23112 [main] DEBUG org.contentmine.graphics.svg.path.SVGPathParser  - longParse: 5; d 83161
 img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  img  15 
pages include: [15, 16, 17, 18, 19]
[16]AMIPDFTool cTree: lichtenburg19a
cTree: lichtenburg19a
pm286macbook:omar pm286$ 
```

It looks fine to me. Did it not do this for you?

I have surveyed it with the `tree` command:

```
pm286macbook:omar pm286$ tree test
test
├── He_Deep_Residual_Learning_CVPR_2016_paper
│   ├── fulltext.pdf
│   ├── pdfimages
│   └── svg
│       ├── fulltext-page.0.svg
│       ├── fulltext-page.1.svg
│       ├── fulltext-page.2.svg
│       ├── fulltext-page.3.svg
│       ├── fulltext-page.4.svg
│       ├── fulltext-page.5.svg
│       ├── fulltext-page.6.svg
│       ├── fulltext-page.7.svg
│       └── fulltext-page.8.svg
├── Scalable_Nearest_Neighbor_Algorithms
│   ├── fulltext.pdf
│   ├── pdfimages
│   │   ├── image.14.1.289_361.73_163.png
│   │   ├── image.14.2.289_361.206_296.png
│   │   ├── image.5.1.34_533.80_243.png
│   │   └── image.8.1.34_533.80_173.png
│   └── svg
│       ├── fulltext-page.0.svg
│       ├── fulltext-page.1.svg
│       ├── fulltext-page.10.svg
│       ├── fulltext-page.11.svg
│       ├── fulltext-page.12.svg
│       ├── fulltext-page.13.svg
│       ├── fulltext-page.2.svg
│       ├── fulltext-page.3.svg
│       ├── fulltext-page.4.svg
│       ├── fulltext-page.5.svg
│       ├── fulltext-page.6.svg
│       ├── fulltext-page.7.svg
│       ├── fulltext-page.8.svg
│       └── fulltext-page.9.svg
├── Simultaneous_Detection_and_Segmentation
│   ├── fulltext.pdf
│   ├── pdfimages
│   │   ├── image.14.1.92_121.543_582.png
│   │   ├── image.14.10.208_257.583_620.png
│   │   ├── image.14.11.260_308.583_620.png
│   │   ├── image.14.12.311_360.583_620.png
│   │   ├── image.14.13.54_102.625_657.png
│   │   ├── image.14.14.105_154.625_657.png
│   │   ├── image.14.15.157_205.621_657.png
│   │   ├── image.14.16.208_257.621_657.png
│   │   ├── image.14.17.260_308.625_657.png
│   │   ├── image.14.18.311_360.625_657.png
│   │   ├── image.14.2.125_154.543_582.png
│   │   ├── image.14.3.157_205.546_582.png
│   │   ├── image.14.4.208_257.546_582.png
│   │   ├── image.14.5.260_289.543_582.png
│   │   ├── image.14.6.292_322.543_582.png
│   │   ├── image.14.7.54_102.583_620.png
│   │   ├── image.14.8.105_154.583_620.png
│   │   ├── image.14.9.157_205.583_620.png
│   │   ├── image.3.1.40_89.554_588.png
│   │   ├── image.3.2.104_148.538_560.png
│   │   ├── image.3.3.104_148.560_582.png
│   │   ├── image.3.4.104_148.582_604.png
│   │   ├── image.3.5.162_188.530_556.png
│   │   ├── image.3.6.162_188.579_605.png
│   │   ├── image.3.7.275_320.556_586.png
│   │   ├── image.3.8.334_379.556_586.png
│   │   ├── image.7.1.70_117.249_280.png
│   │   ├── image.7.2.125_162.222_259.png
│   │   ├── image.7.3.126_163.271_308.png
│   │   ├── image.8.1.34_86.219_258.png
│   │   ├── image.8.10.199_251.259_298.png
│   │   ├── image.8.11.254_306.259_298.png
│   │   ├── image.8.12.309_361.259_298.png
│   │   ├── image.8.13.34_86.299_338.png
│   │   ├── image.8.14.89_141.299_338.png
│   │   ├── image.8.15.144_196.299_338.png
│   │   ├── image.8.16.199_251.299_338.png
│   │   ├── image.8.17.254_306.299_338.png
│   │   ├── image.8.18.309_361.299_338.png
│   │   ├── image.8.2.89_141.219_258.png
│   │   ├── image.8.3.144_196.219_258.png
│   │   ├── image.8.4.199_251.219_258.png
│   │   ├── image.8.5.254_306.219_258.png
│   │   ├── image.8.6.309_361.219_258.png
│   │   ├── image.8.7.34_86.259_298.png
│   │   ├── image.8.8.89_141.259_298.png
│   │   └── image.8.9.144_196.259_298.png
│   └── svg
│       ├── fulltext-page.0.svg
│       ├── fulltext-page.1.svg
│       ├── fulltext-page.10.svg
│       ├── fulltext-page.11.svg
│       ├── fulltext-page.12.svg
│       ├── fulltext-page.13.svg
│       ├── fulltext-page.14.svg
│       ├── fulltext-page.15.svg
│       ├── fulltext-page.2.svg
│       ├── fulltext-page.3.svg
│       ├── fulltext-page.4.svg
│       ├── fulltext-page.5.svg
│       ├── fulltext-page.6.svg
│       ├── fulltext-page.7.svg
│       ├── fulltext-page.8.svg
│       └── fulltext-page.9.svg
├── lichtenburg19a
│   ├── fulltext.pdf
│   ├── fulltext.png
│   └── svg
│       ├── fulltext-page.0.svg
│       ├── fulltext-page.1.svg
│       ├── fulltext-page.2.svg
│       ├── fulltext-page.3.svg
│       ├── fulltext-page.4.svg
│       ├── fulltext-page.5.svg
│       ├── fulltext-page.6.svg
│       ├── fulltext-page.7.svg
│       ├── fulltext-page.8.svg
│       ├── fulltext-page.9.svg
│       ├── page1.graph1.svg
│       ├── page1.graph2.svg
│       ├── page1.graph3.svg
│       └── page1.graphs.svg
```

anjackson commented 4 years ago

Thanks @petermr this worked for me, but I can't work out how to get from this to having some scholarly.html text to work with. Is the OCR route the only route? Can I go from SVG to HTML?

petermr commented 4 years ago

Great that it worked. What are you trying to do? The SVG contains the text in character-by-character form. But the best approach may be ami-grobid, which runs the Grobid package.

And if you are able to help with any documentation, that would be great. It's sparse in places.

anjackson commented 4 years ago

I'm trying to find theses and articles that refer to Coronavirus etc. by running ami-search over them with the openVirus dictionaries. I have fulltext.pdf files, but I believe I need scholarly.html files to run ami-search (it says it can't find any text).

I ran ami-pdf fine, and got the SVG and images (as above), and I've also tried using ami-ocr to process the images. But neither of these has left me with any scholarly.html files.

I'll try ami-grobid next, and yes, I'll write up the process I'm following.

anjackson commented 4 years ago

Managed to get Grobid running and it generated some TEI:

```
# ls -lct ami-test/example2/tei/
-rw-r--r--. 1 root root 904616 Mar 26 12:38 fulltext.tei.html
-rw-r--r--. 1 root root 990532 Mar 26 12:36 fulltext.tei.xml
drwxr-xr-x. 2 root root   4096 Mar 26 12:35 fulltext_assets
```
...but ami-search still says no words found to extract.

petermr commented 4 years ago

Andy, that's really great!

I have always wanted to use dissertations. As you are probably aware, every university has a different format.

I used to have an SVG2HTML tool. If you are doing a lot of this then I'd see what can be resurrected. Does the tei.html display OK? I'd guess that renaming fulltext.tei.html to scholarly.html might work, though not as a production tool. P.
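Applied across a whole project, the suggested rename can be scripted with a short loop. This is a sketch, not a production tool; the `test/example_paper` cTree and its demo TEI file below are hypothetical stand-ins for real Grobid output:

```shell
# Demo cTree with a Grobid TEI-HTML result (stand-in for real output):
mkdir -p test/example_paper/tei
printf '<html><body>demo</body></html>' > test/example_paper/tei/fulltext.tei.html

# Copy each cTree's TEI-HTML into place as scholarly.html,
# which is the file name ami-search looks for.
for d in test/*/; do
  if [ -f "${d}tei/fulltext.tei.html" ]; then
    cp "${d}tei/fulltext.tei.html" "${d}scholarly.html"
  fi
done
```

Since ami-search looks for scholarly.html in each cTree, after the copy it should find text to index.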


-- Peter Murray-Rust, Founder ContentMine.org and Reader Emeritus in Molecular Informatics, Dept. of Chemistry, University of Cambridge, CB2 1EW, UK

anjackson commented 4 years ago

Thanks, that's very helpful! I get meaningful results after doing a `cp tei/fulltext.tei.html scholarly.html`.

I'm happy trying things out and handling conversions. The bit I'm struggling with is the overall workflow stages and any constraints on how the data is supposed to be laid out. Example questions:

I realise the answers might be 'it depends' because I know you use this toolset/approach for a range of different fact extraction workflows. So my apologies if off track.

petermr commented 4 years ago

Thanks, these are wonderful questions. No need to apologize. AMI has been built in an ad hoc fashion from a variety of projects and is often unfinished. So I tested that Grobid worked, but at that stage no one was interested in using it. I'll hack ami-grobid to output scholarly.html, hoping that everything will move to using something similar to JATS (which has a subset of HTML).

> Thanks, that's very helpful! I get meaningful results after doing a `cp tei/fulltext.tei.html scholarly.html`.
>
> I'm happy trying things out and handling conversions.

Great. A lot of this is in the system but not well documented one place to start is my slides https://www.slideshare.net/petermurrayrust/text-and-data-mining-explained-at-ftdm and a lot of other similar ones

> The bit I'm struggling with is the overall workflow stages and any constraints on how the data is supposed to be laid out. Example questions:

> • The OCR/Grobid workflows appear to target the generation of scholarly.html, but the current AMI3 project requires me to pick one workflow and copy the result into place as scholarly.html, right?

I will try to hack Grobid tonight.

> • Am I right to think that ami-search works happily with plain text fulltext.pdf.txt or HTML scholarly.html, but nothing else?

I think so

> • Will any reasonable plain text or HTML representation suffice? (e.g. at first I thought scholarly.html was referring to this proposal https://w3c.github.io/scholarly-html/ but I think not?)

Any HTML will work as long as it's well-formed, I think. Scholarly HTML hasn't really taken off, and I think the HTML subset in JATS will work.
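To illustrate "any well-formed HTML": a minimal hand-written scholarly.html could be as small as the following sketch (the content is hypothetical; the point is only the well-formed structure with searchable body text):

```shell
# Write a minimal, well-formed scholarly.html; the search step
# only needs parseable HTML with text in the body.
cat > scholarly.html <<'EOF'
<html>
  <head><title>Demo article</title></head>
  <body>
    <p>Coronavirus transmission in bats.</p>
  </body>
</html>
EOF
```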

> • It seems ami-search generates plain XML - how does this relate to generating WikiData content?

AMI search uses Wikidata. It doesn't currently generate Wikidata.

> • I think ami-search takes keywords relating to specific concepts (dictionaries) and records where they appear in a set of texts, right?

Yes, absolutely.
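As a conceptual sketch only (this is not how ami-search is implemented; it uses structured dictionaries and writes XML results), that matching step amounts to counting dictionary terms in each cTree's scholarly.html. The terms and the demo cTree below are hypothetical:

```shell
# Toy stand-in corpus:
mkdir -p test/example_paper
printf '<p>Coronavirus and more coronavirus text.</p>' > test/example_paper/scholarly.html

# Crude sketch of dictionary matching: count each term's
# occurrences in every scholarly.html under the project.
for term in coronavirus virus; do
  for f in test/*/scholarly.html; do
    [ -e "$f" ] || continue
    n=$(grep -oi "$term" "$f" | wc -l)
    echo "$f: $term x $((n))"
  done
done
```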

> I realise the answers might be 'it depends' because I know you use this toolset/approach for a range of different fact extraction workflows. So my apologies if off track.

Anything you can do to document how you are using AMI, and whether it works, will be very useful.


petermr commented 4 years ago

It's possible to go from SVG to HTML, but the result is fragmented and sometimes garbled. We also use Grobid.

I have invited you to Slack.