petermr opened this issue 4 years ago
OMAR> Lastly, I tried adding 3 more scholarly articles, but when I ran `ami-pdf -p test` in the `omar` directory it returned the following:
```
Generic values (AMIPDFTool)
================================
-v to see generic values
oldstyle true
Specific values (AMIPDFTool)
================================
maxpages 5
svgDirectoryName svg/
outputSVG true
imgDirectoryName pdfimages/
outputPDFImages true
AMIPDFTool cTree: He_Deep_Residual_Learning_CVPR_2016_paper
cTree: He_Deep_Residual_Learning_CVPR_2016_paper
make skipped AMIPDFTool cTree: Scalable_Nearest_Neighbor_Algorithms
cTree: Scalable_Nearest_Neighbor_Algorithms
make skipped AMIPDFTool cTree: Simultaneous_Detection_and_Segmentation
cTree: Simultaneous_Detection_and_Segmentation
make skipped AMIPDFTool cTree: lichtenburg19a
cTree: lichtenburg19a
make skipped
```
It didn't do anything (I'm guessing it skipped them all, so I might have placed them the wrong way). I just pushed them to GitHub, so when you have some time can you check whether I placed them correctly and tell me if I should post this to issues?
Please use 3 backticks to quote machine output (this formats it in monospace).
`ami-makeproject` is used to create a project from raw files (PDF, HTML, XML) - mainly PDF. Did you create the `fulltext.pdf` files using `ami-makeproject`? If so, fine.
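For reference, a minimal Python sketch of the layout `ami-makeproject` produces: one cTree directory per raw PDF, each holding the file renamed to `fulltext.pdf`. The function here is mine, purely illustrative - `ami-makeproject` does this (and more) for you.

```python
from pathlib import Path
import shutil

def make_project(project_dir, pdf_paths):
    """Mimic the cProject layout ami-makeproject creates: one cTree
    directory per raw PDF, each holding the PDF renamed to fulltext.pdf.
    Returns the sorted cTree names for inspection."""
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    for pdf in map(Path, pdf_paths):
        ctree = project / pdf.stem          # directory named after the paper
        ctree.mkdir(exist_ok=True)
        shutil.copy(pdf, ctree / "fulltext.pdf")
    return sorted(p.name for p in project.iterdir())
```

`ami-pdf -p test` then iterates over exactly this layout, one cTree at a time.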
`ami-pdf` iterates over a project and runs the PDF conversion on each file. My output is:
```
pm286macbook:omar pm286$ ami-pdf -p test
Generic values (AMIPDFTool)
================================
-v to see generic values
oldstyle true
Specific values (AMIPDFTool)
================================
maxpages 5
svgDirectoryName svg/
outputSVG true
imgDirectoryName pdfimages/
outputPDFImages true
AMIPDFTool cTree: He_Deep_Residual_Learning_CVPR_2016_paper
cTree: He_Deep_Residual_Learning_CVPR_2016_paper
max pages: 5 0
pages include: [0, 1, 2, 3, 4]
[1][2]0 [main] WARN org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+1 (1) in font TRGNUN+MinionPro-Regular
0 [main] WARN org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+1 (1) in font TRGNUN+MinionPro-Regular
??[3][4][5]????????????????1461 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 155551
1461 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 155551
5
pages include: [5, 6, 7, 8, 9]
[6][7][8][9]AMIPDFTool cTree: Scalable_Nearest_Neighbor_Algorithms
cTree: Scalable_Nearest_Neighbor_Algorithms
max pages: 5 0
pages include: [0, 1, 2, 3, 4]
[1]???[2]????????????[3][4][5][.0]?6229 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 168957
6229 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 168957
6718 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 183889
6718 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 183889
img 5
pages include: [5, 6, 7, 8, 9]
[6][7][8]?[.0][9]?[10]9094 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 172079
9094 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 172079
10539 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 116308
10539 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 116308
11379 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 180782
11379 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 180782
img 10
pages include: [10, 11, 12, 13, 14]
[11][12][13]?[14][.0][.1]13843 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 114342
13843 [main] DEBUG org.contentmine.graphics.svg.SVGPath - skipped long DString: 114342
img img AMIPDFTool cTree: Simultaneous_Detection_and_Segmentation
cTree: Simultaneous_Detection_and_Segmentation
max pages: 5 0
pages include: [0, 1, 2, 3, 4]
[1][2][3][.0][.1][.2][.3][.4][.5][.6][.7][4][5] img img img img img img img img 5
pages include: [5, 6, 7, 8, 9]
[6][7][.0][.1][.2][8][.0][.1][.2][.3][.4][.5][.6][.7][.8][.9][.10][.11][.12][.13][.14][.15][.16][.17][9][10] img img img img img img img img img img img img img img img img img img img img img 10
pages include: [10, 11, 12, 13, 14]
[11][12][13][14][.0][.1][.2][.3][.4][.5][.6][.7][.8][.9][.10][.11][.12][.13][.14][.15][.16][.17][15]23112 [main] DEBUG org.contentmine.graphics.svg.path.SVGPathParser - longParse: 5; d 83161
23112 [main] DEBUG org.contentmine.graphics.svg.path.SVGPathParser - longParse: 5; d 83161
img img img img img img img img img img img img img img img img img img 15
pages include: [15, 16, 17, 18, 19]
[16]AMIPDFTool cTree: lichtenburg19a
cTree: lichtenburg19a
pm286macbook:omar pm286$
```
It looks fine to me. Did it not do this for you?
I have surveyed it with `tree`:
```
pm286macbook:omar pm286$ tree test
test
├── He_Deep_Residual_Learning_CVPR_2016_paper
│ ├── fulltext.pdf
│ ├── pdfimages
│ └── svg
│ ├── fulltext-page.0.svg
│ ├── fulltext-page.1.svg
│ ├── fulltext-page.2.svg
│ ├── fulltext-page.3.svg
│ ├── fulltext-page.4.svg
│ ├── fulltext-page.5.svg
│ ├── fulltext-page.6.svg
│ ├── fulltext-page.7.svg
│ └── fulltext-page.8.svg
├── Scalable_Nearest_Neighbor_Algorithms
│ ├── fulltext.pdf
│ ├── pdfimages
│ │ ├── image.14.1.289_361.73_163.png
│ │ ├── image.14.2.289_361.206_296.png
│ │ ├── image.5.1.34_533.80_243.png
│ │ └── image.8.1.34_533.80_173.png
│ └── svg
│ ├── fulltext-page.0.svg
│ ├── fulltext-page.1.svg
│ ├── fulltext-page.10.svg
│ ├── fulltext-page.11.svg
│ ├── fulltext-page.12.svg
│ ├── fulltext-page.13.svg
│ ├── fulltext-page.2.svg
│ ├── fulltext-page.3.svg
│ ├── fulltext-page.4.svg
│ ├── fulltext-page.5.svg
│ ├── fulltext-page.6.svg
│ ├── fulltext-page.7.svg
│ ├── fulltext-page.8.svg
│ └── fulltext-page.9.svg
├── Simultaneous_Detection_and_Segmentation
│ ├── fulltext.pdf
│ ├── pdfimages
│ │ ├── image.14.1.92_121.543_582.png
│ │ ├── image.14.10.208_257.583_620.png
│ │ ├── image.14.11.260_308.583_620.png
│ │ ├── image.14.12.311_360.583_620.png
│ │ ├── image.14.13.54_102.625_657.png
│ │ ├── image.14.14.105_154.625_657.png
│ │ ├── image.14.15.157_205.621_657.png
│ │ ├── image.14.16.208_257.621_657.png
│ │ ├── image.14.17.260_308.625_657.png
│ │ ├── image.14.18.311_360.625_657.png
│ │ ├── image.14.2.125_154.543_582.png
│ │ ├── image.14.3.157_205.546_582.png
│ │ ├── image.14.4.208_257.546_582.png
│ │ ├── image.14.5.260_289.543_582.png
│ │ ├── image.14.6.292_322.543_582.png
│ │ ├── image.14.7.54_102.583_620.png
│ │ ├── image.14.8.105_154.583_620.png
│ │ ├── image.14.9.157_205.583_620.png
│ │ ├── image.3.1.40_89.554_588.png
│ │ ├── image.3.2.104_148.538_560.png
│ │ ├── image.3.3.104_148.560_582.png
│ │ ├── image.3.4.104_148.582_604.png
│ │ ├── image.3.5.162_188.530_556.png
│ │ ├── image.3.6.162_188.579_605.png
│ │ ├── image.3.7.275_320.556_586.png
│ │ ├── image.3.8.334_379.556_586.png
│ │ ├── image.7.1.70_117.249_280.png
│ │ ├── image.7.2.125_162.222_259.png
│ │ ├── image.7.3.126_163.271_308.png
│ │ ├── image.8.1.34_86.219_258.png
│ │ ├── image.8.10.199_251.259_298.png
│ │ ├── image.8.11.254_306.259_298.png
│ │ ├── image.8.12.309_361.259_298.png
│ │ ├── image.8.13.34_86.299_338.png
│ │ ├── image.8.14.89_141.299_338.png
│ │ ├── image.8.15.144_196.299_338.png
│ │ ├── image.8.16.199_251.299_338.png
│ │ ├── image.8.17.254_306.299_338.png
│ │ ├── image.8.18.309_361.299_338.png
│ │ ├── image.8.2.89_141.219_258.png
│ │ ├── image.8.3.144_196.219_258.png
│ │ ├── image.8.4.199_251.219_258.png
│ │ ├── image.8.5.254_306.219_258.png
│ │ ├── image.8.6.309_361.219_258.png
│ │ ├── image.8.7.34_86.259_298.png
│ │ ├── image.8.8.89_141.259_298.png
│ │ └── image.8.9.144_196.259_298.png
│ └── svg
│ ├── fulltext-page.0.svg
│ ├── fulltext-page.1.svg
│ ├── fulltext-page.10.svg
│ ├── fulltext-page.11.svg
│ ├── fulltext-page.12.svg
│ ├── fulltext-page.13.svg
│ ├── fulltext-page.14.svg
│ ├── fulltext-page.15.svg
│ ├── fulltext-page.2.svg
│ ├── fulltext-page.3.svg
│ ├── fulltext-page.4.svg
│ ├── fulltext-page.5.svg
│ ├── fulltext-page.6.svg
│ ├── fulltext-page.7.svg
│ ├── fulltext-page.8.svg
│ └── fulltext-page.9.svg
├── lichtenburg19a
│ ├── fulltext.pdf
│ ├── fulltext.png
│ └── svg
│ ├── fulltext-page.0.svg
│ ├── fulltext-page.1.svg
│ ├── fulltext-page.2.svg
│ ├── fulltext-page.3.svg
│ ├── fulltext-page.4.svg
│ ├── fulltext-page.5.svg
│ ├── fulltext-page.6.svg
│ ├── fulltext-page.7.svg
│ ├── fulltext-page.8.svg
│ ├── fulltext-page.9.svg
│ ├── page1.graph1.svg
│ ├── page1.graph2.svg
│ ├── page1.graph3.svg
│ └── page1.graphs.svg
```
Thanks @petermr, this worked for me, but I can't work out how to get from this to having some `scholarly.html` text to work with. Is the OCR route the only route? Can I go from SVG to HTML?
Great that it worked. What are you trying to do? The SVG contains the text in character-by-character form, but the best approach may be `ami-grobid`, which runs the Grobid package. And if you are able to help with any documentation, that would be great. It's sparse in places.
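To illustrate the "character-by-character" point, here is a rough Python sketch that pulls the raw characters out of an SVG page with the standard library. The toy page below is my own stand-in, not real ami output - the real pages carry coordinates and styling on each element, so reconstructing reading order and word boundaries takes much more work than this.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def svg_text(svg_string):
    """Concatenate the character content of all <text> elements in an
    SVG page string, in document order."""
    root = ET.fromstring(svg_string)
    return "".join(el.text or "" for el in root.iter(SVG_NS + "text"))

# toy stand-in for a fulltext-page.N.svg file: one glyph per element
page = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text x="10" y="20">H</text><text x="16" y="20">i</text>'
    '</svg>'
)
```

This is why going from SVG straight to readable HTML is fragile: the characters are all there, but the layout has to be inferred.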
I'm trying to find theses and articles that refer to Coronavirus etc. by running `ami-search` over them with the openVirus dictionaries. I have `fulltext.pdf` files, but I believe I need `scholarly.html` files to run `ami-search` (it says it can't find any text).

I ran `ami-pdf` fine, and got the SVG and images (as above), and I've also tried using `ami-ocr` to process the images. But neither of these has left me with any `scholarly.html` files.

I'll try `ami-grobid` next, and yes, I'll write up the process I'm following.
Managed to get `grobid` running and it generated some TEI:

```
# ls -lct ami-test/example2/tei/
-rw-r--r--. 1 root root 904616 Mar 26 12:38 fulltext.tei.html
-rw-r--r--. 1 root root 990532 Mar 26 12:36 fulltext.tei.xml
drwxr-xr-x. 2 root root 4096 Mar 26 12:35 fulltext_assets
```

...but `ami-search` still says `no words found to extract`.
Andy, That's really great!!
I have always wanted to use dissertations. As you are probably aware, every university has a different format.
I used to have an SVG2HTML tool. If you are doing a lot of this then I'd see what can be resurrected. Does the tei.html display OK? I'd guess that renaming `fulltext.tei.html` to `scholarly.html` might work. Not as a production tool. P.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Thanks, that's very helpful! I get meaningful results after doing a `cp tei/fulltext.tei.html scholarly.html`.
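For a whole project that copy can be scripted; a sketch, assuming each cTree has Grobid output under `tei/` as in the listing above (the demo layout created here is fabricated purely for illustration):

```shell
# demo layout: one cTree with Grobid TEI output (stand-in for a real project)
mkdir -p test/lichtenburg19a/tei
echo '<html></html>' > test/lichtenburg19a/tei/fulltext.tei.html

# copy each tree's TEI HTML into place as scholarly.html
for ctree in test/*/; do
  if [ -f "${ctree}tei/fulltext.tei.html" ]; then
    cp "${ctree}tei/fulltext.tei.html" "${ctree}scholarly.html"
  fi
done
```

Trees without TEI output are simply skipped, so it is safe to rerun.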
I'm happy trying things out and handling conversions. The bit I'm struggling with is the overall workflow stages and any constraints on how the data is supposed to be. Example questions:

- The OCR/Grobid workflows appear to target the generation of `scholarly.html`, but the current AMI3 project requires me to pick one workflow and copy the result into place as `scholarly.html`, right?
- Am I right to think that `ami-search` works happily with plain text `fulltext.pdf.txt` or HTML `scholarly.html`, but nothing else?
- Will any reasonable plain text or HTML representation suffice? (e.g. at first I thought `scholarly.html` was referring to this proposal https://w3c.github.io/scholarly-html/ but I think not?)
- It seems `ami-search` generates plain XML - how does this relate to generating WikiData content?
- I think `ami-search` takes keywords relating to specific concepts (dictionaries) and records where they appear in a set of texts, right?

I realise the answers might be 'it depends' because I know you use this toolset/approach for a range of different fact extraction workflows. So my apologies if off track.
Thanks - these are wonderful questions. No need to apologize. AMI has been built in an ad hoc fashion from a variety of projects and often not finished. So I tested that GROBID worked, but at that stage no one was interested in using it. I'll hack `ami-grobid` to output `scholarly.html`. Hoping that everything will move to using something similar to JATS (which has a subset of HTML).
On Thu, Mar 26, 2020 at 3:42 PM Andy Jackson notifications@github.com wrote:

> Thanks, that's very helpful! I get meaningful results after doing a `cp tei/fulltext.tei.html scholarly.html`.
> I'm happy trying things out and handling conversions.

Great. A lot of this is in the system but not well documented. One place to start is my slides https://www.slideshare.net/petermurrayrust/text-and-data-mining-explained-at-ftdm and a lot of other similar ones.
> The bit I'm struggling with is the overall workflow stages and any constraints on how the data is supposed to be. Example questions:
> - The OCR/Grobid workflows appear to target the generation of scholarly.html but the current AMI3 project requires me to pick one workflow and copy the result into place as scholarly.html, right?

I will try to hack grobid tonight.
> - Am I right to think that ami-search works happily with plain text fulltext.pdf.txt or HTML scholarly.html, but nothing else?

I think so.
> - Will any reasonable plain text or HTML representation suffice? (e.g. at first I thought scholarly.html was referring to this proposal https://w3c.github.io/scholarly-html/ but I think not?)

Any HTML will work as long as it's well-formed - I think. Scholarly HTML hasn't really taken off and I think the HTML subset in JATS will work.
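For illustration, a minimal well-formed file of the kind that should satisfy that constraint - my guess at a skeleton based on the discussion above, not a verified template:

```html
<html>
  <head>
    <title>Scalable_Nearest_Neighbor_Algorithms</title>
  </head>
  <body>
    <h1>Scalable Nearest Neighbor Algorithms</h1>
    <p>Plain paragraphs of the paper text go here.</p>
  </body>
</html>
```

The key requirement is well-formedness (every tag closed), not any particular vocabulary.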
> - It seems ami-search generates plain XML - how does this relate to generating WikiData content?

AMI search uses Wikidata. It doesn't currently generate Wikidata.
> - I think ami-search takes keywords relating to specific concepts (dictionaries) and records where they appear in a set of texts, right?

Yes, absolutely.
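The idea in that last question can be sketched in a few lines of Python. This is only the concept - case-insensitive whole-word matching of dictionary terms against plain text - not how `ami-search` is implemented, which works from XML dictionaries and `scholarly.html`:

```python
import re

def search_texts(dictionary, texts):
    """Record, per document, which dictionary terms occur as
    case-insensitive whole words. Returns {doc_name: [terms found]},
    omitting documents with no hits."""
    hits = {}
    for name, text in texts.items():
        found = [term for term in dictionary
                 if re.search(r"\b" + re.escape(term) + r"\b", text, re.I)]
        if found:
            hits[name] = found
    return hits

# toy dictionary and corpus (fabricated for illustration)
terms = ["coronavirus", "epidemiology"]
docs = {"thesis1": "A study of Coronavirus transmission.",
        "thesis2": "Deep residual learning for images."}
```

The real tool additionally records where in each document the terms appear, and links terms to Wikidata IDs via the dictionary entries.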
> I realise the answers might be 'it depends' because I know you use this toolset/approach for a range of different fact extraction workflows. So my apologies if off track.

Anything you can do to document how you are using AMI and whether it works will be very useful.
It's possible to go from SVG to HTML but it's fragmented and sometimes garbled. We also use GROBID.
Have invited you to Slack.
How to add more PDFs to an existing project.