petermr / tigr2ess

Materials for TIGR2ESS workshop in Delhi Feb 2019 - joint UK(Cambridge) - India project on Food Security.

Dictionary creation - list of Indian institutes. #33

Closed ambarishK closed 5 years ago

ambarishK commented 5 years ago

ambarish123@ubuntu:~$ ami-dictionary create --input https://en.wikipedia.org/wiki/Institutes_of_National_Importance --informat wikipage --dictionary Research_institute4 --outformats xml

Generic values (AMIDictionaryTool)

basename null cproject
ctree
cTreeList null dryrun false excludeBase null excludeTrees null file types [] forceMake false includeBase null includeTrees null log4j
logfile null verbose 0

Specific values (AMIDictionaryTool)

dataCols null dictionary [Research_institute4] dictionaryTop null href null hrefCols null input https://en.wikipedia.org/wiki/Institutes_of_National_Importance informat wikipage dictInformat null linkCol null log4j null nameCol null operation create outformats [xml] splitCol , termCol null terms null wikiLinks [wikipedia, wikidata] 0 [main] DEBUG org.contentmine.ami.tools.AMIDictionaryTool - extracting hyperlinks ..................!.!.!.!.!.!.!.....................!...................................................................................................................................................................................................!..!.............................................................................................................................................[Fatal Error] :302:913: The entity name must immediately follow the '&' in the entity reference. Exception in thread "main" java.lang.RuntimeException: cannot parse/read stream: at org.contentmine.eucl.xml.XMLUtil.parseQuietlyToDocument(XMLUtil.java:1176) at org.contentmine.eucl.xml.XMLUtil.parseQuietlyToRootElement(XMLUtil.java:1164) at org.contentmine.ami.tools.AMIDictionaryTool.addWikipedia(AMIDictionaryTool.java:786) at org.contentmine.ami.tools.AMIDictionaryTool.addWikipediaPage(AMIDictionaryTool.java:766) at org.contentmine.ami.tools.AMIDictionaryTool.addWikiLinks(AMIDictionaryTool.java:745) at org.contentmine.ami.tools.AMIDictionaryTool.createDictionaryListInRandomOrder(AMIDictionaryTool.java:733) at org.contentmine.ami.tools.AMIDictionaryTool.addEntriesToDictionaryElement(AMIDictionaryTool.java:717) at org.contentmine.ami.tools.AMIDictionaryTool.writeNamesAndLinks(AMIDictionaryTool.java:685) at org.contentmine.ami.tools.AMIDictionaryTool.createDictionary(AMIDictionaryTool.java:524) at org.contentmine.ami.tools.AMIDictionaryTool.runDictionary(AMIDictionaryTool.java:408) at org.contentmine.ami.tools.AMIDictionaryTool.runSpecifics(AMIDictionaryTool.java:397) at 
org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:218) at org.contentmine.ami.tools.AMIDictionaryTool.main(AMIDictionaryTool.java:361) Caused by: nu.xom.ParsingException: The entity name must immediately follow the '&' in the entity reference. at line 302, column 913 at nu.xom.Builder.build(Unknown Source) at nu.xom.Builder.build(Unknown Source) at org.contentmine.eucl.xml.XMLUtil.parseQuietlyToDocument(XMLUtil.java:1174) ... 12 more Caused by: org.xml.sax.SAXParseException; lineNumber: 302; columnNumber: 913; The entity name must immediately follow the '&' in the entity reference. at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ... 15 more

Error while creating dictionary for Indian institutes.
petermr commented 5 years ago

Thank you. Please reformat into monospace (use <pre> tags).

ambarish123@ubuntu:~$ ami-dictionary create --input https://en.wikipedia.org/wiki/Institutes_of_National_Importance --informat wikipage --dictionary Research_institute4 --outformats xml

Generic values (AMIDictionaryTool)
================================
basename            null
cproject            
ctree               
cTreeList           null
dryrun              false
excludeBase         null
excludeTrees        null
file types          []
forceMake           false
includeBase         null
includeTrees        null
log4j               
logfile             null
verbose             0

Specific values (AMIDictionaryTool)
================================
dataCols      null
dictionary    [Research_institute4]
dictionaryTop     null
href          null
hrefCols      null
input         https://en.wikipedia.org/wiki/Institutes_of_National_Importance
informat      wikipage
dictInformat  null
linkCol       null
log4j         null
nameCol       null
operation     create
outformats    [xml]
splitCol      ,
termCol       null
terms         null
wikiLinks     [wikipedia, wikidata]
0    [main] DEBUG org.contentmine.ami.tools.AMIDictionaryTool  - extracting hyperlinks
..................!.!.!.!.!.!.!.....................!...................................................................................................................................................................................................!..!.............................................................................................................................................[Fatal Error] :302:913: The entity name must immediately follow the '&' in the entity reference.
Exception in thread "main" java.lang.RuntimeException: cannot parse/read stream: 
    at org.contentmine.eucl.xml.XMLUtil.parseQuietlyToDocument(XMLUtil.java:1176)
    at org.contentmine.eucl.xml.XMLUtil.parseQuietlyToRootElement(XMLUtil.java:1164)
    at org.contentmine.ami.tools.AMIDictionaryTool.addWikipedia(AMIDictionaryTool.java:786)
    at org.contentmine.ami.tools.AMIDictionaryTool.addWikipediaPage(AMIDictionaryTool.java:766)
    at org.contentmine.ami.tools.AMIDictionaryTool.addWikiLinks(AMIDictionaryTool.java:745)
    at org.contentmine.ami.tools.AMIDictionaryTool.createDictionaryListInRandomOrder(AMIDictionaryTool.java:733)
    at org.contentmine.ami.tools.AMIDictionaryTool.addEntriesToDictionaryElement(AMIDictionaryTool.java:717)
    at org.contentmine.ami.tools.AMIDictionaryTool.writeNamesAndLinks(AMIDictionaryTool.java:685)
    at org.contentmine.ami.tools.AMIDictionaryTool.createDictionary(AMIDictionaryTool.java:524)
    at org.contentmine.ami.tools.AMIDictionaryTool.runDictionary(AMIDictionaryTool.java:408)
    at org.contentmine.ami.tools.AMIDictionaryTool.runSpecifics(AMIDictionaryTool.java:397)
    at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:218)
    at org.contentmine.ami.tools.AMIDictionaryTool.main(AMIDictionaryTool.java:361)
Caused by: nu.xom.ParsingException: The entity name must immediately follow the '&' in the entity reference. at line 302, column 913
    at nu.xom.Builder.build(Unknown Source)
    at nu.xom.Builder.build(Unknown Source)
    at org.contentmine.eucl.xml.XMLUtil.parseQuietlyToDocument(XMLUtil.java:1174)
    ... 12 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 302; columnNumber: 913; The entity name must immediately follow the '&' in the entity reference.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    ... 15 more

###### Error while creating dictionary for Indian institutes.
petermr commented 5 years ago

This is almost certainly a bug. The raw text contains a bare & which is not converted to &amp; before the page is parsed as XML, so the parser fails at line 302, column 913. I will open another issue.
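
To illustrate the failure in the stack trace above, here is a minimal sketch in Python (not AMI's own Java code) of how a bare & breaks a strict XML parser and how escaping it first avoids the error. The helper name `escape_bare_ampersands`, the regular expression, and the sample markup are assumptions made for illustration; the real fix in AMI may look quite different.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper, not part of AMI: matches an '&' that does NOT begin
# a named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace every bare '&' with '&amp;', leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


# Made-up sample resembling a Wikipedia link with an unescaped query string.
raw = '<p><a href="/w/index.php?title=IIT&action=edit">AT&T &amp; IIT</a></p>'

# A strict XML parser rejects the bare '&', much as Xerces does in the trace.
try:
    ET.fromstring(raw)
except ET.ParseError as err:
    print("strict parse failed:", err)

# After escaping, the same markup parses and the '&' survives as data.
fixed = escape_bare_ampersands(raw)
root = ET.fromstring(fixed)
print(root.find("a").get("href"))
print(root.find("a").text)
```

This only handles bare ampersands. HTML-only named entities that XML does not define would still need separate treatment, and an HTML-tolerant parser may be a better fit for scraped pages than pre-processing with a regular expression.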