petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.
The Unlicense
67 stars 17 forks

Running `amidict` or `ami-dictionary` in Windows 10 #62

Open Priya-Jk-15 opened 4 years ago

Priya-Jk-15 commented 4 years ago

I am trying to create a dictionary using amidict commands in Windows 10.

I have installed AMI and checked its installation using ami --help. I have also used getpapers to download the papers and ami search to arrange the papers with respect to the required dictionaries.

Now I am trying to run amidict to create a new dictionary. I was able to run amidict --help and it showed the available commands (as per the FAQ: https://github.com/petermr/openVirus/wiki/FAQ).
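
For reference, the sequence of commands I ran looks roughly like this (the query and directory names are just placeholders for mine, and the exact flags may differ on other installations):

getpapers -q "viral epidemics" -o viral_epidemics -x -k 100
ami -p viral_epidemics search --dictionary country
amidict --help

i.e. download the papers into a project directory, annotate them against an existing dictionary, and then check that the dictionary tool itself responds.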

But when I gave the command

amidict create --terms thymol menthol --dictionary myterprenes --directory Dictionary --outformats xml,html

to test creating a dictionary from the tigr2ess tutorial (https://github.com/petermr/tigr2ess/blob/master/dictionaries/TUTORIAL.md), I got the following no directory given error:

Generic values (DictionaryCreationTool)
====================
-v to see generic values
Specific values (DictionaryCreationTool)
====================
java.lang.RuntimeException: no directory given
        at org.contentmine.ami.tools.dictionary.DictionaryCreationTool.createDictionary(DictionaryCreationTool.java:265)
[...]

I have also tried creating a new directory and running the same command again, but the output was the same no directory given error.

What shall I change in the syntax to create a new dictionary? Kindly guide me.

petermr commented 4 years ago

Thanks. I think this is at least a documentation bug (or worse) and I have to fix it.

petermr commented 4 years ago

Try the following. First make a plain-text file junkterms.txt containing your terms, one per line.
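
For this test the file just needs the two terms from above:

thymol
menthol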

then run:

amidict -v --dictionary myterpenes --directory junkdir --input junkterms.txt create --informat list --outformats xml,html

and get

Generic values (DictionaryCreationTool)
================================
--testString: null (default)
--wikilinks: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@701fc37a (default)
--datacols: null (default)
--hrefcols: null (default)
--informat: list (matched)
--linkcol: null (default)
--namecol: null (default)
--outformats: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$DictionaryFileFormat;@4148db48 (matched)
--query: 10 (default)
--template: null (default)
--termcol: null (default)
--termfile: null (default)
--terms: null (default)
--wptype: null (default)
--help: false (default)
--version: false (default)

Specific values (DictionaryCreationTool)
================================
--testString: null (default)
--wikilinks: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@701fc37a (default)
--datacols: null (default)
--hrefcols: null (default)
--informat: list (matched)
--linkcol: null (default)
--namecol: null (default)
--outformats: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$DictionaryFileFormat;@4148db48 (matched)
--query: 10 (default)
--template: null (default)
--termcol: null (default)
--termfile: null (default)
--terms: null (default)
--wptype: null (default)
--help: false (default)
--version: false (default)
N 2; T 2
[Fatal Error] :2214:5: The element type "input" must be terminated by the matching end-tag "</input>".
0    [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: menthol; cannot parse/read stream: 
0 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: menthol; cannot parse/read stream: 
[Fatal Error] :1298:5: The element type "input" must be terminated by the matching end-tag "</input>".
187  [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: thymol; cannot parse/read stream: 
187 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: thymol; cannot parse/read stream: 
++>> myterpenes
>> dict myterpenes
writing dictionary to /Users/pm286/projects/junk/junkdir/myterpenes.xml
writing dictionary to /Users/pm286/projects/junk/junkdir/myterpenes.html

This creates the dictionaries (actually we can probably drop the html)

The Wikipedia errors mean that the format of the Wikipedia page has changed and we need to change the code. This is tedious and common with remote sites.
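
As an aside, and purely as a sketch of one possible approach rather than what ami currently does: an HTML5-aware cleaner such as jsoup can turn the wild Wikipedia HTML into well-formed XHTML before it reaches a strict XML parser, e.g.:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

// Sketch only: fetch a Wikipedia page and re-serialize it as well-formed
// XHTML so that a strict XML parser (e.g. nu.xom.Builder) can read it.
public class WikipediaPageCleaner {
    public static String fetchAsXhtml(String pageTitle) throws Exception {
        String url = "https://en.wikipedia.org/wiki/" + pageTitle;
        Document doc = Jsoup.connect(url).userAgent("openVirus-dictionary-test").get();
        doc.outputSettings()
           .syntax(Document.OutputSettings.Syntax.xml)    // emit XML syntax, not HTML
           .escapeMode(Entities.EscapeMode.xhtml);        // use XHTML entity escaping
        return doc.outerHtml();                           // now well-formed XHTML
    }
}

The idea is that the re-serialized output is always well-formed, however messy the incoming markup is.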

Priya-Jk-15 commented 4 years ago

Where should I create that file to put the terms in? Should it go in openVirus/Dictionaries?

petermr commented 4 years ago

Kareena has already correctly created https://github.com/petermr/openVirus/blob/master/dictionaries/virus

I think you will need https://github.com/petermr/openVirus/blob/master/dictionaries/disease

We should probably have a 6th directory: https://github.com/petermr/openVirus/blob/master/dictionaries/test

where anyone can create and test small dictionaries

AmbrineH commented 4 years ago

While running the command as per your suggestion @petermr, the dictionary is created but there are multiple errors and some values are left out. E.g. in my case there were 180 input countries while the output .xml file had only 117 entries.

The errors look something like this:

Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
[Fatal Error] :1825:5: The element type "input" must be terminated by the matching end-tag "</input>".
1142608 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: yuan dynasty; cannot parse/read stream:
1142608 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: yuan dynasty; cannot parse/read stream:
[Fatal Error] :186:5: The element type "input" must be terminated by the matching end-tag "</input>".
<186/5>badline >                </div>
                </div>
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
[Fatal Error] :1712:5: The element type "input" must be terminated by the matching end-tag "</input>".
1147681 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: zambia; cannot parse/read stream:
1147681 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: zambia; cannot parse/read stream:
[Fatal Error] :186:5: The element type "input" must be terminated by the matching end-tag "</input>".
<186/5>badline >                </div>
                </div>
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
[Fatal Error] :2431:5: The element type "input" must be terminated by the matching end-tag "</input>".
1153027 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: zimbabwe; cannot parse/read stream:
1153027 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool  - cannot parse wikipedia page for: zimbabwe; cannot parse/read stream:
[Fatal Error] :186:5: The element type "input" must be terminated by the matching end-tag "</input>".
<186/5>badline >                </div>
                </div>
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>> country
>> dict country
writing dictionary to C:\Users\eless\country\country.xml
writing dictionary to C:\Users\eless\country\country.html
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Missing wikipedia: :

county of nassau; kingdom of aksum; kingdom of the netherlands; nassau-hadamar; northern mariana islands; polska ludowa; principality of iberia; q11908127;
q21076477; q30904761; rattanakosin kingdom; sahrawi arab democratic republic; saint lucia; sovereign military order of malta; são tomé and príncipe; the gambia;

Missing wikidata: :

richardofsussex commented 4 years ago

Hi, these sound like simple XML parsing errors, suggesting that your input is not well-formed XML. Use something like https://www.xmlvalidation.com/, or post it here for me to check.
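
If you have libxml2 installed you can also do a quick local well-formedness check (the file name here is just a placeholder):

xmllint --noout myfile.xml

It prints nothing if the file is well-formed and reports the offending line and column otherwise.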

AmbrineH commented 4 years ago

The input was provided as a .txt file, as was indicated in the discussion within this thread. Do I need to convert it to XML first?

petermr commented 4 years ago

It will be the Wikimedia pages. I will have to out them through a different cleaner. Most HTML in the wild is awful.


petermr commented 4 years ago

sorry out => put


richardofsussex commented 4 years ago

@petermr where are you looking up these terms? Can't we access a data source that is well-formed XML, e.g. by using an Accept header?
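
For example (illustrative only, using DBpedia, whose resource URIs support content negotiation):

curl -L -H 'Accept: application/rdf+xml' http://dbpedia.org/resource/Thymol

-L follows the 303 redirect, and the response comes back as well-formed RDF/XML rather than scraped HTML.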

petermr commented 4 years ago

They are Wikipedia pages. There is no alternative.

Here's the culprit


This is HTML5 - it's not well formed. It was a pointless exercise. The onus is on me to parse it. Drives me wild.


petermr commented 4 years ago

Another thing that screws people is DTDs. Often they are not resolvable and we crash. I strip all DTDs.
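
For anyone hitting this themselves, a conventional workaround (a sketch, not necessarily what ami does internally) is to hand XOM a SAX reader that never fetches external DTDs:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import nu.xom.Builder;
import nu.xom.Document;

// Sketch only: build a XOM document with a SAX reader configured so that
// external DTDs and entities are never fetched; an unresolvable DOCTYPE
// then cannot abort the run.
public class DtdLessBuilder {
    public static Document parse(java.io.InputStream in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        // Xerces feature names; other SAX implementations may differ
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        return new Builder(reader).build(in);
    }
}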


richardofsussex commented 4 years ago

Any scope for using Wikipedia's Linked Data twin dbpedia? (Not sure from the above exactly what you are searching for.)

petermr commented 4 years ago

I don't think DBpedia gives any advantages. In any case we are tooled up for Wikidata.


richardofsussex commented 4 years ago

Then why are you searching in Wikipedia?

petermr commented 4 years ago


We're searching in both (Wikipedia and Wikidata).


richardofsussex commented 4 years ago

OK, well the advantage of dbpedia is that you can put SPARQL queries to it, and get back machine-processible responses. Of course, it may not have the content you are searching for: it's just the content of the info boxes. If you give me a specific query requirement, I can look into how well dbpedia might be able to answer it.
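
As a trivial illustration (the term is chosen only for the example), an ASK query against the public endpoint tells you whether a resource exists and comes back as JSON:

curl -G 'https://dbpedia.org/sparql' \
  --data-urlencode 'query=ASK { <http://dbpedia.org/resource/Thymol> ?p ?o }' \
  --data-urlencode 'format=application/sparql-results+json'

The response is a small JSON document with a boolean result, rather than HTML that has to be repaired before parsing.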

petermr commented 4 years ago

But that's what Wikidata does. It has effectively subsumed DBpedia (it has all the infoboxes and a lot more directly donated).
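
For comparison (again with an illustrative term), one equivalent Wikidata lookup goes through the standard wbsearchentities API and is also plain JSON:

curl 'https://www.wikidata.org/w/api.php?action=wbsearchentities&search=thymol&language=en&format=json'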


richardofsussex commented 4 years ago

Wikidata is a separate exercise (with its own problems of data consistency). My point is that dbpedia is in lockstep with (= is automatically extracted from) Wikipedia, so if you're searching for a Wikipedia page with a title which matches your dictionary term, you could just as well be searching for the equivalent 'page' in dbpedia. By doing that, you can establish whether the page exists (and therefore whether the corresponding Wikipedia page exists) and get back a reliable machine-processible response. Which brings me back to my original question: what information is there in the corresponding Wikipedia page which we need, and which we can't get from dbpedia?

petermr commented 4 years ago

There is a lot of software that needs to be written. If you're volunteering to do this for DBP and show that it's superior to WD, fine. But it's not top priority. We work closely with the Wikidata group in Berlin.


richardofsussex commented 4 years ago

Peter, please re-read my comments, noting that I am simply trying to address the problems reported above, namely that Wikipedia responses are not processible. I am not suggesting replacing WD with DBP!