Open Priya-Jk-15 opened 4 years ago
Thanks. I think this is at least a documentation bug (or worse) and I have to fix it.
Try the following. Create a file of terms (terpenes.txt), one per line:
menthol
thymol
then run:
amidict -v --dictionary myterpenes --directory junkdir --input junkterms.txt create --informat list --outformats xml,html
and get
Generic values (DictionaryCreationTool)
================================
--testString: null (default)
--wikilinks: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@701fc37a (default)
--datacols: null (default)
--hrefcols: null (default)
--informat: list (matched)
--linkcol: null (default)
--namecol: null (default)
--outformats: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$DictionaryFileFormat;@4148db48 (matched)
--query: 10 (default)
--template: null (default)
--termcol: null (default)
--termfile: null (default)
--terms: null (default)
--wptype: null (default)
--help: false (default)
--version: false (default)
Specific values (DictionaryCreationTool)
================================
--testString: null (default)
--wikilinks: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@701fc37a (default)
--datacols: null (default)
--hrefcols: null (default)
--informat: list (matched)
--linkcol: null (default)
--namecol: null (default)
--outformats: [Lorg.contentmine.ami.tools.AbstractAMIDictTool$DictionaryFileFormat;@4148db48 (matched)
--query: 10 (default)
--template: null (default)
--termcol: null (default)
--termfile: null (default)
--terms: null (default)
--wptype: null (default)
--help: false (default)
--version: false (default)
N 2; T 2
[Fatal Error] :2214:5: The element type "input" must be terminated by the matching end-tag "</input>".
0 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: menthol; cannot parse/read stream:
0 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: menthol; cannot parse/read stream:
[Fatal Error] :1298:5: The element type "input" must be terminated by the matching end-tag "</input>".
187 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: thymol; cannot parse/read stream:
187 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: thymol; cannot parse/read stream:
++>> myterpenes
>> dict myterpenes
writing dictionary to /Users/pm286/projects/junk/junkdir/myterpenes.xml
writing dictionary to /Users/pm286/projects/junk/junkdir/myterpenes.html
This creates the dictionaries (actually we can probably drop the html).
The Wikipedia errors mean that the format of the Wikipedia page has changed and we need to change the code. This is tedious and common with remote sites.
Where should I create that file to put the terms in? Should I create it in openVirus/Dictionaries?
Kareena has already correctly created https://github.com/petermr/openVirus/blob/master/dictionaries/virus
I think you will need https://github.com/petermr/openVirus/blob/master/dictionaries/disease
We should probably have a 6th directory, https://github.com/petermr/openVirus/blob/master/dictionaries/test, where anyone can create and test small dictionaries.
While running the command as per your suggestion @petermr, the dictionary is created, but there are multiple errors and some values are left out; e.g. in my case there were 180 input countries while the output .xml file had only 117 entries.
The errors look something like this:
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
[Fatal Error] :1825:5: The element type "input" must be terminated by the matching end-tag "</input>".
1142608 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: yuan dynasty; cannot parse/read stream:
1142608 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: yuan dynasty; cannot parse/read stream:
[Fatal Error] :186:5: The element type "input" must be terminated by the matching end-tag "</input>".
<186/5>badline > </div>
</div>
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
[Fatal Error] :1712:5: The element type "input" must be terminated by the matching end-tag "</input>".
1147681 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: zambia; cannot parse/read stream:
1147681 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: zambia; cannot parse/read stream:
[Fatal Error] :186:5: The element type "input" must be terminated by the matching end-tag "</input>".
<186/5>badline > </div>
</div>
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
[Fatal Error] :2431:5: The element type "input" must be terminated by the matching end-tag "</input>".
1153027 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: zimbabwe; cannot parse/read stream:
1153027 [main] ERROR org.contentmine.ami.tools.AbstractAMIDictTool - cannot parse wikipedia page for: zimbabwe; cannot parse/read stream:
[Fatal Error] :186:5: The element type "input" must be terminated by the matching end-tag "</input>".
<186/5>badline > </div>
</div>
Cannot add entry: nu.xom.ParsingException: The element type "input" must be terminated by the matching end-tag "</input>". at line 186, column 5
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>> country
>> dict country
writing dictionary to C:\Users\eless\country\country.xml
writing dictionary to C:\Users\eless\country\country.html
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Missing wikipedia: :
county of nassau; kingdom of aksum; kingdom of the netherlands; nassau-hadamar; northern mariana islands; polska ludowa; principality of iberia; q11908127;
q21076477; q30904761; rattanakosin kingdom; sahrawi arab democratic republic; saint lucia; sovereign military order of malta; são tomé and príncipe; the gambia;
Missing wikidata: :
Hi, these sound like simple XML parsing errors, suggesting that your input is not well-formed XML. Use something like https://www.xmlvalidation.com/, or post it here for me to check.
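If you would rather check locally, a minimal well-formedness check with nu.xom (the parser named in the errors above) could look like the sketch below; this is only an illustration, not the actual AMIDict code.

import java.io.File;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.ParsingException;

// Sketch: parse the file with nu.xom and report where it breaks, if it does.
public class WellFormedCheck {
    public static void main(String[] args) throws Exception {
        File file = new File(args[0]);
        try {
            Document doc = new Builder().build(file);
            System.out.println("well-formed, root element: " + doc.getRootElement().getLocalName());
        } catch (ParsingException e) {
            System.out.println("NOT well-formed at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage());
        }
    }
}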
The input was provided as a .txt file, as was indicated in the discussion within this thread. Do I need to convert it to XML first?
It will be the Wikimedia pages. I will have to put them through a different cleaner. Most HTML in the wild is awful.
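A sketch of what such a cleaner could do, using jsoup purely as an illustration (it is not necessarily what AMIDict uses): parse the tag soup leniently, then re-serialise it as XML so that unclosed tags such as <input> no longer break a strict parser.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings.Syntax;
import org.jsoup.nodes.Entities;

// Sketch: lenient HTML5 parse, then XML-style serialisation (void elements self-closed).
public class HtmlCleaner {
    public static String toWellFormedXml(String rawHtml) {
        Document doc = Jsoup.parse(rawHtml);              // tolerates tag soup
        doc.outputSettings().syntax(Syntax.xml);          // emit <input ... /> etc.
        doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
        return doc.html();                                // now acceptable to an XML parser
    }
}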
@petermr where are you looking up these terms? Can't we access a data source that is well-formed XML, e.g. by using an Accept header?
They are Wikipedia pages. There is no alternative.
Here's the culprit
This is HTML5 - it's not well formed. It was a pointless exercise. The onus is on me to parse it. Drives me wild.
Another thing that screws people is DTDs. Often they are not resolvable and we crash. I strip all DTDs.
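A minimal sketch of the stripping idea (illustrative only): delete any DOCTYPE declaration before the text reaches the XML parser, so an unresolvable external DTD can never be fetched or block the parse.

import java.util.regex.Pattern;

// Sketch: remove a DOCTYPE declaration (including any internal subset) from raw XML/HTML text.
// An alternative is to configure the SAX reader, e.g. set load-external-dtd to false.
public class DtdStripper {
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE[^>\\[]*(\\[[^\\]]*\\])?[^>]*>", Pattern.CASE_INSENSITIVE);

    public static String stripDoctype(String xml) {
        return DOCTYPE.matcher(xml).replaceFirst("");
    }
}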
Any scope for using Wikipedia's Linked Data twin dbpedia? (Not sure from the above exactly what you are searching for.)
I don't think DBPedia gives any advantages. In any case we are tooled up for Wikidata.
Then why are you searching in Wikipedia?
We're searching in both.
OK, well the advantage of dbpedia is that you can put SPARQL queries to it, and get back machine-processible responses. Of course, it may not have the content you are searching for: it's just the content of the info boxes. If you give me a specific query requirement, I can look into how well dbpedia might be able to answer it.
But that's what Wikidata does. It has effectively subsumed DBPedia (it has all the infoboxes and a lot more directly donated).
Wikidata is a separate exercise (with its own problems of data consistency). My point is that dbpedia is in lockstep with (= is automatically extracted from) Wikipedia, so if you're searching for a Wikipedia page with a title which matches your dictionary term, you could just as well be searching for the equivalent 'page' in dbpedia. By doing that, you can establish whether the page exists (and therefore whether the corresponding Wikipedia page exists) and get back a reliable machine-processible response. Which brings me back to my original question: what information is there in the corresponding Wikipedia page which we need, and which we can't get from dbpedia?
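To make the DBpedia suggestion concrete, here is a hedged sketch (not part of AMIDict): an ASK query against the public DBpedia SPARQL endpoint says whether a resource exists and comes back as machine-readable JSON; the resource Menthol is just an example term from earlier in the thread.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch: does the DBpedia resource exist? Expect a small JSON body like {"head":{},"boolean":true}.
public class DbpediaAsk {
    public static void main(String[] args) throws Exception {
        String query = "ASK { <http://dbpedia.org/resource/Menthol> ?p ?o }";
        String url = "https://dbpedia.org/sparql?query="
            + URLEncoder.encode(query, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("Accept", "application/sparql-results+json")
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}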
There is a lot of software that needs to be written. If you're volunteering to do this for DBP and show that it's superior to WD, fine. But it's not top priority. We work closely with the Wikidata group in Berlin.
Peter, please re-read my comments, noting that I am simply trying to address the problems reported above that Wikipedia responses are not processible. I am not suggesting replacing WD by DBP!
I am trying to create a dictionary using amidict commands in Windows 10. I have installed AMI and checked its installation using ami --help. I have also used getpapers to download the papers and ami -search to arrange the papers with respect to the required dictionaries. Now I am trying to run amidict to create a new dictionary. I was able to give the command amidict --help and it showed the commands (as per the FAQ https://github.com/petermr/openVirus/wiki/FAQ). But when I gave the command for testing, to create a dictionary from the tigr2ess tutorial https://github.com/petermr/tigr2ess/blob/master/dictionaries/TUTORIAL.md, I got the following output: no directory given. I have also tried creating a new directory and gave the same command, but the same no directory given was the output. What shall I change in the syntax to create a new dictionary? Kindly guide me.
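For comparison, the worked example near the top of this thread passes --dictionary, --directory and --input explicitly; assuming the syntax has not changed, a test command would look something like:
amidict -v --dictionary mydictionary --directory mydictdir --input terms.txt create --informat list --outformats xml,html
where mydictionary, mydictdir and terms.txt are placeholder names and terms.txt holds one term per line.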