petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.
The Unlicense

Documenting Testing Expanding AMIDownload #39

Open petermr opened 4 years ago

petermr commented 4 years ago

AMIDownloadTool is a wrapper for various ways of crawling and scraping sites. The best developed is biorxiv. This is complex:

petermr commented 4 years ago

Lezan, I created a small test in that class which needs manual checking. Does it give the same files as the manual search and click gives? Let's ask @anjackson what the right names are...

l-hawizy commented 4 years ago

I have had a look at AMIDownloadTest; these are the errors I found. A few tests seem to throw an IllegalThreadStateException, which I am not sure how to go about fixing. testBiorxivSmall() only fails because there is no variable for the landingpage argument on line 67 and it's missing a comma between html and pdf on line 68.

testBiorxivClimate could be a false assert statement: it's looking for a folder called metadata and a file called page1.html, but it has created a __metadata folder instead, with files under it called resultSetX.html. Are those the same thing?
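To help with the manual check, here is a minimal sketch that lists what is actually on disk under both layouts. The class name MetadataLayoutCheck and the glob patterns are hypothetical, and the default directory target/biorxiv/climate is only assumed to be where the test writes its output; nothing here is taken from the AMI code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of AMI: reports which metadata layout exists.
public class MetadataLayoutCheck {

    /** Lists file names under dir/subdir that match the glob; empty if subdir is absent. */
    public static List<String> list(Path dir, String subdir, String glob) throws IOException {
        List<String> names = new ArrayList<>();
        Path target = dir.resolve(subdir);
        if (!Files.isDirectory(target)) {
            return names;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(target, glob)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumed output directory; pass a different one as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "target/biorxiv/climate");
        // Layout the assertion expects.
        System.out.println("metadata/page*.html: " + list(dir, "metadata", "page*.html"));
        // Layout the run appears to have produced.
        System.out.println("__metadata/resultSet*.html: " + list(dir, "__metadata", "resultSet*.html"));
    }
}
```

If only the second list is non-empty, the assertion is probably checking an older naming scheme rather than finding a real download failure.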

l-hawizy commented 4 years ago

Current issues in AMIDownloadTest (run through Eclipse):

Errors:
- testHALSearchResultSet(): picocli.CommandLine$ExecutionException: Error while calling command (org.contentmine.ami.tools.AMIDownloadTool@3a53c76a): java.lang.RuntimeException: nu.xom.ParsingException: The declaration for the entity "HTML.Version" must end with '>'. at line 31, column 3 in http://www.w3.org/TR/html4/loose.dtd
- testCreateUnpopulatedCTreesFromResultSet(): source dir target/biorxiv/climate doesn't exist because it was run before testBiorxivClimate
- testDownloadAndSearchLongIT(): missing argument for pagesize on line 566

Test failures:
- testAMISearch(): not creating the testsearch3 dir
- testSections(): same as above
- testSearch(): same
- testBiorxivClimate(): assertion error for landingPage (doesn't exist)
- testRelativeFile(): assertNotNull fails on file
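On the testHALSearchResultSet() error: loose.dtd is the SGML DTD for HTML 4, not an XML DTD, so an XML parser that follows the DOCTYPE and tries to read it will fail in this way. One common workaround is to stop the parser from fetching external DTDs. The sketch below shows the idea with the JDK's built-in parser only; the class name DoctypeTolerantParser is hypothetical, and whether the same resolver can be wired into the XOM parsing that AMI uses is an assumption that has not been checked against the code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of AMI: parses markup without loading its DTD.
public class DoctypeTolerantParser {

    /** Parses well-formed markup, answering every external entity request with an empty stream. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Returning an empty source means http://www.w3.org/TR/html4/loose.dtd
        // is never downloaded or parsed.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">"
                + "<html><body><p>result</p></body></html>";
        Document doc = parse(page);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

This only helps when the page is otherwise well-formed XML; real HTML result pages may still need tidying before an XML parser will accept them.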

petermr commented 4 years ago

Have @Ignore'd this test.

anjackson commented 4 years ago

On naming (if I'm not too late/irrelevant), FWIW, this is the way I'd describe the usual flow:

Unless the papers are HTML fulltext, in which case there are usually no Landing Pages, I think.