Open petermr opened 4 years ago
Lezan, I created a small test in that class which needs manual checking. Does it give the same files as a manual search and click? Let's ask @anjackson what the right names are...
I have had a look at AMIDownloadTest; these are the errors I found:
A few seem to throw an IllegalThreadStateException, which I am not sure how to go about fixing.
testBiorxivSmall() only fails because there is no variable for the landingpage argument on line 67, and it is missing a comma between html and pdf on line 68.
testBiorxivClimate could be a false assert statement: it is looking for a folder called metadata and a file called page1.html, but it has created a __metadata folder instead, with a file under it called resultSetX.html. Are those the same thing?
Current Issues in AMIDownloadTest (run through eclipse):
Errors:
testHALSearchResultSet(): picocli.CommandLine$ExecutionException: Error while calling command (org.contentmine.ami.tools.AMIDownloadTool@3a53c76a): java.lang.RuntimeException: nu.xom.ParsingException: The declaration for the entity "HTML.Version" must end with '>'. at line 31, column 3 in http://www.w3.org/TR/html4/loose.dtd
testCreateUnpopulatedCTreesFromResultSet(): source dir target/biorxiv/climate doesn't exist because this test was run before testBiorxivClimate
testDownloadAndSearchLongIT(): missing argument for pagesize on line 566
Test Failures:
testAMISearch(): not creating the testsearch3 dir
testSections(): same as above
testSearch(): same
testBiorxivClimate(): assertion error for landingPage (doesn't exist)
testRelativeFile(): assertNotNull fails on file
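The order dependence noted for testCreateUnpopulatedCTreesFromResultSet() could be made explicit with a precondition check. The following is only a sketch: the class and method names (SourceDirCheck, sourceDirReady) are hypothetical and not part of AMIDownloadTest. In a JUnit 4 test the boolean could be passed to org.junit.Assume.assumeTrue so the test is skipped rather than reported as an error when the directory is missing.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the ami3 code base.
class SourceDirCheck {

    /** True only if the directory an earlier test should have produced exists. */
    static boolean sourceDirReady(Path sourceDir) {
        return Files.isDirectory(sourceDir);
    }

    public static void main(String[] args) {
        // Path taken from the error report above.
        Path sourceDir = Paths.get("target/biorxiv/climate");
        if (!sourceDirReady(sourceDir)) {
            // In a JUnit 4 test this branch would become Assume.assumeTrue(...).
            System.out.println("skipping: " + sourceDir + " has not been created yet");
            return;
        }
        System.out.println("source dir present: " + sourceDir);
    }
}
```

This only hides the ordering problem; having the test create its own input, or running the download in a shared setup step, would remove the dependence.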
Have @Ignore'd this test.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
On naming (if I'm not too late/irrelevant), FWIW, this is the way I'd describe the usual flow:
meta tag in the HTML headers, e.g. citation_pdf_url or DC.identifier tags (section 2.F).
fulltext.pdf and convert to scholarly.html.
Unless the papers are HTML fulltext, in which case there are usually no Landing Pages, I think.
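As an illustration of the meta tag lookup mentioned above, here is a minimal sketch that pulls a citation_pdf_url value out of a landing page's HTML. The class name CitationMeta, the method findPdfUrl and the example URL are assumptions made for this sketch, and this is not how AMIDownloadTool does it. A regular expression is used only to keep the example dependency-free; it assumes the name attribute comes before content, and a real HTML parser would be more robust.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the ami3 code base.
class CitationMeta {

    // Matches <meta name="citation_pdf_url" content="..."> (name before content).
    private static final Pattern PDF_URL = Pattern.compile(
            "<meta\\s+name=[\"']citation_pdf_url[\"']\\s+content=[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the PDF URL advertised in the page header, if any. */
    static Optional<String> findPdfUrl(String landingPageHtml) {
        Matcher m = PDF_URL.matcher(landingPageHtml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up landing page; example.org is a placeholder host.
        String html = "<html><head>"
                + "<meta name=\"citation_pdf_url\" content=\"https://example.org/paper.full.pdf\">"
                + "</head><body></body></html>";
        System.out.println(findPdfUrl(html).orElse("no citation_pdf_url tag found"));
    }
}
```

Returning an Optional keeps the "no landing page metadata" case explicit, which matters for sites that serve HTML fulltext directly.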
AMIDownloadTool is a wrapper for various ways of crawling and scraping sites. The best developed is biorxiv. This is complex:
biorxiv gives a hit list in HTML