Documenting Testing Expanding AMIDownload

petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.

The Unlicense

Documenting Testing Expanding AMIDownload #39

Open petermr opened 4 years ago

petermr commented 4 years ago

AMIDownloadTool is a wrapper for various ways of crawling and scraping sites. The best developed is biorxiv. This is complex:

Manual search on biorxiv gives a hit list in HTML
we turn this into a single file ("ResultSet")
this set points to individual landing pages, which we download in HTML
these then point to individual fulltext.html and fulltext.pdf files
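
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the AMIDownloadTool code: the class and method names are hypothetical, the `href="/content/..."` link pattern is an assumption about the hit-list HTML, and the per-article directory naming is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the hit list -> ResultSet -> landing page -> fulltext flow. */
final class DownloadFlowSketch {

    // Assumption: landing-page links appear as href="/content/..." in the hit-list HTML.
    private static final Pattern LANDING_LINK =
            Pattern.compile("href=\"(/content/[^\"#?]+)\"");

    /** Steps 1-2: merge several hit-list pages into one ordered, de-duplicated "ResultSet". */
    static List<String> buildResultSet(List<String> hitListPages) {
        Set<String> links = new LinkedHashSet<>();
        for (String page : hitListPages) {
            Matcher m = LANDING_LINK.matcher(page);
            while (m.find()) {
                links.add(m.group(1));
            }
        }
        return new ArrayList<>(links);
    }

    /** Steps 3-4: the files we would expect to hold for one landing page. */
    static List<String> plannedFiles(String landingPath) {
        // Use the last path segment as a (hypothetical) per-article directory name.
        String dir = landingPath.substring(landingPath.lastIndexOf('/') + 1);
        return Arrays.asList(
                dir + "/landingPage.html",
                dir + "/fulltext.html",
                dir + "/fulltext.pdf");
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "<a href=\"/content/10.1101/111v1\">Paper A</a>",
                "<a href=\"/content/10.1101/222v1\">Paper B</a>");
        for (String landing : buildResultSet(pages)) {
            System.out.println(landing + " -> " + plannedFiles(landing));
        }
    }
}
```

Keeping the ResultSet as an ordered, de-duplicated list means a hit that shows up on two result pages is only downloaded once.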

petermr commented 4 years ago

Lezan, I created a small test in that class which needs manual checking. Does it give the same files as the manual search and click? Let's ask @anjackson what the right names are...

l-hawizy commented 4 years ago

I have had a look at AMIDownloadTest; these are the errors I found. A few tests seem to throw an IllegalThreadStateException, which I am not sure how to deal with. testBiorxivSmall() only fails because there is no variable for the landingpage argument on line 67 and it's missing a comma between html and pdf on line 68.

testBiorxivClimate could be a false assert statement: it's looking for a folder called metadata and a file called page1.html, but it has created a __metadata folder instead, with files under it called resultSetX.html. Are those the same thing?

l-hawizy commented 4 years ago

Current issues in AMIDownloadTest (run through Eclipse):

Errors:
testHALSearchResultSet(): picocli.CommandLine$ExecutionException: Error while calling command (org.contentmine.ami.tools.AMIDownloadTool@3a53c76a): java.lang.RuntimeException: nu.xom.ParsingException: The declaration for the entity "HTML.Version" must end with '>'. at line 31, column 3 in http://www.w3.org/TR/html4/loose.dtd
testCreateUnpopulatedCTreesFromResultSet(): source dir target/biorxiv/climate doesn't exist because it was run before testBiorxivClimate
testDownloadAndSearchLongIT(): missing argument for pagesize on line 566
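
The testHALSearchResultSet() error looks like an XML parser fetching the HTML 4 loose.dtd named in a DOCTYPE and failing on it, since that DTD is written in SGML syntax rather than XML. One possible workaround, sketched here with the JDK's own DOM parser rather than XOM, is to answer every external entity request with empty content so the DTD is never fetched. The class and method names are hypothetical, and whether AMIDownloadTool can be configured this way is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch: parse XHTML-like input without loading the external DTD. */
final class OfflineDtdParseSketch {

    static Document parseIgnoringExternalDtd(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external entity request (including the DOCTYPE's DTD)
            // with empty content, so nothing is fetched over the network.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">"
                + "<html><body><p>one hit</p></body></html>";
        Document doc = parseIgnoringExternalDtd(page);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The trade-off is that named entities defined only in the DTD (such as `&nbsp;`) become undefined, so input that uses them would still fail to parse.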

Test failures:
testAMISearch(): not creating the testsearch3 dir
testSections(): same as above
testSearch(): same
testBiorxivClimate(): assertion error for landingPage (doesn't exist)
testRelativeFile(): assertNotNull fails on file
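
Since testCreateUnpopulatedCTreesFromResultSet() relies on target/biorxiv/climate having been created by another test, one option is to check for the fixture up front and skip rather than error when it is missing. The helper below is a hypothetical sketch using only the JDK; in a JUnit 4 test its result could be passed to `Assume.assumeTrue(...)`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Hypothetical helper: is the output of an earlier test available as input? */
final class FixtureGuardSketch {

    /** Returns true only if dir is an existing directory with at least one entry. */
    static boolean isPopulated(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        // Files.list holds an open directory handle, so close it promptly.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path source = Paths.get("target", "biorxiv", "climate");
        if (isPopulated(source)) {
            System.out.println("fixture present: " + source);
        } else {
            System.out.println("fixture missing, test should be skipped: " + source);
        }
    }
}
```

A guard like this only hides the ordering problem; having the dependent test create its own input would remove it.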

petermr commented 4 years ago

Have @Ignore'd this test.

(Email reply to l-hawizy's comment of Wed, Apr 8, 2020, 1:38 PM.)

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

anjackson commented 4 years ago

On naming (if I'm not too late/irrelevant), FWIW, this is the way I'd describe the usual flow:

From a search, we get multiple SearchResultsPages.
Combining these gets a Search Results Set.
Each will usually point to a set of Landing Pages (at least that's what we call them at work).
Each Landing Page should point to the PDF (if it's open access), hopefully using a fairly standard meta tag in the HTML headers, e.g. citation_pdf_url or DC.identifier tags (section 2.F).
We then have to grab the fulltext.pdf and convert to scholarly.html.

Unless the papers are HTML fulltext, in which case there are usually no Landing Pages, I think.
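
As a rough illustration of the Landing Page step, the sketch below pulls a citation_pdf_url value out of the HTML head. It is hypothetical code, not part of AMI: it uses regular expressions for brevity, handles only quoted `name` and `content` attributes, and ignores DC.identifier. A real HTML parser would be more robust.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: find the PDF link advertised by a landing page. */
final class PdfLinkSketch {

    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME_ATTR =
            Pattern.compile("\\bname\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the content of the first meta tag named citation_pdf_url, if any. */
    static Optional<String> findPdfUrl(String landingPageHtml) {
        Matcher tags = META_TAG.matcher(landingPageHtml);
        while (tags.find()) {
            String tag = tags.group();
            // Match the attributes separately so their order within the tag does not matter.
            Matcher name = NAME_ATTR.matcher(tag);
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (name.find()
                    && "citation_pdf_url".equalsIgnoreCase(name.group(1))
                    && content.find()) {
                return Optional.of(content.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"citation_title\" content=\"An example paper\">"
                + "<meta name=\"citation_pdf_url\" content=\"https://example.org/paper.pdf\">"
                + "</head><body></body></html>";
        System.out.println(findPdfUrl(html).orElse("no PDF link found"));
    }
}
```

An empty result would then be the signal that the page is not open access, or that it is an HTML-fulltext paper without a separate Landing Page.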