petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.

`ami-download` and `AMIDownloadTool.runCommands(args)` behave differently #42

Open petermr opened 4 years ago

petermr commented 4 years ago
    // imports assumed for this snippet: java.io.File, org.apache.commons.io.FileUtils,
    // org.hamcrest.MatcherAssert, org.junit.Assert, org.junit.Test,
    // org.contentmine.ami.tools.AMIDownloadTool
    @Test
    public void testBiorxivSmall() throws Exception {

        File target = new File("target/biorxiv1");
        if (target.exists()) {FileUtils.deleteDirectory(target);}
        MatcherAssert.assertThat(target + " does not exist", !target.exists());
        String args = 
                "-p " + target
                + " --site biorxiv" // the type of site 
                + " --query coronavirus" // the query
                + " --pagesize 1" // size of remote pages (may not always work)
                + " --pages 1 1" // number of pages
                + " --fulltext pdf html"
                + " --resultset raw clean"
//              + " --limit 500"  // total number of downloaded results
            ;
        new AMIDownloadTool().runCommands(args);
        Assert.assertTrue("target exists", target.exists());
        // check for reserved and non-reserved child files
    }

This should download 1 page of length 1. When run in Eclipse this gives:

Generic values (AMIDownloadTool)
================================
-v to see generic values
oldstyle            true
0    [main] INFO  org.contentmine.ami.tools.AMIDownloadTool  - set output to: scraped/
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool  - set output to: scraped/
project         target/biorxiv1

Specific values (AMIDownloadTool)
================================
fulltext           [pdf, html]
limit              2
metadata           metadata
pages              [1, 1]
pagesize           1
query              [coronavirus]
resultSetList      [raw, clean]
site               biorxiv

Query: coronavirus%20sort%3Arelevance-rank%20numresults%3A1
URL https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1
runing curl :https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1 to target/biorxiv1/__metadata/resultSet1.html
wrote resultSet: /Users/pm286/workspace/cmdev/ami3/target/biorxiv1/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
Results 1
[target/biorxiv1/__metadata/resultSet1.clean.html]
download files in resultSet target/biorxiv1/__metadata/resultSet1.clean.html
result set: target/biorxiv1/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.01.30.926477v1]
running batched up curlDownloader for 1 landingPages, takes ca 1-5 sec/page 
ran curlDownloader for 1 landingPages 
downloaded 1 files
skipped: 10_1101_2020_01_30_926477v1
running [curl, -X, GET, https://www.biorxiv.org/content/10.1101/2020.01.30.926477v1]
.writing to :/Users/pm286/workspace/cmdev/ami3/target/biorxiv1/10_1101_2020_01_30_926477v1/abstract.html
writing to :/Users/pm286/workspace/cmdev/ami3/target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.html
writing to :/Users/pm286/workspace/cmdev/ami3/target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.pdf
target/biorxiv1
target/biorxiv1/10_1101_2020_01_30_926477v1/abstract.html
target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.html
target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.pdf
target/biorxiv1/10_1101_2020_01_30_926477v1/landingPage.html
target/biorxiv1/10_1101_2020_01_30_926477v1/resultSet.html
target/biorxiv1/10_1101_2020_01_30_926477v1/scrapedMetadata.html
target/biorxiv1/__metadata/resultSet1.clean.html
target/biorxiv1/__metadata/resultSet1.html

and terminates
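
For reference, the elided "check for reserved and non-reserved child files" could be sketched along the following lines. This is only a hypothetical completion, not the original test code: it uses plain java.io.File checks against the tree listed above, and the article directory name is the one observed in this particular run.

        // hypothetical follow-up assertions (not the original test code)
        // exactly one article tree plus __metadata should exist for --pagesize 1 --pages 1 1
        File[] articleDirs = target.listFiles(f -> f.isDirectory() && !f.getName().startsWith("__"));
        Assert.assertEquals("exactly one article tree", 1, articleDirs.length);
        Assert.assertTrue("metadata dir", new File(target, "__metadata").isDirectory());
        File article = new File(target, "10_1101_2020_01_30_926477v1"); // name observed in this run
        // reserved child files written by the download
        Assert.assertTrue("abstract", new File(article, "abstract.html").exists());
        Assert.assertTrue("fulltext html", new File(article, "fulltext.html").exists());
        Assert.assertTrue("fulltext pdf", new File(article, "fulltext.pdf").exists());
        Assert.assertTrue("landing page", new File(article, "landingPage.html").exists());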

petermr commented 4 years ago

When run on the command line it gives:

pm286macbook:ami3 pm286$ ami-download -p target/biorxiv --site biorxiv --query coronavirus --pagesize 1 --pages 1 1 --fulltext pdf html --resultset raw clean

Generic values (AMIDownloadTool)
================================
-v to see generic values
oldstyle            true
0    [main] INFO  org.contentmine.ami.tools.AMIDownloadTool  - set output to: scraped/
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool  - set output to: scraped/
project         target/biorxiv

Specific values (AMIDownloadTool)
================================
fulltext           [pdf, html]
limit              2
metadata           metadata
pages              [1, 1]
pagesize           1
query              [coronavirus]
resultSetList      [raw, clean]
site               biorxiv

Query: coronavirus%20sort%3Arelevance-rank%20numresults%3A1
URL https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1
runing curl :https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1 to target/biorxiv/__metadata/resultSet1.html
wrote resultSet: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
Results 1
[target/biorxiv/__metadata/resultSet1.clean.html, target/biorxiv/__metadata/resultSet10.clean.html, target/biorxiv/__metadata/resultSet2.clean.html, target/biorxiv/__metadata/resultSet3.clean.html, target/biorxiv/__metadata/resultSet4.clean.html, target/biorxiv/__metadata/resultSet5.clean.html, target/biorxiv/__metadata/resultSet6.clean.html, target/biorxiv/__metadata/resultSet7.clean.html, target/biorxiv/__metadata/resultSet8.clean.html, target/biorxiv/__metadata/resultSet9.clean.html]
download files in resultSet target/biorxiv/__metadata/resultSet1.clean.html
result set: target/biorxiv/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.01.30.926477v1]
running batched up curlDownloader for 1 landingPages, takes ca 1-5 sec/page 
ran curlDownloader for 1 landingPages 
downloaded 1 files
download files in resultSet target/biorxiv/__metadata/resultSet10.clean.html
result set: target/biorxiv/__metadata/resultSet10.clean.html
getAuthors(); NYI
    [... "getAuthors(); NYI" repeated 40 times in total, once per metadata entry ...]
metadataEntries 40
download with curl to <tree>scrapedMetadata.html[/content/10.1101/581512v2, /content/10.1101/856518v1, /content/10.1101/2020.01.24.919282v1, /content/10.1101/732255v3, /content/10.1101/800300v1, /content/10.1101/2020.02.06.936302v3, /content/10.1101/840090v2, /content/10.1101/353037v1, /content/10.1101/2020.01.12.902452v1, /content/10.1101/606715v1, /content/10.1101/695510v1, /content/10.1101/2020.01.10.901801v1, /content/10.1101/599043v1, /content/10.1101/094623v1, /content/10.1101/271171v2, /content/10.1101/2020.03.09.984393v1, /content/10.1101/2020.03.07.982207v1, /content/10.1101/780841v1, /content/10.1101/326546v1, /content/10.1101/676155v1, /content/10.1101/2019.12.18.880849v1, /content/10.1101/777847v2, /content/10.1101/2019.12.20.885590v1, /content/10.1101/2020.02.26.966143v1, /content/10.1101/402800v1, /content/10.1101/2019.12.16.875872v2, /content/10.1101/2020.02.16.946699v1, /content/10.1101/498998v1, /content/10.1101/2020.02.10.942847v1, /content/10.1101/623819v1, /content/10.1101/485060v1, /content/10.1101/476341v1, /content/10.1101/2020.04.02.020081v1, /content/10.1101/548909v1, /content/10.1101/2020.03.25.007534v1, /content/10.1101/2020.01.09.900555v1, /content/10.1101/812313v1, /content/10.1101/804716v1, /content/10.1101/2019.12.21.885921v2, /content/10.1101/296996v1]
running batched up curlDownloader for 40 landingPages, takes ca 1-5 sec/page 

The command-line run ignores the page restrictions (--pagesize 1 --pages 1 1) and starts downloading everything.
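
To pin down where the two entry points diverge, a small harness along these lines could drive the same arguments through both paths and compare what ends up in __metadata. This is only a sketch: it assumes the ami-download script is on the PATH and that ami3 (with AMIDownloadTool) is on the classpath; everything else uses only standard JDK classes, and the project directory names are made up for the comparison.

    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    import org.contentmine.ami.tools.AMIDownloadTool;

    public class DownloadParityCheck {

        // count the resultSet*.html files written under <project>/__metadata
        private static long countResultSets(String project) throws Exception {
            File metadata = new File(project, "__metadata");
            if (!metadata.isDirectory()) return 0;
            try (Stream<Path> files = Files.list(metadata.toPath())) {
                return files.filter(p -> p.getFileName().toString().startsWith("resultSet")).count();
            }
        }

        public static void main(String[] argv) throws Exception {
            String common = "--site biorxiv --query coronavirus --pagesize 1 --pages 1 1"
                    + " --fulltext pdf html --resultset raw clean";

            // 1. in-JVM path, as used from Eclipse/JUnit
            new AMIDownloadTool().runCommands("-p target/biorxivJvm " + common);

            // 2. installed command-line script (assumed to be on the PATH);
            //    splitting on single spaces is safe here because no argument is quoted
            new ProcessBuilder(("ami-download -p target/biorxivCli " + common).split(" "))
                    .inheritIO()
                    .start()
                    .waitFor();

            System.out.println("JVM resultSets: " + countResultSets("target/biorxivJvm"));
            System.out.println("CLI resultSets: " + countResultSets("target/biorxivCli"));
        }
    }

If the two counts differ (1 vs. 10, as the logs above suggest), that would point to the arguments reaching AMIDownloadTool differently via the installed script than via runCommands(args), rather than to a bug in the downloader itself.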