BiorxivDownloadTool retrieves the correct hitLists, but in the wrong order.
AMIDownloadTest.testSmallMultipageDownload shows the problem
Running
ami -p aardvark download --site biorxiv --query aardvark --pagesize 10 --pages 1 4 --fulltext html
should retrieve all results for the query aardvark. There are currently 21 (the count may increase by 1 or 2 as new ones are added).
It should retrieve 3 pages of 10, 10, and 1 hits.
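For reference, the expected page split can be computed directly (a minimal sketch; the assumption is simple ceiling-based paging, with only the last page partial):

```python
import math

def expected_pages(total_hits, pagesize):
    """Split total_hits into pages of pagesize; only the last page may be partial."""
    npages = math.ceil(total_hits / pagesize)
    return [min(pagesize, total_hits - i * pagesize) for i in range(npages)]

print(expected_pages(21, 10))  # -> [10, 10, 1]
```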
Instead it gives:
ami -p aardvark download --site biorxiv --query aardvark --pagesize 10 --pages 1 4 --fulltext html
Generic values (AMIDownloadTool)
================================
-v to see generic values
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
project aardvark
Specific values (AMIDownloadTool)
================================
fulltext [html]
limit 20
metadata metadata
pages [1, 4]
pagesize 10
query [aardvark]
hitListList []
site biorxiv
file types []
Query: aardvark%20sort%3Arelevance-rank%20numresults%3A10
URL https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=1 to aardvark/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList1.clean.html
metadataEntries 10
Results 10
calculating hits NYI
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=2 to aardvark/__metadata/hitList2.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList2.clean.html
metadataEntries 1
Results 1
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=3 to aardvark/__metadata/hitList3.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList3.clean.html
metadataEntries 10
Results 10
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=4 to aardvark/__metadata/hitList4.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList4.clean.html
metadataEntries 10
Results 10
[aardvark/__metadata/hitList1.clean.html, aardvark/__metadata/hitList2.clean.html, aardvark/__metadata/hitList3.clean.html, aardvark/__metadata/hitList4.clean.html]
========
HitList: 4
creates hitList[1..4][.clean].html
and <per-ctree>/scrapedMetadata.html
========
download files in hitList aardvark/__metadata/hitList1.clean.html
result set: aardvark/__metadata/hitList1.clean.html
metadataEntries 10
download with curl to <tree>scrapedMetadata.html[/content/10.1101/453662v1, /content/10.1101/450189v3, /content/10.1101/307777v2, /content/10.1101/429571v1, /content/10.1101/260745v4, /content/10.1101/028522v3, /content/10.1101/232991v1, /content/10.1101/164012v2, /content/10.1101/086819v2, /content/10.1101/093096v2]
ndingPages, takes ca 1-5 sec/page
(at this stage you can kill the job, but if you want it will run for another minute and download the papers correctly)
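The encoded query visible in the URLs above can be reproduced with standard percent-encoding (a sketch; I am assuming nothing beyond plain RFC 3986 encoding of the human-typed query string):

```python
from urllib.parse import quote

# The query as a human would type it into the biorxiv search box
query = "aardvark sort:relevance-rank numresults:10"
encoded = quote(query)  # spaces -> %20, ':' -> %3A
url = f"https://www.biorxiv.org/search/{encoded}?page=1"
print(url)
# -> https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=1
```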
The problem is that the 3 hitLists are in the wrong order. You can see that the encoded URLs carry the page numbers (I have not found a cursor argument); they are meant to mimic what a human types. From the few tests I have done, it seems that for N pages we get:
page 1
page 2
...
page N
page N-1
I'm mystified - you can see the URLs and try them yourself (you shouldn't need the escaping).
PLEASE TRY TO REPLICATE
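A quick way to confirm the disorder is to check the entry counts of the downloaded hitLists: only the final page should be partial. A hedged sketch (the `observed` counts are the metadataEntries values from the log above):

```python
def last_page_only_partial(counts, pagesize):
    """True if every page except the last is full -- the expected shape."""
    return all(c == pagesize for c in counts[:-1])

observed = [10, 1, 10, 10]  # metadataEntries per hitList, from the log
expected = [10, 10, 1]      # what 21 hits at pagesize 10 should give

print(last_page_only_partial(expected, 10))  # -> True
print(last_page_only_partial(observed, 10))  # -> False: the short page sits at position 2
```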