petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.
The Unlicense
67 stars 17 forks

BiorxivDownload fails to create correct hit lists (URGENT) #47

Open petermr opened 4 years ago

petermr commented 4 years ago

BiorxivDownloadTool retrieves the hitLists with the correct content but in the wrong order.

AMIDownloadTest.testSmallMultipageDownload shows the problem

running

ami -p aardvark download --site biorxiv --query aardvark --pagesize 10 --pages 1 4 --fulltext html

should retrieve all hits for aardvark. There are currently 21 (the count may increase by 1 or 2 as new preprints are added). It should retrieve 3 pages of 10, 10, and 1 hits. Instead it gives:
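For reference, the expected page breakdown can be computed directly (a minimal sketch; the 21-hit total is the count reported above and may drift as new preprints appear):

```python
import math

def page_sizes(total_hits, pagesize):
    """Split a result count into the expected per-page hit counts."""
    npages = math.ceil(total_hits / pagesize)
    return [min(pagesize, total_hits - i * pagesize) for i in range(npages)]

print(page_sizes(21, 10))  # expected breakdown: [10, 10, 1]
```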

ami -p aardvark download --site biorxiv --query aardvark --pagesize 10 --pages 1 4 --fulltext html

Generic values (AMIDownloadTool)
================================
-v to see generic values
0    [main] INFO  org.contentmine.ami.tools.AMIDownloadTool  - set output to: scraped/
project         aardvark

Specific values (AMIDownloadTool)
================================
fulltext           [html]
limit              20
metadata           metadata
pages              [1, 4]
pagesize           10
query              [aardvark]
hitListList      []
site               biorxiv
file types          []

Query: aardvark%20sort%3Arelevance-rank%20numresults%3A10
URL https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10
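The encoded query above is the standard percent-encoding of `aardvark sort:relevance-rank numresults:10` (spaces become `%20`, colons `%3A`). A minimal sketch of how it can be reproduced, assuming only the URL pattern visible in the log:

```python
from urllib.parse import quote

# The human-typed query, as described below ("mimic what the human types")
query = "aardvark sort:relevance-rank numresults:10"

# safe="" forces colons to be encoded as well as spaces
encoded = quote(query, safe="")
url = "https://www.biorxiv.org/search/" + encoded
print(url)
```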
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=1 to aardvark/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList1.clean.html
metadataEntries 10
Results 10
calculating hits NYI
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=2 to aardvark/__metadata/hitList2.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList2.clean.html
metadataEntries 1
Results 1
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=3 to aardvark/__metadata/hitList3.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList3.clean.html
metadataEntries 10
Results 10
running curl :https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10?page=4 to aardvark/__metadata/hitList4.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/aardvark/aardvark/__metadata/hitList4.clean.html
metadataEntries 10
Results 10
[aardvark/__metadata/hitList1.clean.html, aardvark/__metadata/hitList2.clean.html, aardvark/__metadata/hitList3.clean.html, aardvark/__metadata/hitList4.clean.html]
  ========
HitList: 4
 creates hitList[1..4][.clean].html
 and <per-ctree>/scrapedMetadata.html
========
download files in hitList aardvark/__metadata/hitList1.clean.html
result set: aardvark/__metadata/hitList1.clean.html
metadataEntries 10
download with curl to <tree>scrapedMetadata.html[/content/10.1101/453662v1, /content/10.1101/450189v3, /content/10.1101/307777v2, /content/10.1101/429571v1, /content/10.1101/260745v4, /content/10.1101/028522v3, /content/10.1101/232991v1, /content/10.1101/164012v2, /content/10.1101/086819v2, /content/10.1101/093096v2]
ndingPages, takes ca 1-5 sec/page 

(At this stage you can kill the job; if you let it run, it will continue for another minute and download the papers correctly.) The problem is that the hitLists are in the wrong order. You can see that the encoded URLs carry the page numbers (I have not found a cursor argument); they are meant to mimic what a human types. From the few tests I have done, it seems that for N pages we get:

page 1
page 2
...
page N
page N-1

I'm mystified. You can see the URLs and try it yourself (you shouldn't need the escaping). PLEASE TRY TO REPLICATE.
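To replicate, the per-page URLs can be generated from the pattern visible in the log and fetched in order with curl or a browser (a sketch; it only builds the URL strings and assumes nothing beyond the `?page=N` pattern shown above):

```python
# Base search URL exactly as it appears in the log output above
BASE = "https://www.biorxiv.org/search/aardvark%20sort%3Arelevance-rank%20numresults%3A10"

def page_urls(first, last):
    """Build the per-page search URLs for pages first..last (inclusive)."""
    return [f"{BASE}?page={p}" for p in range(first, last + 1)]

# The four requests the tool issued, in the order it issued them
for url in page_urls(1, 4):
    print(url)
```

Comparing the hit counts per page against the 10/10/1 breakdown expected for 21 results should make the mis-ordering visible immediately.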