petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.
The Unlicense

Scraper for biorxiv and medrxiv #5

petermr opened this issue 4 years ago (status: Open)

petermr commented 4 years ago

PMR has already written a scraper but it's not optimal and needs cleaning.

More later

petermr commented 4 years ago

This is organized as a picocli commandline (as is almost all of AMI). My current style is to develop new functionality as tests based on the commandline, and then add it to the JAR. Here's the first test:

    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.hamcrest.MatcherAssert;
    import org.junit.Test;

    @Test
    public void testBiorxivSmall() throws Exception {

        File target = new File("target/biorxiv1");
        FileUtils.deleteDirectory(target);
        MatcherAssert.assertThat(target + " does not exist", !target.exists());
        String args = 
                "-p " + target
                + " --site biorxiv"       // the type of site
                + " --query coronavirus"  // the query
                + " --pagesize 1"         // size of remote pages (may not always work)
                + " --pages 1 1"          // number of pages
                + " --resultset raw clean"
                + " --landingpage"
                + " --fulltext html pdf"
//              + " --limit 500"          // total number of downloaded results
            ;
        new AMIDownloadTool().runCommands(args);
    }

This should translate to the following, where `<target>` is the local directory; pagesize would normally be larger (e.g. 25):

ami-download -p <target> --site biorxiv --query coronavirus --pagesize 25 --pages 1 1 \
  --resultset raw clean --landingpage --fulltext html pdf --limit 500

Please try this, and try some of the others. NOTE: some of the test files may be in my local directory and need transferring to src/test/resources/. This was to save space in the JAR and repo.
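The test above builds a single space-separated args string and hands it to `runCommands`. As a rough illustration of that pattern (a hypothetical hand-rolled parser for the sketch — AMI's real parsing is done by picocli, and `ArgsSketch` is not part of AMI), the string tokenizes into option/value groups like this:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: shows how a space-joined args string like the
// one in testBiorxivSmall() splits into option -> values groups.
// AMI itself delegates this to picocli; this class is hypothetical.
public class ArgsSketch {

    // Parse e.g. "--pages 1 1 --landingpage" into {"--pages": [1, 1], "--landingpage": []}.
    static Map<String, List<String>> parse(String args) {
        Map<String, List<String>> opts = new LinkedHashMap<>();
        List<String> current = null;
        for (String tok : args.trim().split("\\s+")) {
            if (tok.startsWith("-")) {
                current = new ArrayList<>();
                opts.put(tok, current);       // new option starts a new value list
            } else if (current != null) {
                current.add(tok);             // bare token belongs to the last option
            }
        }
        return opts;
    }

    public static void main(String[] argv) {
        String args = "-p target/biorxiv1 --site biorxiv --query coronavirus"
                + " --pagesize 1 --pages 1 1 --resultset raw clean"
                + " --landingpage --fulltext html pdf";
        Map<String, List<String>> opts = ArgsSketch.parse(args);
        System.out.println(opts.get("--site"));        // single-valued option
        System.out.println(opts.get("--pages"));       // multi-valued option
        System.out.println(opts.get("--landingpage")); // flag with no values
    }
}
```

Note that this simple split means option values themselves cannot contain spaces, which is fine for the flags shown here.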