petermr commented 4 years ago

Face masks

@axiomsofchoice has suggested we take "face masks" (do they work?) as our first project; he is also physically making 1000 of them in Makespace. So here are some thoughts on the workflow and tasks.

search/scrape

We should concentrate on freely readable legal sources. This includes anything pointed to by Unpaywall, readable without login, but NOT SciHub or icanhazpdf. Here's a rough list in rough order:

EuropePMC

The first place to go and the default for getpapers. Provides XML and PDF. Nothing extra needed.
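For anyone who wants to query EuropePMC without getpapers, here is a minimal sketch that only builds a search URL for the public EuropePMC REST search endpoint. The function name `build_search_url`, the example phrases and the choice to restrict to open access are my assumptions for illustration, not what getpapers does internally.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(terms, open_access_only=True, page_size=25):
    """Build a search URL matching any of the given phrases (hypothetical helper)."""
    # Quote each phrase so multi-word terms are searched as phrases.
    query = " OR ".join(f'"{term}"' for term in terms)
    if len(terms) > 1:
        query = f"({query})"
    if open_access_only:
        # Restrict to open-access records (an assumption about what we want).
        query += " AND OPEN_ACCESS:y"
    params = {
        "query": query,
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
    }
    return f"{EUROPEPMC_SEARCH}?{urlencode(params)}"


url = build_search_url(["face mask", "surgical mask"])
print(url)
```

The URL can then be fetched with `urllib.request.urlopen(url)` and the JSON response parsed with `json.load`; that step is left out so the sketch runs offline.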

biorxiv and medrxiv

Fulltext in HTML but no API. @petermr has built a prototype scraper, but there's a cascade of lazy-loaded and landing pages to navigate. HTML -> XML and PDF. The scraper needs updating.

Royal Society

Fulltext in PDF. Needs a lazy scraper.

Theses

Fulltext in PDF. Very valuable as they are additional material. BUT there are multiple sites with arcane landing pages and logins. Aggregated with CORE (UK, may need login - not happy about that), HAL (FR) and DARE (NL). Andy Jackson may have better knowledge.

Redalyc

Mexico, but usually in EN. May need lazy loader.

Metadata

Metadata from all of these sources should be converted to JATS. Most HTML <meta> can be JATS-ised - I have written equivalencers.
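To make "JATS-ising" concrete, here is a small sketch (not the equivalencers mentioned above). It assumes the landing page carries Highwire-style `citation_*` meta tags and maps three of them onto JATS front-matter elements. The class and function names, the sample HTML and the DOI in it are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class MetaCollector(HTMLParser):
    """Collect the content of <meta name=... content=...> tags, keyed by name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.meta.setdefault(name.lower(), []).append(content)


def meta_to_jats(html):
    """Map a few citation_* meta tags to a JATS-like <article> element."""
    collector = MetaCollector()
    collector.feed(html)
    meta = collector.meta

    article = ET.Element("article")
    article_meta = ET.SubElement(ET.SubElement(article, "front"), "article-meta")

    for doi in meta.get("citation_doi", []):
        ET.SubElement(article_meta, "article-id", {"pub-id-type": "doi"}).text = doi
    for title in meta.get("citation_title", [])[:1]:
        group = ET.SubElement(article_meta, "title-group")
        ET.SubElement(group, "article-title").text = title
    authors = meta.get("citation_author", [])
    if authors:
        group = ET.SubElement(article_meta, "contrib-group")
        for author in authors:
            contrib = ET.SubElement(group, "contrib", {"contrib-type": "author"})
            # string-name avoids guessing how to split surname and given names.
            ET.SubElement(contrib, "string-name").text = author
    return article


# Made-up sample page head; the DOI is a placeholder, not a real article.
SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="Do face masks work?" />
<meta name="citation_author" content="A. Author" />
<meta name="citation_author" content="B. Author" />
<meta name="citation_doi" content="10.1101/example" />
</head><body></body></html>
"""

article = meta_to_jats(SAMPLE_HTML)
print(ET.tostring(article, encoding="unicode"))
```

A real mapping would need many more fields (dates, journal, abstract) and per-source quirks, which is where the coordination TO DO comes in.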

TO DO coordinate the metadata extraction.

Body

EuropePMC provides XML, which is already catered for.

PDF

PDF needs converting to text, ideally HTML. Full conversion preserves formatting, styles and font weights, which are important in scientific documents. Most pdf-to-html tools produce flat ASCII text, which is highly usable but not perfect. Many tools do not recognize sections.
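As a rough illustration of recovering sections from flat text, here is a heuristic sketch that splits on lines consisting only of a conventional heading, optionally numbered. The heading list, the `front` bucket and the function name are assumptions; real converter output will need more rules than this.

```python
import re

# Conventional section names (an assumed, incomplete list).
SECTION_NAMES = [
    "abstract", "introduction", "materials and methods", "methods",
    "results", "discussion", "conclusions", "references",
]

# A heading line: optional number ("1." or "2"), then a known section name alone.
HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(re.escape(n) for n in SECTION_NAMES) + r")\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Group the lines of flat text under the most recent recognised heading."""
    sections = {}
    current = "front"  # everything before the first heading
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}


# Made-up flat text standing in for converter output.
SAMPLE_TEXT = """Do face masks work?
Abstract
We review the evidence.
1. Introduction
Masks are widely used.
2 Methods
We searched EuropePMC.
"""

sections = split_sections(SAMPLE_TEXT)
for name, body in sections.items():
    print(f"{name}: {body}")
```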

TO DO check PDF conversion for RoyalSoc
TO DO check PDF for Redalyc?
TO DO check PDF for biorxiv

Html

biorxiv and medrxiv need converting to XHTML (I may already have done this).

Dictionaries

TO DO create relevant dictionaries for face masks. @axiomsofchoice to create some wordlists

Indexing

TO DO Will SOLR index XML or do we need to flatten?

anjackson commented 4 years ago

I'm also thinking of that request that came in looking for evidence of mask effectiveness for typical medical procedures, e.g. should we try to build up the co-occurrence matrix for these two dictionaries?

Medical procedures https://en.wikipedia.org/wiki/Medical_procedure
Terms relating to surgical masks and safety ('surgical mask', 'n95', 'droplets', 'viruses', 'bacteria' etc.)
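A minimal sketch of what that matrix could look like, counting for each (procedure, mask term) pair the number of documents in which both appear. The two tiny wordlists and the three sample sentences are made up for illustration; real dictionaries would be much larger and the documents would be full texts.

```python
import re
from collections import Counter
from itertools import product


def find_terms(text, terms):
    """Return the terms that occur in text as whole words, ignoring case."""
    return {
        term for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
    }


def cooccurrence(docs, row_terms, col_terms):
    """Count documents in which a row term and a column term both occur."""
    counts = Counter()
    for text in docs:
        rows = find_terms(text, row_terms)
        cols = find_terms(text, col_terms)
        for pair in product(rows, cols):
            counts[pair] += 1
    return counts


# Made-up wordlists and documents.
procedures = ["intubation", "surgery"]
mask_terms = ["surgical mask", "n95", "droplets"]
docs = [
    "N95 respirators reduced exposure to droplets during intubation.",
    "Surgical mask use in surgery was compared with no mask.",
    "Droplets were measured after intubation and surgery.",
]

matrix = cooccurrence(docs, procedures, mask_terms)
for procedure in procedures:
    print(procedure, [matrix[(procedure, term)] for term in mask_terms])
```

Counting per document is the simplest choice; counting per sentence or per section would give a tighter notion of co-occurrence at the cost of needing reliable sentence or section splitting.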

petermr commented 4 years ago

That's a lovely Wikipedia page you've found. I'll show how to make the dictionary.

The surgical mask dictionary is less clear. We can probably hoover some terms from Wikipedia pages.


petermr / openVirus

Face Mask knowledge extraction #30