Open petermr opened 4 years ago
I'm also thinking of that request that came in looking for evidence of mask effectiveness for typical medical procedures. For example, should we try to build up the co-occurrence matrix for these two dictionaries?
That's a lovely Wikipedia page you've found. I'll show how to make the dictionary.
The surgical mask is less clear. We can probably hoover some terms from Wikipedia pages.
On Wed, Apr 1, 2020 at 10:06 PM Andy Jackson notifications@github.com wrote:
I'm also thinking of that request that came in looking for evidence of mask effectiveness for typical medical procedures. For example, should we try to build up the co-occurrence matrix for these two dictionaries?
- Medical procedures https://en.wikipedia.org/wiki/Medical_procedure
- Terms relating to surgical masks and safety ('surgical mask', 'n95', 'droplets', 'viruses', 'bacteria' etc.)
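A co-occurrence matrix for the two dictionaries could be sketched roughly as below. This is only an illustration under stated assumptions: the procedure terms and the three sample sentences are made up for the example, matching is naive lowercase substring search, and counts are per document rather than per sentence.

```python
from collections import Counter
from itertools import product

# Hypothetical wordlists; the real ones would be harvested from Wikipedia.
PROCEDURES = ["intubation", "bronchoscopy", "suction"]
MASK_TERMS = ["surgical mask", "n95", "droplets"]


def cooccurrence(docs, rows, cols):
    """Count documents in which a row term and a column term both appear."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        rows_hit = [t for t in rows if t in lowered]
        cols_hit = [t for t in cols if t in lowered]
        for pair in product(rows_hit, cols_hit):
            counts[pair] += 1
    # Dense matrix: one row per term in `rows`, one column per term in `cols`.
    return [[counts[(r, c)] for c in cols] for r in rows]


# Made-up sentences standing in for paper full text.
docs = [
    "N95 respirators were worn during intubation to limit droplets.",
    "A surgical mask was used during bronchoscopy.",
    "Intubation without a surgical mask.",
]
matrix = cooccurrence(docs, PROCEDURES, MASK_TERMS)
for term, row in zip(PROCEDURES, matrix):
    print(term, row)
```

Substring matching will over-count (for instance a short term inside a longer word), so a real run would probably want tokenisation or word-boundary matching.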
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Face masks
@axiomsofchoice has suggested we take "face mask" - do they work? - as our first project; he is also physically making them - 1000 - in Makespace. So here are some thoughts on the workflow and tasks.
search/scrape
We should concentrate on Freely readable legal sources - this includes anything pointed to by Unpaywall, readable without login, but NOT SciHub or icanhazpdf. Here's a rough list in rough order:
EuropePMC
The first place to go and default for getpapers. XML and PDF. Nothing needed.
biorxiv and medrxiv
Fulltext in HTML but no API. @petermr has built a prototype scraper, but there's a cascade of lazy loading and landing pages to navigate. HTML -> XML and PDF. Scraper needs updating.
Royal Society
Fulltext in PDF. Needs a lazy scraper.
Theses
Fulltext in PDF. Very valuable as they are additional. BUT multiple sites with arcane landing pages and logins. Aggregated with CORE (UK, may need login - not happy about that), HAL (FR), DARE (NL). Andy Jackson may have better knowledge.
Redalyc
Mexico, but usually in EN. May need lazy loader.
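For the EuropePMC entry in the list above, a query restricted to open-access records might look something like the sketch below. It talks to the Europe PMC REST search service directly rather than going through getpapers; the endpoint, the `OPEN_ACCESS:y` filter and the `resultList`/`result`/`pmcid` response fields are taken from my reading of the public API and should be checked against the current documentation. The sample response is a hand-written placeholder, not real data.

```python
import json
from urllib.parse import urlencode

# Europe PMC REST search endpoint (assumed from the public documentation).
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def search_url(query, page_size=25):
    """Build a search URL restricted to open-access records."""
    params = {
        "query": f"({query}) AND OPEN_ACCESS:y",
        "format": "json",
        "pageSize": page_size,
    }
    return f"{BASE}?{urlencode(params)}"


def open_access_ids(payload):
    """Return PMCIDs from a decoded response, skipping records without one."""
    results = payload.get("resultList", {}).get("result", [])
    return [r["pmcid"] for r in results if r.get("pmcid")]


# Placeholder response; a real one would come from
# urllib.request.urlopen(search_url(...)) followed by json.load().
sample = json.loads(
    '{"resultList": {"result": [{"pmcid": "PMC0000001"}, {"id": "123"}]}}'
)
url = search_url('"face mask"')
ids = open_access_ids(sample)
print(url)
print(ids)
```

Keeping URL construction and response parsing separate means the parsing can be tried out offline before any requests are sent.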
Metadata
The systems should all be converted to create JATS. Most HTML <meta> can be JATS-ised - I have written equivalencers.
TO DO coordinate the metadata extraction.
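As a rough idea of what JATS-ising HTML metadata could involve, the sketch below maps three Highwire-style `citation_*` meta tags onto JATS-like elements. It is not the equivalencer code mentioned above; the choice of tags, the `meta_to_jats` name and the sample page are assumptions for illustration, and the output is not validated against the JATS DTD.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name") and a.get("content"):
                self.pairs.append((a["name"], a["content"]))


def meta_to_jats(html):
    """Map a few citation_* meta tags onto a JATS-style <article-meta>."""
    collector = MetaCollector()
    collector.feed(html)
    meta = ET.Element("article-meta")
    # Emit in the order article-id, title-group, contrib-group.
    for name, content in collector.pairs:
        if name == "citation_doi":
            ET.SubElement(meta, "article-id", {"pub-id-type": "doi"}).text = content
    for name, content in collector.pairs:
        if name == "citation_title":
            group = ET.SubElement(meta, "title-group")
            ET.SubElement(group, "article-title").text = content
    authors = [c for n, c in collector.pairs if n == "citation_author"]
    if authors:
        group = ET.SubElement(meta, "contrib-group")
        for author in authors:
            contrib = ET.SubElement(group, "contrib", {"contrib-type": "author"})
            ET.SubElement(contrib, "string-name").text = author
    return meta


# Invented landing-page fragment with placeholder values.
page = """<html><head>
<meta name="citation_title" content="An example title">
<meta name="citation_author" content="A. Author">
<meta name="citation_doi" content="10.0000/example">
</head></html>"""
jats = meta_to_jats(page)
print(ET.tostring(jats, encoding="unicode"))
```

Tags that are not in the mapping are silently ignored here; a coordinated extraction would probably want to log them so that gaps in the mapping become visible.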
Body
EuropePMC provide XML which is already catered for.
PDF
PDF needs converting to text, ideally HTML. Full conversion includes formatting, styles and weights, which are important in scientific documents. Most pdf-to-html tools produce flat ASCII text, which is highly usable but not perfect. Many tools do not recognize sections.
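Where a converter yields only flat text, sections can sometimes be recovered with a simple heading heuristic. The sketch below assumes headings sit on their own line, optionally numbered, and come from a small fixed vocabulary; both the vocabulary and the sample text are invented for illustration, and real papers will need something more forgiving.

```python
import re

# Assumed heading vocabulary; real papers vary widely.
HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?"
    r"(abstract|introduction|methods|results|discussion|references)\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Group lines of flat text under the most recent recognised heading."""
    sections = {"front": []}
    current = "front"
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return sections


# Invented flat text of the kind a PDF-to-text tool might emit.
flat = """Masks and droplets
Abstract
We review the evidence.
1. Introduction
Masks are widely used.
2 Methods
We searched the literature.
"""
sections = split_sections(flat)
for name, lines in sections.items():
    print(name, lines)
```

Anything before the first recognised heading is kept under "front", so title and author lines are not lost.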
Html
biorxiv and medrxiv need converting to XHTML (I may already have done this).
Dictionaries
TO DO create relevant dictionaries for face masks. @axiomsofchoice to create some wordlists
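A wordlist could be turned into a simple XML dictionary along the lines below. The `dictionary`/`entry` layout with `term` and `name` attributes is my assumption about a suitable format, loosely modelled on ContentMine-style dictionaries, and `build_dictionary` is a hypothetical helper rather than an existing tool. The terms are the ones suggested earlier in the thread.

```python
import xml.etree.ElementTree as ET


def build_dictionary(title, terms):
    """Build a <dictionary> element with one <entry> per unique term."""
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for term in terms:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            # term: normalised form used for matching; name: display form.
            ET.SubElement(root, "entry", {"term": key, "name": term.strip()})
    return root


# The trailing "N95 " is a deliberate duplicate to show de-duplication.
words = ["surgical mask", "n95", "droplets", "viruses", "bacteria", "N95 "]
dictionary = build_dictionary("facemask", words)
print(ET.tostring(dictionary, encoding="unicode"))
```

Lowercasing and de-duplicating at build time keeps hand-collected wordlists from producing repeated entries.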
Indexing