nestauk / dap_aria_mapping

Mapping technology innovation to support The Advanced Research and Innovation Agency (ARIA)
MIT License

54 measuring novelty #59

Closed beingkk closed 1 year ago

beingkk commented 1 year ago

Closes #54

This is the first PR for the novelty measurement component of the project.

It contributes:

The usage examples of the script are:

All levels, full dataset (takes about 30 mins)

python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py

One level, test dataset (a few seconds)

python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py --taxonomy-level 1 --test

The outputs for each taxonomy level are:

The outputs are presently stored locally. Happy to create new issues for storing them on S3, and for any other improvements you'd like to suggest.

Also let me know if I need to write tests - I'd be happy to leave that for another issue so that we can move on with generating results.

For more context: the novelty score is calculated using the approach described in this paper (Lee et al. 2015). This so-called "U-measure" has been shown (Bornmann et al. 2019) to have some correlation with what researchers consider novel papers. However, note that I have adapted it for combinations of topics, whereas originally it was used for combinations of citations/cited journals. This will likely create some challenges (e.g., citations in a way reflect the full content of the paper, whereas abstracts are much shorter).
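To make the adaptation concrete, here is a minimal, self-contained sketch of how a commonness-based novelty score over topic pairs could look. This is an illustrative reading of the Lee et al. (2015) idea applied to within-paper topic combinations, not the actual pipeline code; all function and variable names are hypothetical.

```python
# Illustrative sketch only: a commonness-based novelty score over topic
# pairs, loosely adapted from Lee et al. (2015). Not the pipeline's API.
from collections import Counter
from itertools import combinations
from math import log


def novelty_scores(papers_topics):
    """papers_topics: list of topic lists, one per paper (same time period).

    Returns one novelty score per paper, or None for papers with < 2 topics.
    """
    pair_counts = Counter()
    topic_counts = Counter()
    for topics in papers_topics:
        unique = sorted(set(topics))
        topic_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    n_pairs = sum(pair_counts.values())

    scores = []
    for topics in papers_topics:
        unique = sorted(set(topics))
        # Commonness of pair (i, j): N_ij * N_total_pairs / (N_i * N_j),
        # so rarer-than-expected pairings score low commonness.
        commonness = [
            pair_counts[p] * n_pairs / (topic_counts[p[0]] * topic_counts[p[1]])
            for p in combinations(unique, 2)
        ]
        # Paper-level novelty: -log of the paper's least common pairing
        # (one plausible aggregation choice; others exist).
        scores.append(-log(min(commonness)) if commonness else None)
    return scores
```

For example, in a corpus where topics "a" and "b" co-occur often but "a" and "c" rarely do, a paper combining "a" and "c" would get the higher novelty score.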

From Lee et al 2015:

[Screenshot (2023-03-02): the U-measure definition from Lee et al. 2015]

From Bornmann et al 2019

The results for novelty score U are (mostly) in agreement with our expectations concerning the results for the different tags. We found, for instance, that for a standard deviation increase in novelty score U, the expected number of assignments of the “new finding” tag increases by 7.47% (the result is statistically significant). The results further show that this indicator seems to be especially suited to identifying papers suggesting new targets for drug discovery

Checklist:

beingkk commented 1 year ago

Thanks very much @emily-bicks, I will aim to adjust according to your comments and resubmit by end of week or Monday.

ampudia19 commented 1 year ago

Very nice PR Karlis, I haven't had a chance to run it (my laptop is hovering over 90% CPU & memory usage, unfortunately), but I can't immediately see anything wrong.

A question I do have is where you plan to take this. If you aim to recreate the paper's measure, i.e. not pairs of topics (or journals) but rather pairs of referenced topics (or journals), iterrows-ing over pandas DataFrames is not going to work, and you'll have to resort either to parallelisable approaches, polars (no idea how this one works), or vectorising your ops. I'm sure you've given this some thought, but I still want to flag it while we have time.
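On the vectorisation point, one common pattern is to count topic pairs via a self-merge on the paper id plus a groupby, instead of any row-by-row loop. This is a hedged sketch with hypothetical column names, not the project's actual schema:

```python
import pandas as pd

# Hypothetical long-format input: one row per (paper, topic) assignment.
df = pd.DataFrame({
    "paper_id": [1, 1, 2, 2, 2],
    "topic":    ["a", "b", "a", "b", "c"],
})

# Self-join on paper_id; pandas suffixes the duplicated column as
# topic_x / topic_y. Keep each unordered pair once (topic_x < topic_y).
pairs = df.merge(df, on="paper_id")
pairs = pairs[pairs["topic_x"] < pairs["topic_y"]]

# Corpus-wide pair frequencies, computed without a Python-level loop.
pair_counts = pairs.groupby(["topic_x", "topic_y"]).size()
```

The self-join is quadratic in topics-per-paper, but since papers carry few topics each, it tends to scale far better than iterrows over millions of rows.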

Being able to recreate the Lee 2015 measure would be nice for validation (as we could then invoke Bornmann's argument that it roughly works). It would also serve as the perfect baseline for our simpler approach (no reference papers, only within-paper topic combinations), in a way that would let us check how much we lose by ignoring citation info.

beingkk commented 1 year ago

Thanks very much @ampudia19! I will aim to incorporate the easy-to-do suggestions in this PR by end of Monday.

In terms of where to take this next, I'd like to approach this iteratively, with developing the minimum viable product and building on top of it as much as time allows.

The present implementation is Step 1 - I think we can see it as an almost-ready MVP, as we can use it to spot "uncommon" combinations of taxonomy topics.

Step 2: I suspect it will still need some filtering to remove noise (e.g., low-frequency, uninteresting, random combinations of topics) and light sense-checking (e.g., browsing the most and least "novel" papers and seeing if the results make sense).

Step 3: I would like to focus on aggregating the novelty scores to provide useful input into the dashboard (will need to discuss the details during our catch up)

Step 4: Apply the same pipeline to patents.

Step 5: Only then consider improvements or changes to the novelty score. I think trying to re-implement the citation-based version would be a good bet - it would be interesting to compare the results with the topic-based version (if we can get the journal names of the cited papers). In terms of computational cost, if it's an order-of-magnitude increase then it might still be doable with the present implementation.

I think the other option you suggest, using the abstracts of the cited papers and detecting topics in those, will be a bigger challenge (lots more data, more optimisation) - I'm doubtful I'll be able to complete that by end of March...

beingkk commented 1 year ago

Hey @ampudia19 and @emily-bicks, thanks again for the quick review!

I've responded to most of your comments (see below). I haven't changed from typer to argparse - if you insist, perhaps I can create a new issue and address it a bit later? If yes, then happy to merge now.
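For reference, the argparse equivalent of the script's two flags would look roughly like this. This is a sketch of the reviewer's suggestion, not the actual script (which uses typer); the description strings are illustrative:

```python
import argparse

# Sketch of an argparse-based CLI mirroring the script's typer flags.
parser = argparse.ArgumentParser(
    description="Compute novelty scores for OpenAlex papers"
)
parser.add_argument(
    "--taxonomy-level", type=int, default=None,
    help="Run a single taxonomy level (default: all levels)",
)
parser.add_argument(
    "--test", action="store_true",
    help="Run on the small test dataset",
)

# Parsing the flags from the usage example above:
args = parser.parse_args(["--taxonomy-level", "1", "--test"])
```

Both libraries handle this case equally well; the main trade-off is that argparse is stdlib-only while typer derives the CLI from type hints.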

emily-bicks commented 1 year ago

Merge away!

On Fri, 3 Mar 2023 at 11:45, Karlis Kanders wrote:

  • Saving the outputs in S3 and then writing getters
  • Moving the script out of the notebook directory and into pipelines
  • Adding a command line option --save-to-local to save locally
  • Typing
  • Adding missing "Returns" sections in docstrings
  • More informative function names
  • Briefly describing the method / calculation, for code legacy's sake
  • Saving everything as parquets
