beingkk closed this pull request 1 year ago
Thanks very much @emily-bicks, I will aim to adjust according to your comments and resubmit by end of week or Monday.
Very nice PR Karlis, I haven't had a chance to run it (my laptop is hovering over 90% CPU & memory usage unfortunately), but I can't immediately see anything wrong.
A question I do have is where you plan to take this. If you aim to recreate the paper's measure, i.e. not pairs of topics (or journals) but pairs of referenced topics (or journals), iterating over pandas DataFrames with iterrows is not going to work, and you'll have to resort to parallelizable approaches, polars (no idea how that one works), or vectorizing your ops. I'm sure you've given this some thought, but I still want to flag it while we have time.
Being able to recreate the Lee 2015 measure would be nice for validation (as we can then lean on Bornmann's argument that it roughly works). It would also serve as the perfect baseline for our simpler approach (no reference papers, only within-paper topic combinations), letting us check how much we lose by ignoring citation info.
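To illustrate the vectorization point, here is a minimal sketch (toy table, my own column and variable names, not the PR's actual code) of counting within-paper topic pairs with a vectorized self-merge instead of iterrows:

```python
import pandas as pd

# Toy long-format table of (paper_id, topic) assignments (illustrative only)
df = pd.DataFrame(
    {"paper_id": [1, 1, 2, 2, 2], "topic": ["A", "B", "A", "B", "C"]}
)

# Self-merge on paper_id yields all within-paper topic combinations;
# the inequality keeps each unordered pair once and drops self-pairs.
pairs = df.merge(df, on="paper_id")
pairs = pairs[pairs["topic_x"] < pairs["topic_y"]]
pair_counts = pairs.groupby(["topic_x", "topic_y"]).size()
```

The same pattern scales to the pairs-of-referenced-topics variant, since the merge and groupby stay in vectorized pandas code throughout.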
Thanks very much @ampudia19! I will aim to incorporate the easy to do suggestions in this PR by end of Monday.
In terms of where to take this next, I'd like to approach it iteratively: develop a minimum viable product first and build on top of it as much as time allows.
The present implementation is Step 1 - I think we can see it as an almost-ready MVP, as we can already use it to spot "uncommon" combinations of taxonomy topics.
Step 2: I suspect it will still need some filtering to remove noise (e.g. low-frequency, non-interesting, random combinations of topics) and light sense-checking (e.g. browsing the most and least "novel" papers and seeing whether the results make sense).
Step 3: I would like to focus on aggregating the novelty scores to provide useful input into the dashboard (will need to discuss the details during our catch up)
Step 4: Apply the same pipeline to patents.
Step 5: Only then consider improvements or changes to the novelty score. I think re-implementing the citation-based version would be a good bet - it would be interesting to compare the results with the topic-based version (if we can get the journal names of the cited papers). In terms of computational cost, if it's an order of magnitude increase it might still be doable with the present implementation.
I think the other option you suggest - using the abstracts of the cited papers and detecting topics in those - will be a bigger challenge (lots more data, more optimisation). I'm doubtful I'll be able to complete that by end of March...
Hey @ampudia19 and @emily-bicks, thanks again for the quick review!
I've responded to most of your comments (see below). I haven't changed from typer to argparse - if you insist, perhaps I can create a new issue and address it a bit later? If yes, then happy to merge now.
Merge away!
On Fri, 3 Mar 2023 at 11:45, Karlis Kanders wrote:
- Saving the outputs in S3 and then write getters
- Moving the script out of the notebook directory and into pipelines
- Adding a command line option for --save-to-local to save locally
- Typing
- Missing "Returns" in docstrings
- More informative function names
- Briefly describing the method / calculation, for code legacy sake
- Saving everything as parquets
Closes #54
This is a first PR for the novelty measurement component of the project.
It contributes:
dap_aria_mapping/utils/novelty_utils.py
dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py
The usage examples of the script are:
All levels, full dataset (takes about 30 mins)
One level, test dataset (a few seconds)
The outputs for each taxonomy level are:
The outputs are presently stored locally. Happy to create new issues for storing them on S3, and for any other improvements you'd like to suggest.
Also let me know if I need to write tests - I'd be happy to leave that for another issue so we can move on with generating results.
For more context: the novelty score is calculated using the approach described in this paper (Lee et al 2015). This so-called "U-measure" has been shown (Bornmann et al 2019) to have some correlation with what researchers consider novel papers. Note, however, that I have adapted it for combinations of topics, whereas it was originally used for combinations of citations/cited journals. This will likely create some challenges (e.g. citations in a way reflect the full content of the paper, whereas abstracts are much shorter).
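For concreteness, a minimal sketch of a Lee et al (2015)-style commonness score adapted to topic pairs. The function and variable names are my own, and the aggregation to a paper-level score (here: negative log of the least common pair) is one simple choice, not necessarily what novelty_utils.py implements:

```python
from collections import Counter
from itertools import combinations
import math

def pair_commonness(papers_topics):
    """Commonness of each within-paper topic pair: the observed pair count,
    normalised by how often each member topic appears in any pair."""
    pair_counts = Counter()
    topic_in_pairs = Counter()
    total_pairs = 0
    for topics in papers_topics:
        for a, b in combinations(sorted(set(topics)), 2):
            pair_counts[(a, b)] += 1
            topic_in_pairs[a] += 1
            topic_in_pairs[b] += 1
            total_pairs += 1
    return {
        (a, b): n * total_pairs / (topic_in_pairs[a] * topic_in_pairs[b])
        for (a, b), n in pair_counts.items()
    }

def paper_novelty(topics, commonness):
    """Novelty as the negative log of the paper's least common topic pair.
    (Lee et al aggregate over the pair distribution of cited journals;
    this min-based variant is a simplification for illustration.)"""
    scores = [
        commonness.get((a, b), 0.0)
        for a, b in combinations(sorted(set(topics)), 2)
    ]
    if not scores or min(scores) <= 0:
        return float("nan")
    return -math.log(min(scores))
```

Rare pairs get low commonness and hence high novelty; the per-year conditioning used in the paper (computing commonness within each publication year) is omitted here for brevity.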
From Lee et al 2015:
From Bornmann et al 2019:
Checklist:
- I have refactored my code out from notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have updated the README