Checklist:

[X] I have refactored my code out from notebooks/
[X] I have checked the code runs
[X] I have tested the code
[X] I have run pre-commit and addressed any issues not automatically fixed
[X] I have merged any new changes from dev
[X] I have documented the code
- [X] Major functions have docstrings
- [X] Appropriate information has been added to READMEs
[X] I have explained this PR above
[X] I have requested a code review

india-kerle commented 1 year ago

It's hard to leave comments on the notebook so I've gone through it and added comments here.

These comments are in order as I review the notebook:

It takes approximately 30 seconds to plot publications over time using the hist_plot function - I would just add a hashed comment to say it will take that long. I would do the same for the clusters_keywords_plot function.
It looks like you have two sections called the "basics of text analysis" that cover slightly different things i.e. the first is more about innovation mapping and the second is the actual methods. I would rename them to make clear the difference between the two sections.
I love the colored text! I think it's a really nice visual for text preprocessing.
I think you should either note in text_analysis/README.md that you need to download spaCy's model, or add a comment where you hash out !spacy download en_core_web_sm to say you will need to unhash it to download the model.
I get the following error in the cell after "Let's now do the same at scale. We will use multi-processing with spaCy's pipelines to build a preprocessing pipeline": AttributeError: Can't pickle local object 'install_spacy_extensions.<locals>.<lambda>'. I was able to resolve this by not batching, i.e. getting rid of n_process and batch_size. It took 1 minute 58 seconds to run the cell without batching. You could just use a smaller sample of abstracts to make sure it can run without batch processing.
honestly long live a word cloud
When you instantiate the count vectoriser/tf-idf vectorisers, I would add a comment to say you can refer to the documentation to understand what the parameters mean.
I ran into the same issue when you build the batch_generator with the preprocessing pipe for patents - the cell takes 11 seconds to run, so I think it's OK not to use it in the first place.
small spelling in the k-means section: should be 'in this step' not 'n this step'
The silhouette score for loop is not in a Python cell - it needs converting.
When I get to the BERTopic section, I get the following error: SystemError: initialization of _internal failed without raising an exception. I've searched and it looks like this is due to a numpy error - I think you need to pin numpy as numpy==1.23.5 in requirements.txt.
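
For reference, the pickling error above can be reproduced without spaCy. This is only a minimal sketch, assuming the extension is registered with a lambda defined inside a function; `install_extensions_local`, `lowercase_token` and `is_picklable` are hypothetical names for illustration, not code from the repo. Multiprocessing has to pickle whatever it sends to worker processes, and a lambda created inside a function cannot be looked up by name, whereas a function defined at module level normally can.

```python
import pickle


def install_extensions_local():
    """Hypothetical stand-in: returns a lambda defined inside a function."""
    return lambda token: token.lower()


def lowercase_token(token):
    """Module-level alternative: pickle can find it by its qualified name."""
    return token.lower()


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, as worker processes require."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Local lambdas fail here with "Can't pickle local object ..."
        return False


local_ok = is_picklable(install_extensions_local())
module_ok = is_picklable(lowercase_token)
print("local lambda picklable:", local_ok)
print("module-level function picklable:", module_ok)
```

If that assumption holds for the notebook, moving the lambda to a named module-level function may let n_process stay in place; dropping n_process and batch_size, as described above, avoids the issue altogether.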

india-kerle commented 1 year ago

These comments are overall in terms of content and organisation:

I think there's so much material in the one notebook that I would split it into quite a few different notebooks actually - the basics of text analysis, the innovation mapping exercise in the beginning, the application of a paper etc. This way it's not as overwhelming to folks.
I would also ask more interactive questions i.e. questions like can you refer to spacy documentation and preprocess the text in the same way using this library? This way, there are more questions/tasks for people to do instead of just running cells.
It's super detailed and really impressive - I think more than enough material for people to use code for their respective projects!! I've honestly learned a ton just going through the notebook myself.

ampudia19 commented 1 year ago

It takes approximately 30 seconds to plot publications over time using the hist_plot function - I would just add a hashed comment to say it will take that long. I would do the same for the clusters_keywords_plot function.

[X] I have changed the cells to raw, and written a disclaimer that these are slow and to be run at one's own risk.

It looks like you have two sections called the "basics of text analysis" that cover slightly different things i.e. the first is more about innovation mapping and the second is the actual methods. I would rename them to make clear the difference between the two sections.

[X] I have removed the first label, as it revolves more around data exploration - utils that may be useful for attendees - and doesn't delve into text analysis.

I love the colored text! I think it's a really nice visual for text preprocessing.

I think you should either note in text_analysis/README.md that you need to download spaCy's model, or add a comment where you hash out !spacy download en_core_web_sm to say you will need to unhash it to download the model.

[X] Separated the tutorials and added your suggestions to the README.
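
A possible complement to the README note: a small guard cell that tells attendees when the model is missing. This is only a sketch, assuming the model is installed as a regular Python package named en_core_web_sm (which is how spaCy distributes its models); `model_installed` is a hypothetical helper, not something in the repo.

```python
from importlib.util import find_spec

MODEL_NAME = "en_core_web_sm"


def model_installed(name=MODEL_NAME):
    """Return True if a package with this name can be found for import."""
    return find_spec(name) is not None


if not model_installed():
    # Only prints the command; nothing is downloaded automatically.
    print(f"Model missing. Run: python -m spacy download {MODEL_NAME}")
```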

I get the following error in the cell after "Let's now do the same at scale. We will use multi-processing with spaCy's pipelines to build a preprocessing pipeline": AttributeError: Can't pickle local object 'install_spacy_extensions.<locals>.<lambda>'. I was able to resolve this by not batching, i.e. getting rid of n_process and batch_size. It took 1 minute 58 seconds to run the cell without batching. You could just use a smaller sample of abstracts to make sure it can run without batch processing.

[X] I've included the disclaimer and made it so that one can just load the preprocessed data from S3.

honestly long live a word cloud

When you instantiate the count vectoriser/tf-idf vectorisers, I would add a comment to say you can refer to the documentation to understand what the parameters mean.

[X] Done.

I ran into the same issue when you build the batch_generator with the preprocessing pipe for patents - the cell takes 11 seconds to run, so I think it's OK not to use it in the first place.

[X] Removed arguments, changed object name to avoid confusion.

small spelling in the k-means section: should be 'in this step' not 'n this step'

[X] Done.

The silhouette score for loop is not in a Python cell - it needs converting.

[X] Surprisingly slow, I'll add a comment to convert but prefer to leave it as raw by default.

When I get to the BERTopic section, I get the following error: SystemError: initialization of _internal failed without raising an exception. I've searched and it looks like this is due to a numpy error - I think you need to pin numpy as numpy==1.23.5 in requirements.txt.

[X] Interesting, I have 1.24.4 and it runs fine for me. I've added a heads-up comment and will troubleshoot in Estonia if it happens.
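
One way the heads-up could be made executable, as a sketch only: compare the installed numpy version with the one suggested in the review. The 1.23.5 value comes from the comment above and is not a confirmed fix; `version_tuple` and `numpy_heads_up` are hypothetical helpers.

```python
from importlib import metadata

# Version suggested in the review; an assumption, not a confirmed fix.
SUGGESTED_NUMPY = "1.23.5"


def version_tuple(version):
    """Turn '1.24.4' into (1, 24, 4), keeping only each part's leading digits."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def numpy_heads_up(installed, suggested=SUGGESTED_NUMPY):
    """Return a warning string if the versions differ, otherwise None."""
    if version_tuple(installed) == version_tuple(suggested):
        return None
    return (
        f"numpy {installed} is installed; {suggested} was suggested in review "
        "if the BERTopic import fails."
    )


try:
    message = numpy_heads_up(metadata.version("numpy"))
except metadata.PackageNotFoundError:
    message = "numpy is not installed."

if message:
    print(message)
```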

I think there's so much material in the one notebook that I would split it into quite a few different notebooks actually - the basics of text analysis, the innovation mapping exercise in the beginning, the application of a paper etc. This way it's not as overwhelming to folks.

[X] Great, we are in agreement, did this before I read your comment.

I would also ask more interactive questions i.e. questions like can you refer to spacy documentation and preprocess the text in the same way using this library? This way, there are more questions/tasks for people to do instead of just running cells.

[X] I've added a few, good tip!

india-kerle commented 1 year ago

@ampudia19 thanks for the changes, looks great!! I'm going to merge for you then review deep learning

nestauk / dap_taltech

Close #2 textanalysis #13
