Closed — ampudia19 closed this 1 year ago
It's hard to leave comments on the notebook so I've gone through it and added comments here.
These comments are in order as I review the notebook:
- It takes approx. ~30 seconds to plot publications over time using the `hist_plot` function - I would just add a hashed comment to say it will take that long. I would do the same for the `clusters_keywords_plot` function.
- [X] I have changed the cells to raw, and wrote a disclaimer on these being slow and to be run at one's own risk.
- It looks like you have two sections called the "basics of text analysis" that cover slightly different things i.e. the first is more about innovation mapping and the second is the actual methods. I would rename them to make clear the difference between the two sections.
- [X] I have removed the first label, as it revolves more around data exploration - utils that may be useful for attendees - and doesn't delve into text analysis.
- I love the colored text! I think it's a really nice visual for text preprocessing.
- I think you should either add a note in `text_analysis/README.md` that you need to download spaCy's model, or add a comment where you hash out `!spacy download en_core_web_sm` to say you will need to unhash it so you can download the model.
- [X] Separated the tutorials and added your suggestions to the README.
- I get the following error in the cell after "Let's now do the same at scale. We will use multi-processing with spaCy's pipelines to build a preprocessing pipeline": `AttributeError: Can't pickle local object 'install_spacy_extensions.<locals>.<lambda>'`. I was able to resolve this by not batching, i.e. getting rid of `n_process` and `batch_size`. It took X minutes to run the cell without batch. It took 1 minute 58 seconds to run the cell. You could just use a smaller sample of abstracts to make sure it can run without batch processing.
- [X] I've included the disclaimer and made it so that one can just load the preprocessed data from S3.
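For context, that pickling failure is a generic CPython limitation rather than a spaCy bug: `nlp.pipe(..., n_process=...)` has to serialize its callables to worker processes, and lambdas defined inside a function cannot be pickled. A minimal stdlib sketch reproducing the error (the helper name mirrors the traceback; no spaCy required):

```python
import pickle

def install_spacy_extensions():
    # Stand-in for the notebook's helper: it hands back a lambda, which is a
    # local object and therefore cannot be pickled for multiprocessing.
    return lambda doc: doc.lower()

fn = install_spacy_extensions()
try:
    pickle.dumps(fn)
except Exception as exc:
    # e.g. AttributeError: Can't pickle local object
    #      'install_spacy_extensions.<locals>.<lambda>'
    print(type(exc).__name__)
```

Replacing the lambda with a module-level `def` makes the extension picklable, which is an alternative fix to dropping `n_process`/`batch_size` altogether.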
- honestly long live a word cloud
- When you instantiate the count vectoriser/TF-IDF vectorisers, I would add a comment to say you can refer to the documentation to understand what the parameters mean.
- [X] Done.
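As an illustration of the kind of parameter comments suggested above, here is a hedged sketch assuming scikit-learn; the corpus and settings are illustrative, not the notebook's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the workshop abstracts.
corpus = [
    "innovation mapping with text analysis",
    "topic modelling of patent abstracts",
    "text analysis of research paper abstracts",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # count unigrams and bigrams
    min_df=1,              # keep terms appearing in at least 1 document
    max_df=0.9,            # drop terms appearing in >90% of documents
    stop_words="english",  # strip common English stop words
)
tfidf = vectorizer.fit_transform(corpus)
print(tfidf.shape)  # (n_documents, n_terms)
```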
- Ran into the same issue when you build the `batch_generator` with the preprocessing pipe for patents - this takes 11 seconds to run the cell so I think it's ok to not use it in the first place.
- [X] Removed arguments, changed object name to avoid confusion.
- small spelling error in the k-means section: should be 'in this step', not 'n this step'
- [X] Done.
- the silhouette score for loop is not in a python cell - to convert
- [X] Surprisingly slow, I'll add a comment to convert but prefer to leave it as raw by default.
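For anyone converting that raw cell, the loop being discussed is roughly of this shape (a sketch assuming scikit-learn, with synthetic two-cluster data standing in for the document embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embeddings: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(5, 1, (50, 5))])

# Scan candidate cluster counts and record the silhouette score for each k.
# This is the slow part: a full fit + score per candidate k.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the two-blob data should favour k=2
```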
- when I get to the BERTopic section, I get the following error: `SystemError: initialization of _internal failed without raising an exception`. I've searched and it looks like this is due to a numpy error - I think you need to pin numpy as `numpy==1.23.5` in `requirements.txt`.
- [X] Interesting, I have 1.24.4 and it runs fine for me. I've added a heads-up comment, will troubleshoot in Estonia if it happens.
- I think there's so much material in the one notebook that I would split it into quite a few different notebooks - the basics of text analysis, the innovation mapping exercise at the beginning, the application of a paper, etc. This way it's not as overwhelming to folks.
- [X] Great, we are in agreement, did this before I read your comment.
- I would also ask more interactive questions, e.g. "can you refer to the spaCy documentation and preprocess the text in the same way using this library?" This way, there are more questions/tasks for people to do instead of just running cells.
- [X] I've added a few, good tip!
@ampudia19 thanks for the changes, looks great!! I'm going to merge for you then review deep learning
Text processing and topic modelling material for the text analysis workshop.
Checklist:
- [X] Added material under `notebooks/`
- [X] Ran `pre-commit` and addressed any issues not automatically fixed
- [X] Based on the `dev` branch
- [X] Updated the relevant `README`s