nestauk / dap_taltech

Tutorials for taltech hack week 2023
MIT License

Close #2 textanalysis #13

Closed ampudia19 closed 1 year ago

ampudia19 commented 1 year ago

Text processing and topic modelling material for the text analysis workshop.

Checklist:

india-kerle commented 1 year ago

It's hard to leave comments on the notebook so I've gone through it and added comments here.

These comments are in order as I review the notebook:

india-kerle commented 1 year ago

These comments are overall in terms of content and organisation:

ampudia19 commented 1 year ago
  • It takes approximately 30 seconds to plot publications over time using the hist_plot function - I would just add a hashed comment to say it will take that long. I would do the same for the clusters_keywords_plot function.
  • [X] I have changed the cells to raw and written a disclaimer that these are slow and to be run at one's own risk.
  • It looks like you have two sections called the "basics of text analysis" that cover slightly different things i.e. the first is more about innovation mapping and the second is the actual methods. I would rename them to make clear the difference between the two sections.
  • [X] I have removed the first label, as that section revolves more around data exploration - utils that may be useful for attendees - and doesn't delve into text analysis.
  • I love the colored text! I think it's a really nice visual for text preprocessing.
  • I think you should either mention in text_analysis/README.md that you need to download spaCy's model, or add a comment next to the hashed-out !spacy download en_core_web_sm line to say it needs to be unhashed to download the model.
  • [X] Separated the tutorials and added your suggestions to the README.
  • I get the following error in the cell after "Let's now do the same at scale. We will use multi-processing with spaCy's pipelines to build a preprocessing pipeline": AttributeError: Can't pickle local object 'install_spacy_extensions.<locals>.<lambda>'. I was able to resolve this by not batching i.e. get rid of n_process and batch_size. It took X minutes to run the cell without batch. It took 1 minute 58 seconds to run the cell. You could just use a smaller sample of abstracts to make sure it can run without batch processing.
  • [X] I've included the disclaimer and made it so that one can just load the preprocessed data from S3.
  • honestly long live a word cloud
  • When you instantiate the count vectoriser/tf-idf vectorisers, I would add a comment to say you can refer to the documentation to understand what the parameters mean.
  • [X] Done.
  • Ran into the same issue when you build the batch_generator with the preprocessing pipe for patents - the cell takes 11 seconds to run without batching, so I think it's ok to not use it in the first place.
  • [X] Removed arguments, changed object name to avoid confusion.
  • small spelling in the k-means section: should be 'in this step' not 'n this step'
  • [X] Done.
  • the silhouette score for loop is not in a Python cell - it needs converting
  • [X] Surprisingly slow, I'll add a comment to convert but prefer to leave it as raw by default.
  • when I get to the BERTopic section, I get the following error: SystemError: initialization of _internal failed without raising an exception. I've searched and it looks like this is due to a numpy version issue - I think you need to pin numpy as numpy==1.23.5 in requirements.txt.
  • [X] Interesting, I have 1.24.4 and it runs fine for me. I've added a heads-up comment, will troubleshoot in Estonia if it happens.
  • I think there's so much material in the one notebook that I would split it into quite a few different notebooks actually - the basics of text analysis, the innovation mapping exercise in the beginning, the application of a paper etc. This way it's not as overwhelming to folks.
  • [X] Great, we are in agreement, did this before I read your comment.
  • I would also ask more interactive questions, e.g. "can you refer to the spaCy documentation and preprocess the text in the same way using this library?" This way, there are more questions/tasks for people to do instead of just running cells.
  • [X] I've added a few, good tip!
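
For anyone hitting the `Can't pickle local object 'install_spacy_extensions.<locals>.<lambda>'` error above: multi-process pipelines generally need to pickle state to send it to worker processes, and a lambda defined inside a function can't be pickled. Below is a minimal sketch of that failure mode using only the standard library. It does not reproduce the notebook's actual `install_spacy_extensions`; the function names and the lowercase getter here are hypothetical stand-ins.

```python
import pickle


def install_extensions_with_lambda():
    """Hypothetical stand-in for the failing pattern: a lambda created
    inside a function, so its qualified name contains '<locals>'."""
    return lambda text: text.lower()


def lowercase_text(text):
    """Hypothetical module-level replacement for the lambda. Functions
    defined at module level are pickled by reference (module + name)."""
    return text.lower()


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Depending on the Python version, a local lambda raises one or
        # the other of these exceptions.
        return False


if __name__ == "__main__":
    print("local lambda picklable:", can_pickle(install_extensions_with_lambda()))
    print("module-level function picklable:", can_pickle(lowercase_text))
```

If the extensions were registered with module-level functions instead of lambdas, `n_process` and `batch_size` could probably be kept; dropping them, as suggested above, avoids the pickling step altogether.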
india-kerle commented 1 year ago

@ampudia19 thanks for the changes, looks great!! I'm going to merge for you then review deep learning