phosseini / COVID19-fa

Content analysis of Persian/Farsi Tweets during COVID-19 pandemic in Iran using NLP (NLP-COVID19, EMNLP)

please update the requirements #16

Closed MINIMALaq closed 3 years ago

MINIMALaq commented 3 years ago

Hello, thank you very much for this repo. I tried to run the lda_analysis notebook and it gave me lots of import errors, apparently because the wrappers module was removed from gensim. It still produces errors even when I use gensim==3.8.3: in topic_modeling_gensim.py you used `from gensim.models.wrappers import LdaMallet`, which needs `from gensim.models.word2vec import Vocab`, and I couldn't resolve this error: `ImportError: cannot import name 'Vocab' from 'gensim.models.word2vec'`. Can you please run this notebook in a virtual env and update requirements.txt?
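For context, the import that breaks looks roughly like this (a sketch, assuming a fresh virtual environment; `gensim.models.wrappers` exists only in the gensim 3.x series and was removed in gensim 4.0):

```python
# Sanity-check which gensim is actually being imported before blaming the code;
# a stray gensim >= 4.0 on the path reproduces exactly this kind of ImportError.
import gensim

print(gensim.__version__)  # the notebook's wrapper imports need a 3.x version, e.g. 3.8.3

# Removed in gensim 4.0 together with the rest of gensim.models.wrappers:
from gensim.models.wrappers import LdaMallet
```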

phosseini commented 3 years ago

> Hello, thank you very much for this repo. I tried to run the lda_analysis notebook and it gave me lots of import errors, apparently because the wrappers module was removed from gensim. It still produces errors even when I use gensim==3.8.3: in topic_modeling_gensim.py you used `from gensim.models.wrappers import LdaMallet`, which needs `from gensim.models.word2vec import Vocab`, and I couldn't resolve this error: `ImportError: cannot import name 'Vocab' from 'gensim.models.word2vec'`. Can you please run this notebook in a virtual env and update requirements.txt?

Thanks for bringing this to my attention. I updated requirements.txt and added the packages used in the Jupyter notebooks. However, I'm not sure whether the error you mentioned is related to, or will be resolved by, installing the newly added packages in the requirements file. Are you sure you've properly installed Mallet? If not, you may first want to take a look here and here. Please let me know if you still have a problem after installing Mallet (I'll also update the README to mention the Mallet installation requirement).
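As a quick pre-flight check (a sketch; Mallet itself is a Java tool rather than a Python package, so it is not covered by requirements.txt):

```python
# Check that the two external pieces Mallet needs are reachable before running
# the notebook: a Java runtime on PATH, and the mallet launcher script itself.
import shutil

print("java:  ", shutil.which("java"))    # None here means no JVM on the PATH
print("mallet:", shutil.which("mallet"))  # None is fine if you pass an absolute
                                          # path to the launcher script instead
```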

MINIMALaq commented 3 years ago

The requirements file has an issue:

The conflict is caused by:
    The user requested numpy==1.18.1
    gensim 3.8.0 depends on numpy>=1.11.3
    yellowbrick 1.1 depends on numpy>=1.13.0
    matplotlib 3.1.3 depends on numpy>=1.11
    scikit-learn 0.24.2 depends on numpy>=1.13.3
    bokeh 2.0.2 depends on numpy>=1.11.3
    pyldavis 3.3.1 depends on numpy>=1.20.0

I commented out pandas and numpy and it works. In lda_analysis I changed `import pyLDAvis.gensim` to `import pyLDAvis.gensim_models` and it seems to work!
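For anyone else hitting the same thing, the rename in pyLDAvis 3.x looks roughly like this (a sketch; `lda_model`, `corpus`, and `dictionary` are whatever the notebook builds earlier):

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # was pyLDAvis.gensim in pyLDAvis 2.x

# The call itself is unchanged apart from the module name:
# vis = gensimvis.prepare(lda_model, corpus, dictionary)
# pyLDAvis.save_html(vis, "lda_vis.html")
```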

The next issue is the data/cleaned folder, which is not included in the repo. Do I need to run other code to clean the data and fill this folder before using lda_analysis?

For Mallet I used brew; on macOS the binary ends up at "/opt/homebrew/Cellar/mallet/2.0.8_1/bin/mallet". I tested it and it works.
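Wiring that Homebrew path into gensim's wrapper then looks roughly like this (a toy sketch, assuming gensim 3.x; the corpus here is just a placeholder):

```python
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet

mallet_path = "/opt/homebrew/Cellar/mallet/2.0.8_1/bin/mallet"  # Homebrew install on Apple Silicon

# Tiny placeholder corpus just to confirm the Mallet binary is picked up;
# the real input comes from the cleaned tweets.
texts = [["virus", "quarantine", "health"], ["school", "closure", "health"]]
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(t) for t in texts]

lda_mallet = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=id2word)
print(lda_mallet.show_topics(num_topics=2, num_words=3))
```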

phosseini commented 3 years ago

> The requirements file has an issue:
>
> The conflict is caused by:
>     The user requested numpy==1.18.1
>     gensim 3.8.0 depends on numpy>=1.11.3
>     yellowbrick 1.1 depends on numpy>=1.13.0
>     matplotlib 3.1.3 depends on numpy>=1.11
>     scikit-learn 0.24.2 depends on numpy>=1.13.3
>     bokeh 2.0.2 depends on numpy>=1.11.3
>     pyldavis 3.3.1 depends on numpy>=1.20.0
>
> I commented out pandas and numpy and it works. In lda_analysis I changed `import pyLDAvis.gensim` to `import pyLDAvis.gensim_models` and it seems to work!
>
> The next issue is the data/cleaned folder, which is not included in the repo. Do I need to run other code to clean the data and fill this folder before using lda_analysis?
>
> For Mallet I used brew; on macOS the binary ends up at "/opt/homebrew/Cellar/mallet/2.0.8_1/bin/mallet". I tested it and it works.

The reason for not including the data/cleaned folder is that it requires preprocessing of the raw tweets, which we cannot share for now. However, assuming you have the full JSON files of tweets (after downloading the tweets using the IDs we have shared), you can put the JSON files in the data/input folder, then use the methods of the PreProcessing class in pre_processing.py to convert the raw JSON/Excel files of tweets into the cleaned/filtered subset that serves as the input for topic modeling.
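A rough outline of that flow (the method name on PreProcessing below is a made-up placeholder; check pre_processing.py for the actual method names and what they return):

```python
import glob
from pre_processing import PreProcessing  # module from this repo

pre = PreProcessing()

# Hydrated tweet dumps go into data/input/, one JSON file per batch.
for path in glob.glob("data/input/*.json"):
    # clean_and_filter() is a hypothetical placeholder for the repo's real
    # cleaning/filtering methods; see pre_processing.py for the actual API.
    cleaned = pre.clean_and_filter(path)
    out_path = path.replace("data/input", "data/cleaned").replace(".json", ".xlsx")
    cleaned.to_excel(out_path, index=False)  # assumes the method returns a pandas DataFrame
```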

There are two points to consider when using the preprocessing methods:

1. Some of these methods may be based on an older version of the Twitter API, so you may want to double-check the names and format of the fields in the tweet object to avoid key errors or formatting issues (see the sketch below).
2. Some of the preprocessing code depends on the format of the files we receive from the server we use to download the tweets (the JSON files normally follow the same format as the Twitter API; the Excel files may have minor differences). So, if you cannot run the exact code, you may want to follow the logic we used for filtering/cleaning the tweets.
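On point 1, a generic way to guard against field-name differences when reading the hydrated tweets (a sketch, not the repo's code; which fields exist depends on how and when the tweets were downloaded):

```python
import json

def iter_tweets(path):
    """Yield a small, defensively-read subset of fields from a tweets JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:  # assumes one JSON object per line; adjust if it's a single array
            tweet = json.loads(line)
            yield {
                "id": tweet.get("id_str") or tweet.get("id"),
                # older v1.1 dumps use "text", extended/newer ones "full_text"
                "text": tweet.get("full_text") or tweet.get("text", ""),
                "created_at": tweet.get("created_at"),
            }
```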

MINIMALaq commented 3 years ago

Thanks Pedram jan.