kaclaborn opened this issue 3 years ago
@KristaLawless @mmonakho I have processed 1429 PDF documents from a LexisNexis search on "sea ice" AND "Alaska" within the Alaska Dispatch News source only. Downloading these documents was a bit of a pain, as I ran into the same issue Masha did with only being able to download 1000. This issue is actually related to referencing the search item number when downloading (i.e., when you have a list of 1500 search results and try to download them 100 at a time, you're only allowed to reference the first 1000 list items when downloading -- no idea why).
The workaround was to split the results into two groups of fewer than 1000 items each. So, I divided them into all results prior to 1/1/2014 and all results after 12/31/2013. Then, I could download them all in batches of 100.
Next, I had to remove a fair number of duplicates from our sample (despite using the "Group Duplicates" function in LexisNexis that Masha showed us!). Anyway, all of the pre-processed PDFs from this query are in our shared Dropbox folder --> Corpus --> pre-processed.
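For future reference, here is a minimal sketch of one way leftover duplicates could be flagged programmatically, by comparing normalized document text after reading the PDFs. The folder path is an assumption based on the Dropbox structure above, not the exact layout in the repo.

```r
# Hypothetical duplicate check: read each pre-processed PDF, normalize its text,
# and flag files whose text matches an earlier file exactly.
library(pdftools)

files <- list.files("data/corpus/pre-processed", pattern = "\\.pdf$", full.names = TRUE)
texts <- vapply(files, function(f) paste(pdf_text(f), collapse = " "), character(1))

norm <- tolower(gsub("\\s+", " ", texts))  # squeeze whitespace, ignore case
files[duplicated(norm)]                    # candidate duplicates to review
```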
I have also processed all of the above PDFs into cleaned text files, using the script corpusPreProcess.R. I did this by copying a local version of the pre-processed folder from Dropbox into my local clone of the GitHub repository (within data/corpus). Then, I output all of the files as text documents back into the same folder on my local device, and copied that folder (named post-process) into our shared Dropbox. These files represent our clean corpus, ready to be analyzed! Again, great resources that I found for doing text analysis in R are linked at the top of this issue.
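For anyone following along, here is a minimal sketch of what that PDF --> text conversion step could look like; the actual corpusPreProcess.R in the repo may differ, and the folder names are taken from the Dropbox structure described above.

```r
# Sketch of the conversion step: read each pre-processed PDF, flatten it to a
# single cleaned string, and write it out as a .txt file in post-process/.
library(pdftools)

in_dir  <- "data/corpus/pre-processed"
out_dir <- "data/corpus/post-process"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (f in list.files(in_dir, pattern = "\\.pdf$", full.names = TRUE)) {
  txt <- paste(pdf_text(f), collapse = " ")   # one string per document
  txt <- gsub("\\s+", " ", txt)               # collapse whitespace/line breaks
  writeLines(txt, file.path(out_dir, sub("\\.pdf$", ".txt", basename(f))))
}
```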
Next step for me will be to pull out the date/year for each document, so we can see changes through time (between 1995 and 2021).
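A rough sketch of how that date extraction could work, assuming each cleaned .txt file still contains a date in the usual "Month DD, YYYY" format (the regex and folder path are assumptions and may need adjusting to the real files):

```r
# Pull a publication date out of each cleaned text file and tabulate by year.
library(stringr)

txt_files <- list.files("data/corpus/post-process", pattern = "\\.txt$", full.names = TRUE)

doc_dates <- sapply(txt_files, function(f) {
  txt <- paste(readLines(f, warn = FALSE), collapse = " ")
  str_extract(txt, "(January|February|March|April|May|June|July|August|September|October|November|December)\\s+\\d{1,2},\\s+\\d{4}")
})

doc_years <- as.integer(str_extract(doc_dates, "\\d{4}"))
table(doc_years)   # how many documents per year, 1995-2021
```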
This is amazing!! And how smart of you to split the search into 2 based on dates! I did not think of that.
Another word count tutorial: https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/ -- I think this tutorial uses .pdf files, but it might work for .txt files too.
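A quick sketch of how that tutorial's word-count approach could be adapted to our .txt files instead of PDFs (the folder path and the default English stopword list are assumptions):

```r
# Build a document-term matrix from the cleaned .txt files and list top terms.
library(tm)

docs <- VCorpus(DirSource("data/corpus/post-process", pattern = "\\.txt$"))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm  <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 20)   # 20 most frequent terms across the corpus
```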
@KristaLawless @mmonakho OK, I have pushed the metadataExtract.R updates to the repo, and have added the lemma data and stopwords data to our shared Dropbox folder, from which you can copy them into your locally cloned repo (in the data/resources folder).
As we discussed, here are next steps for each of us, following from this awesome tutorial:
NOTE: in the metadataExtract.R code, "ADN" is equivalent to textdata from the tutorials, and "ADNcorpus" is equivalent to sotu_corpus in the tutorials.
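In other words, something along these lines (a sketch only -- the real objects are built in metadataExtract.R, and the quanteda/readtext calls and file paths here are assumptions based on the tutorials):

```r
# ADN ~ textdata in the tutorials: one row per document (text plus metadata).
# ADNcorpus ~ sotu_corpus in the tutorials: a quanteda corpus built from that table.
library(quanteda)
library(readtext)

ADN       <- readtext("data/corpus/post-process/*.txt")
ADNcorpus <- corpus(ADN, text_field = "text")
summary(ADNcorpus, n = 5)
```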
@mmonakho @kaclaborn
Notes from 10/19/21
Analysis: 1-5 (@kaclaborn)
Next steps:
Exploratory research questions:
- What is the prevalence of regional newspaper coverage of sea ice over the last 26 years? How has this prevalence changed over time?
- Number of documents per year --> how often is sea ice being talked about? (See the sketch below.)
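A hedged sketch for that prevalence question, assuming a doc_years vector of document years like the one sketched earlier in the thread (the plot labels are placeholders):

```r
# Count documents per year and plot the trend over time.
library(ggplot2)

counts <- as.data.frame(table(doc_years))
names(counts) <- c("year", "n_docs")
counts$year <- as.integer(as.character(counts$year))

ggplot(counts, aes(x = year, y = n_docs)) +
  geom_col() +
  labs(x = "Year", y = "Number of ADN articles mentioning sea ice",
       title = "Sea ice coverage in the Alaska Dispatch News, 1995-2021")
```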
Pretty interesting results on the main 20 topics... any thoughts on interpretation?
What stop words did you include?
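For context on that question, a hedged sketch of how a 20-topic model like the one above could be fit, assuming the quanteda + topicmodels route and the ADNcorpus object from metadataExtract.R; stopwords("english") below is only a stand-in for the shared stopword list in data/resources.

```r
# Tokenize the corpus, drop stopwords, and fit a 20-topic LDA model.
library(quanteda)
library(topicmodels)

toks <- tokens(ADNcorpus, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))   # swap in the shared stopword list here

dtm   <- convert(dfm(toks), to = "topicmodels")
lda20 <- LDA(dtm, k = 20, control = list(seed = 1234))
terms(lda20, 10)   # top 10 terms per topic
```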
The paper on topic modelling (the LDA method): https://ai.stanford.edu/~ang/papers/jair03-lda.pdf
Great resources on collocation analysis and word counts, etc. using R: https://slcladal.github.io/coll.html
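A short sketch of what a collocation analysis from that tutorial could look like on our corpus, again assuming the ADNcorpus object from metadataExtract.R (the frequency threshold is an assumption):

```r
# Find two-word collocations that occur together more often than chance.
library(quanteda)
library(quanteda.textstats)

toks <- tokens(ADNcorpus, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)

colls <- textstat_collocations(toks, size = 2, min_count = 20)
head(colls[order(-colls$lambda), ], 20)   # strongest collocations first
```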
How to convert pdf to text in R: https://slcladal.github.io/convertpdf2txt.html