mmonakho / Sea-Ice-Change


Analysis / methods tutorials and scripts #7

Open kaclaborn opened 3 years ago

kaclaborn commented 3 years ago

Great resources on collocation analysis and word counts, etc. using R: https://slcladal.github.io/coll.html

How to convert pdf to text in R: https://slcladal.github.io/convertpdf2txt.html

kaclaborn commented 3 years ago

@KristaLawless @mmonakho I have processed 1429 PDF documents from a LexisNexis search on "sea ice" AND "Alaska", restricted to the Alaska Dispatch News source. Downloading these documents was a bit of a pain, as I ran into the same issue Masha did with only being able to download 1000. The limit is actually tied to the search item number referenced when downloading (i.e., when you have a list of 1500 search results and try to download them 100 at a time, you're only allowed to reference the first 1000 list items -- no idea why).

The workaround was to split the results into two groups of fewer than 1000 items each. I divided them by date: all results before 1/1/2014, and all results from 1/1/2014 onward. Then I could download each group in batches of 100.
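
The batching behind this workaround can be sketched as below. This is only an illustrative Python sketch (our own scripts are in R): the function name `download_ranges` and the group size in the usage line are hypothetical, and the 1000-item ceiling is taken from the description above rather than from any vendor documentation.

```python
def download_ranges(n_results, batch_size=100, max_index=1000):
    """Return inclusive (start, end) item ranges for one search-result group.

    Raises ValueError if the group is larger than the highest item number
    that can be referenced when downloading (1000, per the note above).
    """
    if n_results > max_index:
        raise ValueError(
            f"{n_results} results exceed the {max_index}-item limit; "
            "split the search further (e.g. by date)"
        )
    return [
        (start, min(start + batch_size - 1, n_results))
        for start in range(1, n_results + 1, batch_size)
    ]


# Hypothetical group of 250 results:
print(download_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
```

Calling it once per date group gives the item ranges to enter in the download dialog, and the error makes it obvious when a group still needs to be split.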

Next, I had to remove a fair number of duplicates from our sample (despite using the "Group Duplicates" function in LexisNexis that Masha showed us!). Anyway, all of the pre-processed PDFs from this query are in our shared Dropbox folder --> Corpus --> pre-processed.
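
For reference, one way to catch exact duplicates once the text has been extracted is to hash a normalized copy of each document. This is a minimal Python sketch and not necessarily how the duplicates above were actually removed; the names `normalize` and `drop_duplicates` are hypothetical, and it only catches documents that are identical after lowercasing and whitespace collapsing (near-duplicates such as revised versions of a story would need a fuzzier comparison).

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences
    do not hide an otherwise identical article."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_duplicates(docs):
    """Keep the first document for each distinct normalized text.

    `docs` maps a document name to its text; returns (kept, dropped_names).
    """
    seen = set()
    kept, dropped = {}, []
    for name, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            dropped.append(name)
        else:
            seen.add(digest)
            kept[name] = text
    return kept, dropped
```

The list of dropped names is returned so the removals can be checked by eye before anything is deleted from the shared folder.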

kaclaborn commented 3 years ago

I have also processed all of the above PDFs into cleaned text files, using the script corpusPreProcess.R. I did this by copying a local version of the pre-process folder from Dropbox into my local clone of the GitHub repository (within the folders data/corpus). Then, I output all of the files as text documents back into the same folder on my local device, and copied that folder (named post-process) into our shared Dropbox. These files represent our clean corpus ready to be analyzed! Again, great resources that I found for doing some text analysis in R are linked at the top of this issue.

kaclaborn commented 3 years ago

Next step for me will be to pull out the date/year for each document, so we can see changes through time (between 1995 and 2021).
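
A rough sketch of how the year could be pulled from each cleaned text file, written in Python for illustration only (the project scripts are in R). It assumes each document contains its publication date written as "Month D, YYYY" and that the first such date within the study period is the publication date; both are assumptions about the export format, and `extract_year` is a hypothetical name.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches dates such as "October 2, 2021"
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}}),\s+(\d{{4}})\b")


def extract_year(text, first=1995, last=2021):
    """Return the year of the first 'Month D, YYYY' date in `text` that
    falls within [first, last], or None if there is no such date."""
    for match in DATE_RE.finditer(text):
        year = int(match.group(3))
        if first <= year <= last:
            return year
    return None
```

Documents that come back as None would need to be checked by hand, since a missing or differently formatted date would otherwise silently drop out of the time series.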

mmonakho commented 3 years ago

This is amazing!! And how smart of you to split the search into 2 based on dates! I did not think of that.

On Sat, Oct 2, 2021 at 12:10 PM Kelly Claborn @.***> wrote:

Next step for me will be to pull out the date/year for each document, so we can see changes through time (between 1995 and 2021).


kaclaborn commented 3 years ago

Another word count tutorial: https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/ -- I think this tutorial uses .pdf files, but it might work for .txt files too.

kaclaborn commented 3 years ago

@KristaLawless @mmonakho OK, I have pushed the metadataExtract.R updates to the repo, and have added the lemma data and stopwords data to our shared Dropbox folder. You can copy them from there into your locally cloned repo (in the data/resources folder).

As we discussed, here are next steps for each of us, following from this awesome tutorial:

  1. Frequency analysis by year (Tutorials 3-4) -- KELLY
    • Bigrams by year of top 5-10 bigrams (as defined across our entire corpus)
    • Collocations of length 3 by year (potentially zoning in on sea ice retreat, recede, loss, etc.)
    • Visualizations
  2. Co-occurrences (Tutorial 5) -- KRISTA
    • NOTE: In the text processing phase, we need to add two new processing lines, each using the tokens_replace() function, to change "sea ice" --> "sea-ice" and "climate change" --> "climate-change", so that we can treat these high-frequency bigrams as single words when exploring co-occurrences.
    • Then, explore how "sea-ice" and "climate-change" co-occur with other words, and do some statistical analyses to see whether the highest-frequency co-occurrences happen more often than would be expected by chance.
  3. Topic modeling (Tutorial 6) -- MASHA
    • I don't know enough about topic modeling to know exactly what we want to explore here, beyond simply trying to identify any broader topic clusterings! It looks like we could do a word cloud of topics, etc. in addition to other analyses.
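
To make the intended effect of the "sea ice" --> "sea-ice" replacement concrete, here is a small language-neutral sketch in Python. It is not the R code from the tutorials and does not reproduce tokens_replace(); the function name `compound_phrases` is hypothetical, and it assumes the text has already been lowercased and tokenized.

```python
def compound_phrases(tokens, phrases):
    """Join each occurrence of a multi-word phrase into one hyphenated token.

    `tokens` is a list of lowercase tokens; `phrases` is a list of token
    tuples, e.g. [("sea", "ice"), ("climate", "change")].
    """
    # Ignore empty phrases and try longer phrases first, so that
    # overlapping phrases resolve predictably.
    ordered = sorted((tuple(p) for p in phrases if p), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for phrase in ordered:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.append("-".join(phrase))
                i += n
                break
        else:
            # No phrase starts at position i: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


tokens = "the sea ice loss and climate change".split()
print(compound_phrases(tokens, [("sea", "ice"), ("climate", "change")]))
# ['the', 'sea-ice', 'loss', 'and', 'climate-change']
```

After this step "sea-ice" and "climate-change" behave like any other single token, which is what the co-occurrence analysis needs.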

NOTE: in the metadataExtract.R code, "ADN" is equivalent to textdata from the tutorials, and "ADNcorpus" is equivalent to sotu_corpus in the tutorials.
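
For the "more often than by chance" question in the co-occurrence step, one common approach is to compare the observed number of co-occurrences with the number expected if the two words were independent, using a log-likelihood ratio (G2) on a 2x2 table. The sketch below is an illustrative Python version, not the tutorial's R code; `cooccurrence_stats` is a hypothetical name, and it counts co-occurrence at the level of whatever unit is passed in (documents or sentences).

```python
import math


def cooccurrence_stats(units, target, other):
    """Compare observed vs. expected co-occurrence of two tokens.

    `units` is a list of token lists (one per document or sentence).
    Returns the observed and expected number of units containing both
    tokens, and the log-likelihood ratio statistic G2 for the 2x2 table.
    """
    n = len(units)
    if n == 0:
        raise ValueError("no units to analyse")
    both = sum(1 for u in units if target in u and other in u)
    only_t = sum(1 for u in units if target in u and other not in u)
    only_o = sum(1 for u in units if target not in u and other in u)
    neither = n - both - only_t - only_o

    observed = [both, only_t, only_o, neither]
    row_t, row_not = both + only_t, only_o + neither
    col_o, col_not = both + only_o, only_t + neither
    expected = [row_t * col_o / n, row_t * col_not / n,
                row_not * col_o / n, row_not * col_not / n]

    # G2 = 2 * sum(O * ln(O / E)); cells with O == 0 contribute nothing.
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return {"observed": both, "expected": expected[0], "g2": g2}
```

G2 is usually compared against a chi-square distribution with 1 degree of freedom (values above about 3.84 correspond to p < 0.05). A large G2 can also mean the words co-occur less often than expected, so the observed and expected counts should be read together.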

KristaLawless commented 3 years ago

@mmonakho @kaclaborn

Notes from 10/19/21

Analysis: 1-5 (@kaclaborn)

  1. number of articles per year through time
  2. word count --> bigrams as a function of the number of documents in 3 year increments
  3. number of bigrams over the number of documents
  4. choose most common bigrams as a ratio
  5. trigrams
  6. topic modeling (@mmonakho)
  7. co-occurrence (@KristaLawless)
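
As a concrete reading of items 2-4 above (bigram counts divided by the number of documents, in 3-year increments), here is a minimal Python sketch. It is illustrative only: the names `bigrams` and `bigram_rate_by_period` are hypothetical, the bins are assumed to start at 1995, and the input is assumed to be already tokenized with a year attached to each document.

```python
from collections import Counter, defaultdict


def bigrams(tokens):
    """Return adjacent token pairs as 'w1 w2' strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def bigram_rate_by_period(docs, start_year=1995, width=3):
    """Bigram occurrences per document within fixed-width year bins.

    `docs` is a list of (year, tokens) pairs. Returns
    {period_start: {bigram: occurrences / number of documents in period}}.
    """
    counts = defaultdict(Counter)
    n_docs = Counter()
    for year, tokens in docs:
        # Map the year to the first year of its bin (1995, 1998, ...).
        period = start_year + ((year - start_year) // width) * width
        counts[period].update(bigrams(tokens))
        n_docs[period] += 1
    return {
        period: {bg: c / n_docs[period] for bg, c in counter.items()}
        for period, counter in counts.items()
    }
```

Dividing by the number of documents in each bin keeps periods with heavier coverage from dominating the comparison; item 1 (articles per year) is just the document count per year before binning.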

Next steps:

  1. interpret analyses (all of us)
  2. build upon outline for rough draft due 11/9/21 (@mmonakho @KristaLawless)

Exploratory research questions:

    • How prevalent has sea ice coverage been in the regional newspaper over the last 26 years?
    • How has this prevalence changed over time?
    • Number of documents --> how often is sea ice being talked about?

mmonakho commented 3 years ago

Pretty interesting results on the main 20 topics... any thoughts on interpretation?

[image: Topics_all]

KristaLawless commented 3 years ago

What stop words did you include?

On Tue, Oct 19, 2021 at 3:30 PM mmonakho @.***> wrote:

Pretty interesting results on the main 20 topics... any thoughts on interpretation?

[image: Topics1] https://user-images.githubusercontent.com/87719493/137999424-70d193ad-49d0-474b-8da8-4083a4c46971.JPG
[image: Topics2] https://user-images.githubusercontent.com/87719493/137999425-7361b621-8ba3-4f82-8219-5119cd56b4fa.jpg
[image: Topics2] https://user-images.githubusercontent.com/87719493/137999428-d8e357b9-272d-41fa-9c53-bb0e26008ff3.jpg
[image: Topics1] https://user-images.githubusercontent.com/87719493/137999430-6cb2ca0b-b43a-4840-ac9e-d984b7c07e54.JPG



mmonakho commented 3 years ago

The paper on topic modelling (LDA method): https://ai.stanford.edu/~ang/papers/jair03-lda.pdf