leonawicz / rtrek

R package for Star Trek datasets and related R functions.
https://leonawicz.github.io/rtrek/

Datasets: derived from text mining #1

Closed leonawicz closed 3 years ago

leonawicz commented 6 years ago

Add datasets to rtrek representing summaries of interesting variables culled from text mining of Star Trek e-books and other written Star Trek content like television episode screenplay transcripts.

tylerlittlefield commented 5 years ago

Hi @leonawicz, I've started parsing Star Trek transcripts here (currently just TNG) and thought I'd let you know in case you're interested. Before doing this I googled to see if anyone else had already done it and stumbled upon this issue. I've quickly thrown the repository up on GitHub but will polish it up later tonight.

leonawicz commented 5 years ago

Hi @tyluRp thanks for reaching out. I just took a look at your startrek package and it looks like very useful data. Very cool! I know these transcripts are available online, it's just been a while since I've looked them up. I like the tidy data frame approach to organizing the episode scripts. I'd eventually do the same kind of thing. The approach I take with rtrek though is to access a lot of the larger datasets remotely by constructed API-like calls. Otherwise the package ends up with too large a size due to the amount of data. I think I would do the same for episodes, where the requested data is pulled in from a remote source rather than stored in the package (assuming a dependable source).

Since the data prep for rtrek covers a lot of different things, I've put a lot of the prep for rtrek datasets into trekdata, just to keep rtrek/data-raw from getting too messy and out of control. trekdata is basically just an abstracted addition to data-raw. It serves no other purpose.

Between the total data size and the licensing/copyright associated with the raw transcripts (I'm no expert on such a topic), I could see accessing the texts from an online source, but not copying them directly into a published package. But anyway on data size alone for rtrek, I imagine I'd stick to remote data access.
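Just to sketch the kind of remote access I mean (the URL and file layout here are made up, not an actual rtrek endpoint), it would be something like:

library(readr)

# Hypothetical helper: fetch a series' episode table from a remote host on
# demand instead of bundling it in the package.
read_remote_episodes <- function(series = "TNG") {
  base_url <- "https://example.com/star-trek-data"  # placeholder host
  read_csv(paste0(base_url, "/", tolower(series), "_episodes.csv"))
}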

If you want to discuss any R/Trek stuff feel free to drop me an email at my maintainer address.

Also, I just saw your lisa package. That's really awesome. I was going to say you should see about getting it added to paletteer but I see it already has been requested: https://github.com/EmilHvitfeldt/paletteer/issues/35

tylerlittlefield commented 5 years ago

Thanks for all the info, I wasn't aware of trekdata, that's a cool idea. Regarding the copyright stuff, that is a fuzzy topic for me as well. I wonder if parsing these scripts to a tidy data format and hosting them on GitHub is illegal? The beginning of each script reads:

This script is not for publication or reproduction. No one is authorized to dispose of the same. If lost or destroyed, please notify the Script Department.

I guess I should start learning about this stuff (or at least try) 🤔

leonawicz commented 5 years ago

Yeah I saw that as well.

I wouldn't be surprised if it's at least against a policy somewhere, despite the fact that the transcripts can be found on numerous different websites around the internet and seem to reside in those places in perpetuity and no one seems to care. That said, even if I were just scraping a web page for a remotely hosted script on demand rather than storing it in my package, I'd still want to maintain a separate doomsday copy of the scripts myself in case the third-party website I'm relying on ever goes out of commission. I can count on sites like Memory Alpha and Memory Beta sticking around (though they do let you download snapshots of their entire database if you want to), but a random Star Trek fan website that looks like it was made 20 years ago, who knows haha.

I was looking into some of the script sources today. It looks like a balancing act between consistency of format and maximizing information: sites that host the authentic transcripts are clearly of the highest quality. Those scripts contain the most information, namely scene description. However, it looks like for some series the scripts for most episodes are unobtainable. On the other hand, sites like chakoteya.net seem to have all episodes, all in a consistent format for text parsing (nothing worse than having to write different code for different series and websites). However, the downside here is that information is clearly lost. The dialog is all there, but plenty of other elements of the scripts are missing. I don't know for sure, but I wonder if these were painstakingly written out by a fan rather than pulled from original scripts.

Perhaps I would still use the latter because it would be low-hanging fruit: a single approach to parsing consistently formatted text, and from the perspective that it's ultimately the dialog that is most critical. Still, it would be a shame not to have all the rich information in other table columns like you were able to do working with the original TNG scripts.

tylerlittlefield commented 5 years ago

Yeah, the website I'm grabbing the transcripts from is certainly lacking in the available scripts but fortunately has more information. They look pretty authentic, I wonder where the maintainer of the website found them.

Like you mentioned, the "setting" and "description" columns (though the latter is sparse) are nice to have. I'm particularly interested in the "description" column (this could have a better name), which is basically additional metadata describing the characters' emotions, actions, facial expressions, etc. For example:

library(dplyr)     # for as_tibble() and glimpse()
library(startrek)  # provides the ds9 list of episode transcripts

as_tibble(ds9$Chimera) %>% 
  .[9, ] %>% 
  glimpse()

#> Observations: 1
#> Variables: 5
#> $ perspective <chr> "2 CONTINUED:"
#> $ setting     <chr> "O'Brien's features falter at hearing this."
#> $ character   <chr> "ODO"
#> $ description <chr> "(misunderstanding)"
#> $ line        <chr> "You don't think she'll like it?"

Regarding chakoteya.net, I read somewhere that those were pulled from closed captioning data but have no evidence that this is true.

Regarding a single approach to parsing these transcripts, I've just run the script against Deep Space Nine and only had to make a slight adjustment. This adjustment might actually fix some of the episode names in the TNG list (if there are any issues to begin with). Still, the build scripts could be cleaned up. I was surprised how consistent the transcripts were.

leonawicz commented 5 years ago

Ah, that makes sense, I can see it now looking at chakoteya. It does read like captioning. I agree, there is a lot of value in the supplemental variables, especially for informing things like sentiment analysis.

I'm leaning toward the idea of using an approach like yours on the original transcripts, and then supplementing with something like the caption-based versions just so there is at least a minimal entry for every episode. The resulting data frame would just have more NAs where things like setting and description are lacking. It will take more code, but seems worth it. The incompleteness is just the nature of what is available, but it would be great to merge both versions.

I'd probably do some additional text reformatting/cleanup and figure out the best way to standardize things so that episode tables could easily be joined onto other related datasets. Maybe a source column indicating script vs. captioning could be a good addition; then you can filter rows on data quality essentially. It appears this site has already gathered transcripts in this manner, pulling together original scripts and the caption-based scripts where originals are not available. I only noticed a single Voyager episode missing (hyperlink removed), but it could be obtained from the other site.
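Roughly what I'm picturing for that source column (scripts_original and scripts_captions here are just hypothetical line-level tables from the two kinds of sources):

library(dplyr)

scripts_combined <- bind_rows(
  mutate(scripts_original, source = "script"),      # richer original transcripts
  mutate(scripts_captions, source = "captioning")   # caption-derived fallback
)

# Filter rows on data quality, e.g. keep only lines from original scripts:
filter(scripts_combined, source == "script")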

I'll have to play around with the data to be sure, but at the moment I'm considering a nested data frame that, when printed as a tibble, shows one row for every episode/series, with the rows for all the lines nested. But maybe that will turn out not to be ideal, not sure yet.

tylerlittlefield commented 5 years ago

That sounds good, I like the idea of merging both with the added "source" column. Additionally, I think there's a bit more data you could attach, like date, authors, etc.

A nested data frame sounds nice, although I do like the convenience of autocompletion from lists; maybe you'd have an episode list. For example, if I wanted to filter for a specific episode in the nested data frame, it would be nice to be able to do df %>% filter(episode == episodes$the_inner_light).

Regarding the structure, would you imagine something like this:

library(tidyverse)
library(startrek)

tng_all <- tng %>% 
  bind_rows(.id = "episode") %>% 
  mutate(series = "TNG") %>% 
  select(series, episode, everything())

ds9_all <- ds9 %>% 
  bind_rows(.id = "episode") %>% 
  mutate(series = "DS9") %>% 
  select(series, episode, everything())

bind_rows(tng_all, ds9_all) %>% 
  group_by(series, episode) %>% 
  nest()
#> # A tibble: 348 x 3
#>    series episode                     data              
#>    <chr>  <chr>                       <list>            
#>  1 TNG    encounter_at_farpoint       <tibble [805 × 5]>
#>  2 TNG    the_naked_now               <tibble [405 × 5]>
#>  3 TNG    code_of_honor               <tibble [438 × 5]>
#>  4 TNG    haven                       <tibble [421 × 5]>
#>  5 TNG    where_none_have_gone_before <tibble [409 × 5]>
#>  6 TNG    the_last_outpost            <tibble [493 × 5]>
#>  7 TNG    lonely_among_us             <tibble [450 × 5]>
#>  8 TNG    justice                     <tibble [452 × 5]>
#>  9 TNG    the_battle                  <tibble [523 × 5]>
#> 10 TNG    hide_and_q                  <tibble [363 × 5]>
#> # … with 338 more rows

I would be interested in hearing your ideas regarding standardization. I agree that there is some housekeeping needed. For example, the column "character" isn't very friendly because of the function character(). There's also the naming convention for episodes: do you preserve them or convert them to a case that's easier to call on?

leonawicz commented 5 years ago

Yeah I was picturing something like that table in my head. There are other tables in or around rtrek that work similarly, and anything I prep from the novels is done using epubr, which does a book-per-row kind of table with the content nested. So I'm not saying these are all the "right" or "best" way to do it. For me it's more about trying to keep as much consistency as I can across diverse datasets within rtrek, knowing that eventually rtrek will contain/access a whole lot of diverse data.

I'm not sure a column named character is a big deal given the context it would be used in. Columns with names like that are probably impossible to avoid in a lot of the remotely accessed data like from STAPI.

One thing I like about having episode names in an episode column is that it is no issue to allow something like basic sentence case for the value, whereas episode names as the names of a list might prompt you to enforce more appropriate names.

One reason I was thinking of this relates to the other thing you bring up about columns like author, date, etc., all things that only require one row (like the title) and don't need to be nested like the script lines. Ideally it would be great to have consistency in titles across the scripts data and other datasets like those from STAPI, the timeline data, etc., anywhere these episode titles might be referenced, because that would allow easy dplyr join calls to connect the information you mention to the script data frame. For example, STAPI offers mostly real-world data, like authors, dates, associated production companies, and so on, tied to things like episodes. All those more metadata-like fields could be joined to episode text seamlessly if we chose to apply the same kind of naming convention.

For example, you can see this in the GitHub readme:

Q <- "CHMA0000025118"  #unique ID
Q <- stapi("character", uid = Q)
Q$episodes %>% select(uid, title, stardateFrom, stardateTo)
#>              uid                 title stardateFrom stardateTo
#> 1 EPMA0000001458    All Good Things...      47988.0    47988.0
#> 2 EPMA0000001329                 Q Who      42761.3    42761.3
#> 3 EPMA0000001377                  Qpid      44741.9    44741.9
#> 4 EPMA0000000483 Encounter at Farpoint      41153.7    41153.7
#> 5 EPMA0000000651              Tapestry           NA         NA
#> 6 EPMA0000000845                Q-Less      46531.2    46531.2
#> 7 EPMA0000162588            Death Wish           NA         NA
#> 8 EPMA0000001413                True Q      46192.3    46192.3
#> 9 EPMA0000001510    The Q and the Grey      50384.2    50392.7
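So, assuming a hypothetical episode_scripts table with one row per episode and a standardized title column, the join I have in mind would be something like:

library(dplyr)
library(rtrek)

# Episode metadata from STAPI (same call as the readme example above):
meta <- stapi("character", uid = "CHMA0000025118")$episodes %>%
  select(title, stardateFrom, stardateTo)

# Attach the STAPI metadata to the (hypothetical) scripts table by episode title:
left_join(episode_scripts, meta, by = "title")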

The titles happen to use title case, so maybe that is best for standardizing. In fact if I find I have some datasets in rtrek that don't use the same format, due to whatever source the data was pulled from, I'd probably eventually change it to match wherever feasible.

Another thing I mean by cleanup is just the gritty side of text parsing, like how above you see All Good Things... with three periods. But in a lot of other sources I parsed for data, and it looks like it might be the case in these scripts as well, you can run into the … ellipsis special character. So I go to some length to substitute that with three actual periods, turn curly single and double quotes into straight quotes, replace all special hyphens with the basic hyphen, and so on. It's not fun, but trekdata contains functions with a bunch of relevant regex patterns and gsub calls in them to give you an idea of what I've had to clean up most often. It's useful canned code to copy and reuse on any text that has a bunch of "pretty" characters that need to be replaced. It goes a long way toward making things clean and making future downstream data manipulations less problematic.
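The general pattern is just a pile of substitutions like these (a sketch, not the actual trekdata functions):

clean_text <- function(x) {
  x <- gsub("\u2026", "...", x)          # ellipsis character -> three periods
  x <- gsub("[\u2018\u2019]", "'", x)    # curly single quotes -> straight
  x <- gsub("[\u201C\u201D]", "\"", x)   # curly double quotes -> straight
  x <- gsub("[\u2013\u2014]", "-", x)    # en/em dashes -> basic hyphen
  x
}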

leonawicz commented 5 years ago

For example, there is convenient use of tools::toTitleCase. You can also see some common cleanup substitutions here. Just stuff like that goes a long way to beautiful data 😃.
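For a quick illustration (the output shown is what I'd expect, not copied from trekdata):

tools::toTitleCase("the q and the grey")
#> [1] "The Q and the Grey"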

tylerlittlefield commented 5 years ago

Ah, I see. Consistency is important, no doubt.

In general, I've avoided nested data frames (even though I think it's a really cool idea) at all costs because I've struggled so much with these when pulling from some JSON API. Recent additions to tidyr like unnest_auto(), unnest_wider(), etc. look promising though.

Having said that, the point you made about one-row variables makes nested data frames seem appropriate and enjoyable. I can imagine filtering by a specific date range or set of authors and then unnesting the result.
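Something like this, say, assuming a hypothetical nested table scripts_nested with a one-row airdate column alongside the nested data list column:

library(dplyr)
library(tidyr)

scripts_nested %>%
  filter(airdate >= as.Date("1993-01-01"), airdate <= as.Date("1993-12-31")) %>%
  unnest(data)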

I didn't know tools had such a function, very cool. I've always used stringr::str_to_title(). I will take a closer look at trekdata, text parsing is an interesting and fun topic. Oh and since we're on the topic of text parsing, I figured you'd get a kick out of this issue (https://github.com/tyluRp/startrek/issues/1). Seems there is one odd ball script out of the entire TNG/DS9 series which wasn't consistent with all the others. Ugh!

leonawicz commented 5 years ago

I've added a new function to trekdata for compiling metadata associated with the scripts. It results in a data frame like this:

> (x <- trekdata::st_script_info())
# A tibble: 712 x 6
   format  series number season title                          url                                                  
   <chr>   <chr>   <dbl>  <int> <chr>                          <chr>                                                
 1 episode TOS         0      1 The Cage                       https://scifi.media/wp-content/uploads/t/os/s0-01.txt
 2 episode TOS         1      1 The Man Trap                   https://scifi.media/wp-content/uploads/t/os/s1-01.txt
 3 episode TOS         2      1 Charlie X                      https://scifi.media/wp-content/uploads/t/os/s1-02.txt
 4 episode TOS         3      1 Where No Man Has Gone Before   https://scifi.media/wp-content/uploads/t/os/s1-03.txt
 5 episode TOS         4      1 The Naked Time                 https://scifi.media/wp-content/uploads/t/os/s1-04.txt
 6 episode TOS         5      1 The Enemy Within               https://scifi.media/wp-content/uploads/t/os/s1-05.txt
 7 episode TOS         6      1 Mudd's Women                   https://scifi.media/wp-content/uploads/t/os/s1-06.txt
 8 episode TOS         7      1 What are Little Girls Made Of? https://scifi.media/wp-content/uploads/t/os/s1-07.txt
 9 episode TOS         8      1 Miri                           https://scifi.media/wp-content/uploads/t/os/s1-08.txt
10 episode TOS         9      1 Dagger of the Mind             https://scifi.media/wp-content/uploads/t/os/s1-09.txt
# ... with 702 more rows

I haven't yet filled in from other sources, but I've left room for what appears to be missing from the given website:

> dplyr::filter(x, is.na(url))
# A tibble: 5 x 6
  format  series number season title url  
  <chr>   <chr>   <dbl>  <int> <chr> <chr>
1 episode TNG         2      1 NA    NA   
2 episode DS9         2      1 NA    NA   
3 episode DS9        74      4 NA    NA   
4 episode VOY        15      1 NA    NA   
5 episode ENT         2      1 NA    NA   

I'm a little suspicious of these season 1 episode 2's. Wonder what that's about...

I've handled most of the text cleanup, but I'm unsure what to do about these numbers in parentheses. I want to remove them, but they don't seem to always indicate the same thing. Some can clearly be replaced with a ", Part 1" or ", Part 2" suffix to match other two-part episodes (one possible normalization is sketched after the output below). But for several of these, there is a (1) but no (2), which is confusing. And for example, although the Broken Bow episode says both 1/2 in the title and has a (1), I think it's the complete script.

> x$title[grep("\\(\\d\\)", x$title)]
 [1] "Scorpion (1)"               "Equinox (1)"                "Episode 1/2 Broken Bow (1)" "Shockwave (1)"              "Shockwave (2)"              "Azati Prime (1)"            "Damage (2)"                 "Storm Front (1)"           
 [9] "Storm Front (2)"            "Borderland (1)"             "Cold Station 12 (2)"        "The Augments (3)"           "The Forge (1)"              "Awakening (2)"              "Kir'Shara (3)"              "Babel One (1)"             
[17] "United (2)"                 "The Aenar (3)"              "Affliction (1)"             "Divergence (2)"             "In a Mirror, Darkly (1)"    "In a Mirror, Darkly (2)"    "Demons (1)"                 "Terra Prime (2)"

But this gives a really nice table for iterating over the urls for subsequent downloading and text processing.

I'm picturing the download function having an argument like keep_files = TRUE for retaining all the downloaded txt files for backup, with control over where to download them to and giving them more informative names on download using the available metadata in the table. I'll work on the download aspect and I'll follow up here when I have a working example. It's something that can match well with other scraping functionality in trekdata/rtrek.
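The caching behavior would look roughly like this (a sketch of the idea, not the actual trekdata code):

download_script <- function(url, dir, series, season, number, overwrite = FALSE) {
  # Informative local file name built from the metadata table
  file <- file.path(dir, sprintf("%s_s%02d_e%03d.txt", series,
                                 as.integer(season), as.integer(number)))
  # Only hit the remote source if the file isn't already cached locally
  if (overwrite || !file.exists(file)) {
    utils::download.file(url, file, quiet = TRUE)
  }
  file
}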

Oh, just saw your comment come through as I was pasting this one together. Haha! Yeah, there seems to always be at least one odd ball text file! 😂 Text parsing is always messy.

I've also been thinking about a function for converting a script in a row of the finalized data frame to a nicely formatted epub file for a more pleasant human reading experience. That would involve converting to R markdown and then rendering that intermediary file to epub with bookdown. But if I do this, it's a more general functionality I will build into epubr rather than just put a specific version in trekdata.
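Very roughly, and glossing over all the formatting details, it would be something like this (assuming a text list column holding the script lines for a row):

script_to_epub <- function(episode_row, rmd = "episode.Rmd") {
  lines <- episode_row$text[[1]]
  # Write the episode out as a minimal R Markdown file...
  writeLines(c(paste("#", episode_row$title), "", lines), rmd)
  # ...then render that intermediary file to epub with bookdown
  bookdown::render_book(rmd, output_format = bookdown::epub_book())
}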

leonawicz commented 5 years ago

Initial example. This will download all the available files:

x <- trekdata::st_script_download("data-raw/episode_scripts", TRUE)

Only tested it on Windows. In addition to the handful clearly missing that I mentioned above, it appears there are a few more that return a 404 file not found error as well. Warnings will show afterward to indicate this.

Warning messages:
1: In utils::download.file(x$url[i], file, quite = TRUE) :
  cannot open URL 'https://scifi.media/wp-content/uploads/t/voy/s5-16.txt': HTTP status was '404 Not Found'
2: In utils::download.file(x$url[i], file, quite = TRUE) :
  cannot open URL 'https://scifi.media/wp-content/uploads/t/voy/s7-10.txt': HTTP status was '404 Not Found'
3: In utils::download.file(x$url[i], file, quite = TRUE) :
  cannot open URL 'https://scifi.media/wp-content/uploads/t/voy/s7-26.txt': HTTP status was '404 Not Found'
> x
# A tibble: 707 x 7
   format  series number season title                          url                                                   text       
   <chr>   <chr>   <dbl>  <int> <chr>                          <chr>                                                 <list>     
 1 episode TOS         0      1 The Cage                       https://scifi.media/wp-content/uploads/t/os/s0-01.txt <chr [634]>
 2 episode TOS         1      1 The Man Trap                   https://scifi.media/wp-content/uploads/t/os/s1-01.txt <chr [495]>
 3 episode TOS         2      1 Charlie X                      https://scifi.media/wp-content/uploads/t/os/s1-02.txt <chr [511]>
 4 episode TOS         3      1 Where No Man Has Gone Before   https://scifi.media/wp-content/uploads/t/os/s1-03.txt <chr [596]>
 5 episode TOS         4      1 The Naked Time                 https://scifi.media/wp-content/uploads/t/os/s1-04.txt <chr [521]>
 6 episode TOS         5      1 The Enemy Within               https://scifi.media/wp-content/uploads/t/os/s1-05.txt <chr [630]>
 7 episode TOS         6      1 Mudd's Women                   https://scifi.media/wp-content/uploads/t/os/s1-06.txt <chr [676]>
 8 episode TOS         7      1 What are Little Girls Made Of? https://scifi.media/wp-content/uploads/t/os/s1-07.txt <chr [528]>
 9 episode TOS         8      1 Miri                           https://scifi.media/wp-content/uploads/t/os/s1-08.txt <chr [671]>
10 episode TOS         9      1 Dagger of the Mind             https://scifi.media/wp-content/uploads/t/os/s1-09.txt <chr [474]>
# ... with 697 more rows

The first time this is run, it takes a while to download all the files. But if you create a directory like above and set the download directory, and set keep = TRUE, then you will have all the files. Afterward, recompiling the table is much quicker because it works with local files if they are in the given directory (unless you add overwrite = TRUE, which will force a redownload of an existing file).

As a stand-in for now I just read the lines of text into a text column in the data frame. But this is where any text processing would occur, and this could be a list of data frames instead of character vectors.
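Conceptually that stand-in step is just something like this (assuming a hypothetical file column of local paths from the download step):

library(dplyr)
library(purrr)

# Read each downloaded script into a list column of character vectors
x <- mutate(x, text = map(file, readLines))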

leonawicz commented 5 years ago

trekdata::st_script_download is now up to date. I sorted out the various issues and the resulting table merges scripts and metadata from two websites now. Looks like the download now includes all scripts.

tylerlittlefield commented 5 years ago

Awesome! That was quick! I will have to play around with it this afternoon, very cool 👍

leonawicz commented 5 years ago

There's also st_script_text_df. The output is probably not identical to yours but may be pretty close. It looks like it worked on the TNG scripts I was comparing it with. I haven't handled the case for caption-derived scripts yet, but it will probably be simpler at least. 😃

leonawicz commented 5 years ago

Whew! A "final initial" example. I've added st_transcripts to rtrek. It covers all episodes and movies (apart from the reboot movies and Discovery of course). It's far from perfect, but good enough to put into production, a good first version. Here's an example doc.

Parsing these scripts was brutal lol. Enough so that I'm actually looking forward to thinking about text mining some novels again (I can regret this later 😂 )