vosonlab / vosonSML

R package for collecting social media data and creating networks for analysis.
https://vosonlab.github.io/vosonSML/
GNU General Public License v3.0

Is Create(<actor.reddit>) creating spurious edges? #54

Closed sbinn closed 4 months ago

sbinn commented 6 months ago

Apologies if I missed something in the code or the documentation, but it seems to me that Create.actor.reddit.R somehow creates extra edges. I am using vosonSML version 0.34.1.

Example:

library(vosonSML)
library(igraph)  # provides V(), used below

thread_urls <- c("https://www.reddit.com/r/AusFinance/comments/ugetai/")
rd_data <- Authenticate("reddit") |>
  Collect(threadUrls = thread_urls,
          sort = "best",
          waitTime = c(6, 8),
          writeToFile = TRUE,
          verbose = TRUE)

# Remove rows that have 'NA'
rd_data <- rd_data[complete.cases(rd_data), ]

rd_actor_graph <- rd_data |>
  Create("actor") |>
  AddText(rd_data,
          verbose = TRUE) |>
  Graph()

# Replace node IDs with actual user names
V(rd_actor_graph)$name <- V(rd_actor_graph)$user

As of collection date, this returned 1259 rows for rd_data (i.e., comments/replies) and 550 nodes with 10218 edges for rd_actor_graph. The number of nodes matches the number of unique users, so that's fine. However, looking at the edges, I'm a bit puzzled.

For example, if I look at the Reddit thread on the Web, I can see the following comment/replies. (I used the same sorting method: 'Best'.) There is one top-level comment by u/Ferox101, a reply by u/brednoq, and a reply to that by u/Ferox101. [Image: Reddit]

In R, I can see the following corresponding rows: [Image: R Data]

The first thing that I notice is that for each comment/reply that u/Ferox101 made, there are two rows in rd_data. The two respective rows differ only by their values for

I noticed the second thing when looking at the graph in Gephi. I imported the graph using "Don't Merge" for parallel edges. A lot of edges seem to have been created for the reply by u/Ferox101: [Image: Gephi]

There is one edge from u/Ferox101 (n498) to u/without_my_remorse (n403, the author of the main post in the thread) representing the top-level comment ("I think you've missed a crucial word..."). You can see that edge just above the blue highlighted area. That one edge is correct.

However, there are 24 edges (highlighted in blue) from u/Ferox101 to u/without_my_remorse representing the reply "Yeah, it's pretty cheeky...". Where do all these edges come from?
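One way to quantify this without Gephi is to tally how many edges exist for each ordered pair of users. The following is only a minimal sketch in base R: the helper name `count_parallel_edges` and the toy edge list are hypothetical and are not part of vosonSML or igraph.

```r
# Count how many edges exist for each ordered (from, to) pair.
# `el` is a two-column edge list (matrix or data frame).
# Hypothetical helper for illustration only.
count_parallel_edges <- function(el) {
  el <- as.data.frame(el, stringsAsFactors = FALSE)
  names(el)[1:2] <- c("from", "to")
  el$n <- 1L
  counts <- aggregate(n ~ from + to, data = el, FUN = sum)
  # Most heavily duplicated pairs first
  counts[order(-counts$n), , drop = FALSE]
}

# Toy edge list standing in for the actor network (made-up data)
toy_edges <- data.frame(
  from = c("Ferox101", "Ferox101", "Ferox101", "brednoq"),
  to   = c("without_my_remorse", "without_my_remorse",
           "without_my_remorse", "Ferox101"),
  stringsAsFactors = FALSE
)

edge_counts <- count_parallel_edges(toy_edges)
print(edge_counts)
```

With an igraph object, `igraph::as_edgelist(rd_actor_graph)` returns a two-column matrix that could be passed to the helper directly.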

bryn-g commented 6 months ago

Hi @sbinn, thank you for posting such a detailed report and apologies for the slow response. I will investigate and see if I can determine what is going wrong here.

bryn-g commented 5 months ago

Hi @sbinn,

Thank you again for your report. There was definitely a problem occurring, as detailed in your example. The thread you posted in your code shows only 470 comments on Reddit, so the collected data should contain roughly that number of rows.

The problem was that shorter thread URLs, which don't have the title part in them, were not resolving correctly to continue threads. Instead, these were resolving back to the main thread, resulting in duplication of all of the main thread comments and in incorrect, duplicated thread structures.
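Duplication of this kind can be spotted in collected data by looking for rows that repeat the same author and comment text. This is a rough sketch only: the toy data frame and its column names `user` and `comment` are assumptions for illustration and may not match the actual collected data.

```r
# Toy stand-in for collected Reddit data (made-up rows; the column
# names `user` and `comment` are assumed for illustration).
rd_toy <- data.frame(
  user = c("Ferox101", "Ferox101", "brednoq", "Ferox101", "Ferox101"),
  comment = c("I think you've missed a crucial word...",
              "I think you've missed a crucial word...",
              "(reply text)",
              "Yeah, it's pretty cheeky...",
              "Yeah, it's pretty cheeky..."),
  stringsAsFactors = FALSE
)

# Build a key per row from author and comment text
key <- paste(rd_toy$user, rd_toy$comment, sep = "\r")

# Rows whose key occurs more than once (all copies, not just the repeats)
dup_rows <- rd_toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]

# Number of distinct comments once duplicates are collapsed
n_unique <- length(unique(key))

print(dup_rows)
print(n_unique)
```

If the number of distinct comments is far below the row count, as with the 1259 rows versus roughly 470 comments here, that points to duplicated collection rather than real replies.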

I have updated the package in commit 46f44d4 to work correctly with the shorter URLs as in your example. Alternatively, you can use the longer URL format with the previous package version and it should also work as expected (e.g. https://www.reddit.com/r/AusFinance/comments/ugetai/real_wages_have_fallen_back_to_2014_levels/).
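For anyone staying on the previous package version, a quick check of which thread URLs are in the longer format might look like the sketch below. The helper `has_title_slug` is hypothetical and not part of vosonSML, and the pattern is an assumption based only on the two URL shapes shown in this thread.

```r
# Hypothetical helper: TRUE when a Reddit thread URL includes a title
# segment after the thread ID, FALSE for the shorter format.
has_title_slug <- function(urls) {
  pattern <- "^https?://(www\\.)?reddit\\.com/r/[^/]+/comments/[^/]+/[^/]+/?$"
  grepl(pattern, urls)
}

urls <- c(
  "https://www.reddit.com/r/AusFinance/comments/ugetai/",
  "https://www.reddit.com/r/AusFinance/comments/ugetai/real_wages_have_fallen_back_to_2014_levels/"
)

slug_check <- has_title_slug(urls)
print(data.frame(url = urls, has_title = slug_check))
```

URLs flagged as lacking the title segment are the ones affected by the problem described above on versions before the fix.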

You will not need to remove any NA entries as in your code example when it is functioning correctly.

sbinn commented 4 months ago

Hi @bryn-g, thanks a lot for the quick resolution. I would love to close the issue but haven't been able to test the solution in detail, as I've run into another issue with the Reddit collection that has now cropped up: #55