google / project-OCEAN

Project OCEAN is an open science collaboration focused on understanding open source ecosystems, creating datasets that enable research, and forming a clear understanding of the state of open source communities.
https://vermontcomplexsystems.org/partner/OCEAN/
Apache License 2.0

Google Groups formatting changed, unit test issues #94

Closed: glasnt closed this issue 1 year ago

glasnt commented 1 year ago

TL;DR: Google Groups ingestion is currently not working, because changes to Google Groups cause the scraping code to fail.

Discovered while trying to update dependencies.

Zero topics

Monthly pipeline processing was showing 0 topics returned:

2022/11/01 08:01:32 GOOGLEGROUPS loading golang-checkins:
2022/11/01 08:01:32 All topics captured: total topics captured are 0.

Checking the Go code for how topic counts are captured, the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).

E.g. https://groups.google.com/g/golang-checkins shows "1–30 of 81553", where the dash is \u2013 EN DASH. The regex in getTotalTopics specifies - (\u002D HYPHEN-MINUS).
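The mismatch described above can be illustrated with a pattern that accepts either dash. This is only a minimal sketch: the pattern, the variable totalTopicsRe, and the function parseTotalTopics are hypothetical and are not the project's actual getTotalTopics code, and the sketch assumes the count still appears as plain text in the form "1–30 of 81553".

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// totalTopicsRe is a hypothetical pattern for pagination text such as
// "1–30 of 81553". The character class accepts both HYPHEN-MINUS (U+002D)
// and EN DASH (U+2013), so either UI variant matches.
var totalTopicsRe = regexp.MustCompile(`\d+\s*[-\x{2013}]\s*\d+\s+of\s+(\d+)`)

// parseTotalTopics extracts the total topic count from the pagination text.
// It returns an error instead of 0 when nothing matches, so that a format
// change is reported rather than silently treated as "no topics".
func parseTotalTopics(pageText string) (int, error) {
	m := totalTopicsRe.FindStringSubmatch(pageText)
	if m == nil {
		return 0, fmt.Errorf("no topic count found in %q", pageText)
	}
	return strconv.Atoi(m[1])
}

func main() {
	// First input uses EN DASH, second uses HYPHEN-MINUS.
	for _, s := range []string{"1\u201330 of 81553", "1-30 of 81553"} {
		n, err := parseTotalTopics(s)
		fmt.Println(n, err)
	}
}
```

Returning an error on a non-match is a design choice for this sketch; it would make a UI change show up in the pipeline logs instead of as a count of 0.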

So because the topic counts are 0, it's affecting loops later on (in my estimation).

Nested unit tests

Additionally, when trying to run the unit tests, it appears that running just mailinglists/ doesn't run the tests for the nested mailing lists, so the unit tests for googlegroups weren't being run (and are currently failing).

Failing topic unit tests

Now running the unit tests:

=== RUN   TestTopicIDToRawMsgUrlMap/Pull_topic_ids_for_date
2022/11/15 22:40:43 No message ID found in topicId: 8sv65_WCOS4.
    googlegroups_data_test.go:300: Result response does not match.
         got: map[2018-09.txt:[]]
        want: map[2018-09.txt:[https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ]]

Infinite redirects

This URL format is no longer valid: trying to curl it gets stuck in an infinite 301 redirect loop, with the response pointing back at the same URL:

$ curl https://groups.google.com/forum/message/raw\?msg\=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ
<HTML>
<HEAD>
<TITLE>Moved Permanently</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Permanently</H1>
The document has moved <A HREF="https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ">here</A>.
</BODY>
</HTML>

Summary

Bringing this code back to a working state is going to take some re-engineering to work out what has changed in the Google Groups format.

amcasari commented 1 year ago

Updating from offline discussions:

As part of Project OCEAN's Open Source Data Ecosystem, @nyghtowl (Xoogler) and members of the 20% Dive Crew scoped, designed, and built a data pipeline to aggregate mailing lists from multiple communities, including Python, Angular, and Go.

This dataset was used in multiple research projects with our academic partners, including an accepted dataset track submission at MSR 2022.

As outlined by @glasnt, there are several updates that need to be made in the open source project and the GCP project to maintain this dataset. Polling our research stakeholders, this dataset is not currently being used for any ongoing research project.

Any changes currently made would most likely need to be maintained with future open source dependency version changes, GCP product updates, and Google Groups API/RSS supported features.

Rather than update a project no one is using, we are going to put it all on the shelf with proper documentation for future explorers and experimentation.

glasnt commented 1 year ago

Closing, see #97