google / project-OCEAN

Project OCEAN is an open science collaboration focused on understanding open source ecosystems, creating datasets that enable research, and forming a clear understanding of the state of open source communities.
https://vermontcomplexsystems.org/partner/OCEAN/
Apache License 2.0
49 stars, 19 forks

Improve To parsing in the mailing list data that is loaded to BQ #42

Closed nyghtowl closed 1 year ago

nyghtowl commented 3 years ago

Expected Behavior

The mailing list To field should be populated with the person that the email is responding to.

Actual Behavior

In the python_mailinglist table, there are some messages where the recipient shows up in the body but does not populate the To field.

Body: "B Zy < zy at gmail.com> wrote:

Hello Help my code."

The To field is capturing the mailing list name instead.

Steps to Reproduce the Problem

  1. Review python mailing list examples
  2. Improve parsing in the extract_msgs script, probably a regex for the body
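
A minimal sketch of the kind of body regex step 2 suggests, assuming the quoted-reply header always looks like the example above (`Name < user at domain> wrote:`). The names `QUOTE_HEADER` and `extract_reply_to` are hypothetical and are not taken from the extract_msgs script. Other quoting styles in the real data would need additional patterns.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for a quoted-reply header such as
#   B Zy < zy at gmail.com> wrote:
# It accepts either "@" or the obfuscated " at " between user and domain.
QUOTE_HEADER = re.compile(
    r"^(?P<name>[^<>\n]*?)[ \t]*"      # display name (may be empty)
    r"<[ \t]*(?P<local>[^\s<>@]+)"     # local part of the address
    r"[ \t]*(?:@|\bat\b)[ \t]*"        # "@" or obfuscated "at"
    r"(?P<domain>[^\s<>]+)[ \t]*>"     # domain, then closing bracket
    r"[ \t]*wrote:",
    re.MULTILINE,
)


def extract_reply_to(body: str) -> Optional[Tuple[str, str]]:
    """Return (name, address) from the first quoted-reply header, or None."""
    match = QUOTE_HEADER.search(body)
    if match is None:
        return None
    name = match.group("name").strip()
    address = f"{match.group('local')}@{match.group('domain')}"
    return name, address


if __name__ == "__main__":
    sample = "B Zy < zy at gmail.com> wrote:\n\nHello Help my code."
    print(extract_reply_to(sample))
```

One possible use is to prefer this value for To when it is present and fall back to the header value (the mailing list name) otherwise. Whether that matches the intended schema is an assumption.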

nyghtowl commented 3 years ago

Found that To is still not fully parsed after everything was reloaded into BigQuery. The code and tests need to be reviewed against the actual data in BQ to see what the disconnect is.

glasnt commented 1 year ago

Closing, see #97