siemens / codeface

Codeface is a framework for analysing technical and social aspects of software development

Improve e-mail-address processing #34

Open clhunsen opened 9 years ago

clhunsen commented 9 years ago

Problem

Codeface currently mishandles the following two From lines, as found in mbox files:

From: ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)
From: Hans Huber <huber at hubercorp.com>

The string " at " is replaced by "@" in the first case, but not in the second. In the first case, however, the name is not parsed properly: it is actually Zsbán Ambrus, but the encoded string is stored as is in the database.

mysql> select name, email1 from person;
+-------------------------------+------------------------+
| name                          | email1                 |
+-------------------------------+------------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu     |
| Hans Huber                    | huber at hubercorp.com |
+-------------------------------+------------------------+

Fix for case two

The following patch by @wolfgangmauerer (taken from the mailing list, tested by me) handles e-mail addresses more robustly and transforms the second case correctly.

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
     ## Trim trailing and leading whitespace
     author <- str_trim(author)

+    ## Replace textual ' at  ' with @, sometimes
+    ## we can recover an email
+    author <- sub(' at ', '@', author)
+    author <- sub(' AT ', '@', author)
+
     ## Check if email exists
     email.exists <- grepl("<.+>", author, TRUE)

@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
                    "<xxxyyy@abc.tld>); attempting to recover from: ", author)
       logdevinfo(msg, logger="ml.analysis")

-      ## Replace textual ' at  ' with @, sometimes
-      ## we can recover an email
-      author <- sub(' at ', '@', author)
-      author <- sub(' AT ', '@', author)
-
       ## Check for @ symbol
       r <- regexpr("\\S+@\\S+", author, TRUE)
       email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
         ## string minus the new email part as name, and construct
         ## a valid name/email combination
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
       }

       ## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
       author <- paste(name, ' <', email, '>', sep="")
     }
     else {
-      ## Verify that the order is correct
+      ## There is a correct email address. Ensure that the order is correct
+      ## and fix cases like "<hans.huber@hubercorp.com> Hans Huber"
+
       ## Get email and name parts
       r <- regexpr("<.+>", author, TRUE)
       if(r[[1]] == 1) {
         email <- substr(author, r, r + attr(r,"match.length")-1)
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
         email <- str_trim(email)
         author <- paste(name,email)
       }
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index 963cd2d..f596829 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -445,3 +445,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {

   return(global.id)
 }
+
+## Given a name with leading and trailing whitespace that is possibly
+## surrounded by parentheses, return the name proper.
+fix.name <- function(name) {
+    name <- str_trim(name)
+    if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+                                            str_length(name)) == ")") {
+        name <- substr(name, 2, str_length(name)-1)
+    }
+
+    return (name)
+}

After applying the patch, the database contains:

mysql> select name, email1 from person;
+-------------------------------+----------------------+
| name                          | email1               |
+-------------------------------+----------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu   |
| Hans Huber                    | huber@hubercorp.com  |
+-------------------------------+----------------------+
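
The name in the first row is still stored in its RFC 2047 encoded-word form. As a rough illustration of what decoding it could look like, here is a minimal base-R sketch. The helper decode.q.word is hypothetical and not part of Codeface or the patch above; it assumes a single "Q"-encoded word and ignores "B" (base64) encoding and names made up of several encoded words.

```r
## Hypothetical helper, not part of Codeface: decode a single RFC 2047
## "Q"-encoded word such as "=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=".
decode.q.word <- function(word) {
    m <- regmatches(word,
                    regexec("^=\\?([^?]+)\\?[Qq]\\?([^?]*)\\?=$", word))[[1]]
    if (length(m) != 3) {
        ## Not a single Q-encoded word: return the input unchanged
        return(word)
    }
    charset <- m[2]
    ## In Q encoding, "_" stands for a space
    payload <- gsub("_", " ", m[3], fixed=TRUE)

    ## Walk over the payload and collect the raw bytes; "=XX" is one byte
    ## given as two hex digits, everything else is taken literally
    chars <- strsplit(payload, "")[[1]]
    bytes <- raw(0)
    i <- 1
    while (i <= length(chars)) {
        hex <- if (i + 2 <= length(chars)) {
            paste(chars[i + 1], chars[i + 2], sep="")
        } else {
            ""
        }
        if (chars[i] == "=" && grepl("^[0-9A-Fa-f]{2}$", hex)) {
            bytes <- c(bytes, as.raw(strtoi(hex, base=16L)))
            i <- i + 3
        } else {
            bytes <- c(bytes, charToRaw(chars[i]))
            i <- i + 1
        }
    }

    ## Interpret the bytes in the declared charset and convert to UTF-8
    return(iconv(rawToChar(bytes), from=charset, to="UTF-8"))
}

name <- decode.q.word("=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=")
```

With the From line from above, this should give the name Zsbán Ambrus. Strings that do not look like a Q-encoded word are passed through unchanged, so the helper could be applied to every name after fix.name.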

Things to do

clhunsen commented 9 years ago

I had another rough thought on this: what about making the patterns that Codeface supports explicit in the source code?

To be more specific, I thought about having one explicit pattern definition (i.e., a regex) and one transformation (i.e., the regex replacement or rewriting) for each pattern Codeface supports, so that every supported pattern is transformed into the one form we want to have (Hans Huber <huber@hubercorp.com>). In addition, we would need a routine for malformed strings that handles missing e-mail addresses, missing names, and similar cases.

This way, we only need to take care of one pattern for extraction, and the different patterns would be explicit and transparent in the source code, too.
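
To make the idea a bit more concrete, here is a minimal sketch in base R. All names (addr, author.patterns, normalise.author) are made up for illustration and do not exist in Codeface; the three patterns are just the shapes mentioned in this issue, not a complete list.

```r
## Hypothetical sketch, not Codeface code. An address is "local@domain",
## where the "@" may also be spelled " at " or " AT ". The two capture
## groups are the local part and the domain.
addr <- "([^\\s<>()]+)(?:@| at | AT )([^\\s<>()]+)"

## One entry per supported From-line shape: a regex that detects the shape
## and a rewrite that turns it into "Name <local@domain>"
author.patterns <- list(
    ## Hans Huber <huber at hubercorp.com>
    list(regex=paste0("^(.*\\S)\\s*<", addr, ">$"),
         rewrite="\\1 <\\2@\\3>"),
    ## <hans.huber@hubercorp.com> Hans Huber
    list(regex=paste0("^<", addr, ">\\s*(.*\\S)$"),
         rewrite="\\3 <\\1@\\2>"),
    ## huber at hubercorp.com (Hans Huber)
    list(regex=paste0("^", addr, "\\s*\\((.*)\\)$"),
         rewrite="\\3 <\\1@\\2>")
)

normalise.author <- function(author) {
    author <- trimws(author)
    for (p in author.patterns) {
        if (grepl(p$regex, author, perl=TRUE)) {
            return(sub(p$regex, p$rewrite, author, perl=TRUE))
        }
    }
    ## Malformed string (missing name, missing address, ...): a real
    ## implementation would call a separate recovery routine here
    return(NA_character_)
}

normalise.author("Hans Huber <huber at hubercorp.com>")
normalise.author("<hans.huber@hubercorp.com> Hans Huber")
normalise.author("ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)")
```

The first matching entry wins, so the order of the list matters once patterns start to overlap. Supporting a new shape would then mean adding one entry to the list rather than another branch in check.corpus.precon.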