Improve e-mail-address processing

Problem

Codeface currently mishandles the following two From lines from mbox files:

From: ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)
From: Hans Huber <huber at hubercorp.com>

In the first case, the string " at " is replaced by "@"; in the second case, it is not. In addition, the name in the first case is not decoded: it should read Zsbán Ambrus, but the encoded string is stored as is in the database.

mysql> select name, email1 from person;
+-------------------------------+------------------------+
| name                          | email1                 |
+-------------------------------+------------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu     |
| Hans Huber                    | huber at hubercorp.com |
+-------------------------------+------------------------+

Fix for case two

The following patch by @wolfgangmauerer (taken from the mailing list and tested by me) makes the handling of e-mail addresses more robust and transforms the second case correctly.

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
     ## Trim trailing and leading whitespace
     author <- str_trim(author)

+    ## Replace textual ' at  ' with @, sometimes
+    ## we can recover an email
+    author <- sub(' at ', '@', author)
+    author <- sub(' AT ', '@', author)
+
     ## Check if email exists
     email.exists <- grepl("<.+>", author, TRUE)

@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
                    "<xxxyyy@abc.tld>); attempting to recover from: ", author)
       logdevinfo(msg, logger="ml.analysis")

-      ## Replace textual ' at  ' with @, sometimes
-      ## we can recover an email
-      author <- sub(' at ', '@', author)
-      author <- sub(' AT ', '@', author)
-
       ## Check for @ symbol
       r <- regexpr("\\S+@\\S+", author, TRUE)
       email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
         ## string minus the new email part as name, and construct
         ## a valid name/email combination
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
       }

       ## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
       author <- paste(name, ' <', email, '>', sep="")
     }
     else {
-      ## Verify that the order is correct
+      ## There is a correct email address. Ensure that the order is correct
+      ## and fix cases like "<hans.huber@hubercorp.com> Hans Huber"
+
       ## Get email and name parts
       r <- regexpr("<.+>", author, TRUE)
       if(r[[1]] == 1) {
         email <- substr(author, r, r + attr(r,"match.length")-1)
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
         email <- str_trim(email)
         author <- paste(name,email)
       }
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index 963cd2d..f596829 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -445,3 +445,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {

   return(global.id)
 }
+
+## Given a name with leading and pending whitespace that is possibly
+## surrounded by braces, return the name proper.
+fix.name <- function(name) {
+    name <- str_trim(name)
+    if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+                                            str_length(name)) == ")") {
+        name <- substr(name, 2, str_length(name)-1)
+    }
+
+    return (name)
+}

After applying the patch, the database contains:

mysql> select name, email1 from person;
+-------------------------------+----------------------+
| name                          | email1               |
+-------------------------------+----------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu   |
| Hans Huber                    | huber@hubercorp.com  |
+-------------------------------+----------------------+
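
The effect of the patch on the two example lines can be reproduced in isolation. The following is a minimal sketch, not the Codeface code itself: normalise.author() is a hypothetical helper introduced here for illustration, base R (trimws, nchar) stands in for stringr, and the second input assumes that the address was given in angle brackets, as the database output suggests. It applies the same order of operations as the patch: replace " at " first, then look for an address.

```r
## Illustrative sketch only: normalise.author() is a hypothetical helper,
## not a Codeface function. Base R replaces the stringr calls of the patch.

## Same idea as fix.name() in the patch: trim whitespace, then drop one
## pair of enclosing parentheses if present.
fix.name <- function(name) {
  name <- trimws(name)
  n <- nchar(name)
  if (n >= 2 && substr(name, 1, 1) == "(" && substr(name, n, n) == ")") {
    name <- substr(name, 2, n - 1)
  }
  name
}

normalise.author <- function(author) {
  author <- trimws(author)

  ## Replace the first textual " at " before looking for an address, so
  ## that addresses inside <...> are repaired as well.
  author <- sub(" at ", "@", author, fixed = TRUE)
  author <- sub(" AT ", "@", author, fixed = TRUE)

  if (grepl("<.+>", author)) {
    r <- regexpr("<.+>", author)
  } else {
    r <- regexpr("\\S+@\\S+", author, perl = TRUE)
  }
  if (r[[1]] == -1) {
    ## No address could be recovered from this line
    return(c(name = fix.name(author), email = NA_character_))
  }

  email <- substr(author, r, r + attr(r, "match.length") - 1)
  name <- fix.name(sub(email, "", author, fixed = TRUE))
  ## Strip the angle brackets, if any, from the address
  email <- gsub("^<|>$", "", email)
  c(name = name, email = email)
}

normalise.author("ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)")
normalise.author("Hans Huber <huber at hubercorp.com>")
```

Note that sub() replaces only the first occurrence of " at ", so a name that itself contains " at " ahead of the address would be altered instead of the address. The name in the first call is still returned in its encoded form.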

Things to do

[x] Incorporate the patch into Codeface core
[ ] Fix the encoding problem (decode names such as =?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)
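
One possible direction for the open item: the name in the first case is a MIME encoded word (RFC 2047) of the form =?charset?Q?text?=. The following is a rough sketch of how such a word could be decoded in base R, not a proposed patch. decode.q.word() is a hypothetical helper; it only handles a single Q-encoded word that spans the whole string, and it does not validate malformed "=XX" escapes or handle B (base64) encoding.

```r
## Illustrative sketch only: decode.q.word() is a hypothetical helper, not
## part of Codeface. Strings that are not a single Q-encoded word are
## returned unchanged.
decode.q.word <- function(x) {
  m <- regmatches(x, regexec("^=\\?([^?]+)\\?[Qq]\\?([^?]*)\\?=$", x))[[1]]
  if (length(m) == 0) {
    return(x)  # not an encoded word of the supported form
  }
  charset <- m[2]

  ## In Q encoding, "_" stands for a space and "=XX" for the byte 0xXX
  bytes <- charToRaw(gsub("_", " ", m[3], fixed = TRUE))
  out <- raw(0)
  i <- 1
  while (i <= length(bytes)) {
    if (bytes[i] == charToRaw("=") && i + 2 <= length(bytes)) {
      hex <- rawToChar(bytes[(i + 1):(i + 2)])
      out <- c(out, as.raw(strtoi(hex, base = 16L)))
      i <- i + 3
    } else {
      out <- c(out, bytes[i])
      i <- i + 1
    }
  }

  ## Convert from the declared charset to UTF-8
  iconv(rawToChar(out), from = charset, to = "UTF-8")
}

decode.q.word("=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=")
decode.q.word("Hans Huber")
```

If something along these lines were adopted, a natural place to call it would be on the name after fix.name() has removed the surrounding parentheses; whether that fits the rest of the mailing-list analysis is an open question.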
