Replace 'get_attachment_leaves' with direct call to mail gem

crowbot commented 10 years ago

Given that the recent bug report against the mail parsing in Alaveteli isn't seen when using the attachments method directly from the mail gem, it seems like we might be able to use mail gem methods more directly and possibly get rid of some code from mail_handler/backends/mail_backend.rb. As @garethrees points out, we'd need some better tests of current expected output first.

jpmckinney commented 10 years ago

@crowbot Why not release all WDTK raw messages along with their expected, processed output? It seems a lot is held up by a fear that fixing one edge case may break an old edge case. If we can just run parse_raw_email! over all those messages and see if what comes out is the same as before, we can start being a lot less careful about refactoring.

crowbot commented 10 years ago

Hi @jpmckinney, sorry you got the run around looking at different pull requests, and thanks for your comments. There are a few considerations that would challenge the idea of making all WDTK emails public:

It's quite a large data set - half a million emails. Running any change over all of them is going to take a significant amount of time. The format differences in this data set could be represented by a much smaller sample set.
Depending on what you mean by 'release', I think the unredacted email addresses in the emails may pose a problem
There is a difference between a computationally identical processed output and one that is semantically identical - we saw this in our preparation for the move to Rails 3, which invloves using the Mail gem instead of TMail - although the results of parsing were semantically the same in lots of cases, there were trivial differences in them.

Given these considerations, I think my preference at the moment is still for fleshing out the test suite with more examples of raw emails and expected output to give us more refactoring confidence. Perhaps we can generate some test examples automaticall from the WDTK data set to accelerate this process though.

jpmckinney commented 10 years ago

The only way to find out how long it actually takes is to try it. We can try running the current parse_raw_email! over those messages, save the results as our "target output" and see how long it takes.

Now let's say we make some drastic changes to mail handling, and then run it over the dataset - maybe it takes a day, we don't know. If it produces tens of thousands of false positives due to being semantically but not computationally identical, there are two possibilities: either it's possible to correct the tests to avoid false positives, or it turns out this exercise was a mistake since no one is going to filter through that many failures. On the other hand, if there are a manageable number of failures, you don't need to run the test against the whole dataset again - you just run it against the failing messages which will be fast. One all of them pass, then you run it over the full dataset again before committing. You can do other things while the tests run. No lack of things to do!

If I git log spec/fixtures or more specifically git log spec/fixtures/files, I see that raw emails are being added to the test suite at a glacial pace, so I think a bit of creativity and experimentation couldn't hurt. Indeed, the above process may identify the raw emails that ought to be added to the test suite, so that the above process needn't be used frequently.

crowbot commented 10 years ago

@mhl you've run email parsing over all the mails in WDTK - how long did it take?

crowbot commented 10 years ago

To clarify, I was suggesting a more concerted effort to add mail parsing examples to the test suite as an alternative to reparsing the entire WDTK data set on each change, rather than adding examples as they come up (which is what we've been doing up to now).

When I say:

Perhaps we can generate some test examples automaticall from the WDTK data set to accelerate this process though.

and you say:

Indeed, the above process may identify the raw emails that ought to be added to the test suite, so that the above process needn't be used frequently.

I think we're in violent agreement :)

mysociety / alaveteli

Replace 'get_attachment_leaves' with direct call to mail gem #1476