Open crowbot opened 10 years ago
@crowbot Why not release all WDTK raw messages along with their expected, processed output? It seems a lot is held up by a fear that fixing one edge case may break an old edge case. If we can just run parse_raw_email!
over all those messages and see if what comes out is the same as before, we can start being a lot less careful about refactoring.
Hi @jpmckinney, sorry you got the run around looking at different pull requests, and thanks for your comments. There are a few considerations that would challenge the idea of making all WDTK emails public:
Given these considerations, I think my preference at the moment is still for fleshing out the test suite with more examples of raw emails and expected output to give us more refactoring confidence. Perhaps we can generate some test examples automaticall from the WDTK data set to accelerate this process though.
The only way to find out how long it actually takes is to try it. We can try running the current parse_raw_email!
over those messages, save the results as our "target output" and see how long it takes.
Now let's say we make some drastic changes to mail handling, and then run it over the dataset - maybe it takes a day, we don't know. If it produces tens of thousands of false positives due to being semantically but not computationally identical, there are two possibilities: either it's possible to correct the tests to avoid false positives, or it turns out this exercise was a mistake since no one is going to filter through that many failures. On the other hand, if there are a manageable number of failures, you don't need to run the test against the whole dataset again - you just run it against the failing messages which will be fast. One all of them pass, then you run it over the full dataset again before committing. You can do other things while the tests run. No lack of things to do!
If I git log spec/fixtures
or more specifically git log spec/fixtures/files
, I see that raw emails are being added to the test suite at a glacial pace, so I think a bit of creativity and experimentation couldn't hurt. Indeed, the above process may identify the raw emails that ought to be added to the test suite, so that the above process needn't be used frequently.
@mhl you've run email parsing over all the mails in WDTK - how long did it take?
To clarify, I was suggesting a more concerted effort to add mail parsing examples to the test suite as an alternative to reparsing the entire WDTK data set on each change, rather than adding examples as they come up (which is what we've been doing up to now).
When I say:
Perhaps we can generate some test examples automaticall from the WDTK data set to accelerate this process though.
and you say:
Indeed, the above process may identify the raw emails that ought to be added to the test suite, so that the above process needn't be used frequently.
I think we're in violent agreement :)
Given that the recent bug report against the mail parsing in Alaveteli isn't seen when using the attachments method directly from the mail gem, it seems like we might be able to use mail gem methods more directly and possibly get rid of some code from
mail_handler/backends/mail_backend.rb
. As @garethrees points out, we'd need some better tests of current expected output first.