We make some attempt to drop entirely empty rows in commits 3a8ed9369ebe3177e1ce4ab8baefdf4b7a79b40a, 0b3bc83721e3ea710727619b128c6ee9a5c2b64d and e7f1c16ebe1222a7ab1f0ddfaf4eaf39dd57322e. It seems to work pretty well, but there are tradeoffs involved. In particular, we blank out information, so we have no way of knowing whether surviving "empty" cells are actually empty on the page or just have not been filled in. Blanking out information also has (maybe positive, maybe negative) consequences for reduction, as IIRC empty cells are not included when reducing text.
We currently also drop blanked-out rows from the views file, on the principle that real data might turn up for them later, so we want to make sure that we re-read such a row any time we read in a new tranche of data.
This issue just exists to create some time to think through the consequences, and perhaps to improve matters. For example, it might make more sense to identify and remove empty rows before the reducer runs. We would then handle this case entirely in extract.py.
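As a rough illustration of what filtering before the reducer might look like, here is a minimal sketch. It assumes rows are available as dictionaries mapping column names to strings; the names `is_empty_row`, `split_empty_rows` and the `ignore` parameter are hypothetical and are not taken from extract.py.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: one extracted row is a mapping of column name -> cell text.
Row = Dict[str, Optional[str]]


def is_empty_row(row: Row, ignore: Iterable[str] = ()) -> bool:
    """Return True if every cell outside `ignore` is None or whitespace."""
    ignored = set(ignore)
    return all(
        not (value or "").strip()
        for key, value in row.items()
        if key not in ignored
    )


def split_empty_rows(
    rows: Iterable[Row], ignore: Iterable[str] = ()
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (kept, dropped) without modifying any cell.

    Dropped rows are returned rather than discarded, so the caller can
    decide whether to re-read them when a new tranche of data arrives.
    """
    ignored = tuple(ignore)
    kept: List[Row] = []
    dropped: List[Row] = []
    for row in rows:
        (dropped if is_empty_row(row, ignored) else kept).append(row)
    return kept, dropped
```

Because this only removes whole rows and never blanks individual cells, the cells of surviving rows reach the reducer unchanged, which would sidestep the "empty on the page or not filled in" ambiguity for those rows.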
This feature currently should only be removing rows from phase2 (though the "blanking out" behaviour also has consequences for reduction in phase1, of course). See also #35 for removing empty rows from phase1.