MLCP skips bad records without reporting when using splits

I noticed this recently and worked around it by removing bad rows (those whose column count does not match the header's column count) during pre-processing. That workaround does not cover other kinds of bad rows, so the problem can still occur.

Summary

When the input is split using the mlcp command-line split options, mlcp can silently lose records, depending on the split size.

This was observed while ingesting several large files (~1M records each) containing a small percentage of bad records, using various split sizes. The number of unaccounted-for records goes up or down as the split size changes, depending on whether a split boundary falls on a bad record.

Repro

Generate a large csv file which includes randomly broken rows, like this:

H1,H2
a,b
c,d
d,e,f   # column count mismatch
g,h,
etc..

Note: longer bad rows are better for reproducing the issue.
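
A minimal sketch of a generator for such a file, in case it helps with reproducing. The function name, the parameters (`bad_fraction`, `seed`) and the shape of the broken rows are my own assumptions for illustration; the only thing taken from the report is that bad rows have a column count that does not match the header and that longer bad rows reproduce better.

```python
import csv
import random


def generate_csv(path, n_rows, bad_fraction=0.01, seed=0):
    """Write a two-column CSV with randomly placed broken rows.

    Returns the number of broken rows written, so the expected
    skip count is known before running the import.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    bad = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["H1", "H2"])
        for i in range(n_rows):
            if rng.random() < bad_fraction:
                # Broken row: six columns instead of two, padded so the
                # row is long and more likely to straddle a split boundary.
                writer.writerow([f"bad{i}"] + ["x" * 20] * 5)
                bad += 1
            else:
                writer.writerow([f"a{i}", f"b{i}"])
    return bad
```

For example, `generate_csv("repro.csv", 1_000_000)` writes roughly 1% broken rows and returns the exact count.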

If a split boundary falls on a broken row, that row is lost without being reported. Changing the split size changes the number of rows lost this way. Without the split option, the bad rows are still skipped, but they are reported and every record is accounted for.

As a result, the ingested and skipped totals in the mlcp log do not add up to the actual number of records in the file. It can look as if everything was ingested successfully, because the dropped rows never appear in the log.
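
To get the actual counts to compare against, something like the following can be run on the input file. This is only a sketch: it assumes "bad" means a column count different from the header's, and `count_rows` is a hypothetical helper, not part of mlcp. The ingested and skipped totals still have to be read from the mlcp log by hand.

```python
import csv


def count_rows(path):
    """Return (good, bad) data-row counts for a CSV file.

    A row is counted as bad when its column count differs from the
    header's. Completely empty lines are ignored.
    """
    good = bad = 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # first line is the header
        for row in reader:
            if not row:  # csv.reader yields [] for blank lines
                continue
            if len(row) == len(header):
                good += 1
            else:
                bad += 1
    return good, bad
```

If everything is accounted for, `good` should equal the ingested total and `bad` the skipped total from the log; any shortfall is the number of silently dropped rows.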

This has been tested with several recent versions of mlcp.

marklogic / marklogic-contentpump

MLCP skips bad records without reporting when using splits #147
