This adds a script I have been using to spot-check my text cleaning implementations. Point it at the aligned dolma shards from two different versions (raw vs. v0, etc.) and use the buttons to step through the examples or jump to a specific one. Note: if documents are removed between the versions, the two sides will fall out of alignment. This could be fixed in the future by moving through the example IDs instead of a simple numerical index.
The data is shown in independently scrollable boxes to help align sections when there are heavy changes.
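A rough sketch of what the ID-based alignment could look like, not what the script currently does. It assumes each shard is a gzipped JSONL file whose documents carry an `id` field; the helper names `read_shard` and `align_by_id` and the file paths in the usage comment are hypothetical.

```python
import gzip
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def read_shard(path: str) -> Iterator[dict]:
    """Yield one document per non-empty line of a gzipped JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def align_by_id(
    old_docs: Iterable[dict], new_docs: Iterable[dict]
) -> List[Tuple[dict, Optional[dict]]]:
    """Pair documents that share an id, keeping the order of the old version.

    A document that was removed in the new version is paired with None,
    so later examples stay aligned.
    """
    new_by_id = {doc["id"]: doc for doc in new_docs}
    return [(doc, new_by_id.get(doc["id"])) for doc in old_docs]


# Hypothetical usage (paths are placeholders):
# pairs = align_by_id(read_shard("raw/shard.jsonl.gz"), read_shard("v0/shard.jsonl.gz"))
# old_doc, new_doc = pairs[index]  # new_doc is None if the document was dropped
```

With pairs built this way, the numerical index used by the buttons would point at a pair rather than at a position in each shard, so a dropped document shows up as an empty right-hand box instead of shifting everything after it.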
Example Screenshot