Closed: eu9ene closed this 6 months ago
I was a bit confused by the mention of `pigz` and `paste` even though you're using `--input=-`, but now I've found the script where you're calling this, and it makes sense.
I've not been able to replicate the issue on my side. Even with the downloaded ru CCMatrix data, it seems to run just fine here. I used the integrated gzip + `paste` path rather than stdin, but that's technically exactly the same.
I suspect one of the earlier steps to be the cause of the num_mismatch complaint. I've made `col.py`, which wraps the first two processes in the pipeline, even more strict about the number of columns in the input data. They'll now complain if the number of columns changes at any point during their runtime. (Previously they only complained if the number of columns was insufficient for their work.)
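For illustration, the stricter check could look roughly like this (a minimal sketch, not the actual `col.py`; the function name and tab-separated format are my assumptions):

```python
import io

def check_columns(stream, out):
    """Copy `stream` to `out`, failing if the tab-separated column count
    ever changes between lines (stricter than a minimum-count check)."""
    expected = None
    for lineno, line in enumerate(stream, start=1):
        ncols = line.rstrip("\n").count("\t") + 1
        if expected is None:
            expected = ncols  # the first line sets the expectation
        elif ncols != expected:
            raise ValueError(
                f"line {lineno}: expected {expected} columns, got {ncols}"
            )
        out.write(line)

# Consistent input passes through unchanged...
out = io.StringIO()
check_columns(io.StringIO("src\ttrg\nsrc2\ttrg2\n"), out)

# ...but a line with a different column count now fails immediately.
try:
    check_columns(io.StringIO("src\ttrg\nlonely\n"), io.StringIO())
except ValueError as e:
    print(e)  # line 2: expected 2 columns, got 1
```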
I don't see any obvious bug in `fix_quotes` or `remove_empty_lines`.
Could you try again with current main and see whether it now crashes earlier for you?
PS: Once you go into production, I'd suggest increasing the batch size parameter quite a bit to amortise the cost of starting all the pipeline processes every time. I'd also suggest lowering `--parallel` to roughly `2 * cpu_count() / filter_steps`, since in theory all filters can run in parallel, and you also want some CPU left for the `pigz` at the beginning and the end.
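Spelled out as arithmetic, that rule of thumb is just the following (`FILTER_STEPS` is a hypothetical placeholder for however many filters your pipeline config runs):

```python
import os

FILTER_STEPS = 5  # hypothetical: number of filters in your pipeline config

# All filters can run concurrently, and pigz at both ends needs CPU too,
# so cap --parallel well below cpu_count().
parallel = max(1, (2 * (os.cpu_count() or 1)) // FILTER_STEPS)
print(f"--parallel {parallel}")
```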
The alternative idea I have in mind is adding a `--validate` flag that basically wraps every process in a script that checks its input and output. Basically the same as `col.py`, but with a slightly different purpose. We should never use that in production, though.
With the latest main it failed with my default config with a new error: task cluster log
```
[task 2023-10-05T22:10:57.852Z] [1/10:fasttext_filter]   File "/usr/lib/python3/dist-packages/fasttext/FastText.py", line 98, in __init__
[task 2023-10-05T22:10:57.852Z] [1/10:fasttext_filter]     self.f.loadModel(model_path)
[task 2023-10-05T22:10:57.852Z] [1/10:fasttext_filter] ValueError: large.bin has wrong file format!
```
Maybe this was the original problem and we just didn't see it. What's weird is that other datasets are cleaned OK without any errors: task group
Then I switched the fastText filter to a small model and it looks like it's working: task cluster log. But it outputs a lot of lines like this:
```
[2794/2:remove_empty_lines] grep: (standard input): binary file matches
```
> I'd suggest increasing the batch size parameter quite a bit to amortise the cost of starting all the pipeline processes every time
I thought that since we process one dataset per job, and a dataset may not be that large, the batch size shouldn't be too big or we won't utilize all 32 cores.
> all filters can run in parallel in theory
Interesting, I thought it ran them sequentially. Then indeed the batch size can be increased (I see that the default is 1M). I'm not sure what the overhead of starting new processes is, though.
Anyway, it cleaned 139,937,785 lines of CCMatrix pretty quickly. Once we have proper charts for resource utilization, we'll be able to tune it better.
Huh, yeah, I wouldn't expect a filter that's way down the pipeline to cause an input error on a filter much earlier in the pipeline.
> I thought since we process one dataset per job and if the dataset is not large, the batch size shouldn't be too large to utilize 32 cores.
I need to document this better, but `batch_size` basically controls how many lines go into a chunk, which is then cleaned by one full cycle of the processing pipeline. It works a bit like GNU Parallel, which also starts and stops the wrapped process for each chunk. It's either that, or no guarantees on the order of the output. Which… now that I'm thinking about it, might not be such a bad guarantee to drop.
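A rough sketch of that chunking behaviour (not the actual OpusCleaner code; `cat` stands in for a real filter command):

```python
import itertools
import subprocess

def process_in_chunks(lines, batch_size, command):
    """Run `command` once per chunk of `batch_size` lines, in order.

    Restarting the wrapped process per chunk is what makes ordered output
    easy: chunk boundaries double as ordering boundaries, much like GNU
    Parallel's default behaviour.
    """
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, batch_size))
        if not chunk:
            break
        result = subprocess.run(
            command, input="".join(chunk), capture_output=True, text=True
        )
        yield result.stdout

# Three lines, batch_size=2: the command is started twice, output stays ordered.
chunks = list(process_in_chunks(["a\n", "b\n", "c\n"], 2, ["cat"]))
print(chunks)  # ['a\nb\n', 'c\n']
```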
> Interesting. I thought it runs them sequentially.
Well, yes and no. It basically starts exactly the same set of processes as `pigz -cd | filter1 | filter2 | pigz -c > out.gz` would in bash. But that first `pigz` can be decompressing into the buffer for `filter1` while `filter1` is also processing its buffer into `filter2`, etc. So if all processes took the same amount of processing time, they could all keep running all the time. In practice, you'll have one filter that holds up all the others, hence the `2 * cpu_count()` in my guesstimate.
> `[2794/2:remove_empty_lines] grep: (standard input): binary file matches`
That's not good. I'll add `--text` to that grep command, but… why would grep think the data is binary? Random trash? Or does your grep not know about Unicode? (Not that it should really matter, it's just looking for newlines, but still…)
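The behaviour is easy to reproduce with a NUL byte (just one convenient trigger of grep's binary heuristic; whatever bytes are in the actual data would behave similarly). With `--text`, grep prints the matching line instead of the "binary file matches" notice:

```python
import subprocess

# A NUL byte in the input makes GNU grep classify stdin as binary, so a
# plain `grep` would report "binary file matches" instead of the line.
data = b"hello\x00world\nclean line\n"

# With --text (-a), grep treats the input as text and prints the match as-is.
out = subprocess.run(
    ["grep", "--text", "hello"], input=data, capture_output=True
).stdout
print(out)  # b'hello\x00world\n'
```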
It's still unclear why fastText fails with `ValueError: large.bin has wrong file format!`. We use the `lid.176.bin` model in our legacy cleaning script, which corresponds to `large`, and maybe we should keep it this way. I don't know how big the difference between the `large` and `small` models is.
Did you use the large model for CCMatrix on ru-en in your test?
I'll open a new issue, since the original problem with num_mismatch was fixed.
The cleaning has been completed successfully on all other datasets and fails only on CCMatrix.
We started discussing potential solutions here. I think that if the original dataset is correct and we still hit this issue, the cleaner should handle it automatically. Otherwise, if the solution is to add an extra filter, we'll get random failures for some language pairs and datasets that require manual intervention every time.
OpusCleaner version: https://github.com/hplt-project/OpusCleaner/commit/90a27f1064dfa9c82b0396c9a2b59436cce99937
log:
config: