ufal / treex

Treex NLP framework

Strange random exception thrown by Treex::PML::Backend::PML #70

Open dan-zeman opened 6 years ago

dan-zeman commented 6 years ago

I have a large number (hundreds) of .treex.gz files that I process in two steps. Each step is a parallelized treex run on the ÚFAL cluster. The first step generates .treex.gz files, the second step reads them. Every now and then the reader in the second step crashes. I have observed it with various corpora; it is not tied to one particular dataset.

The exception says for one or more input files that there is extra content after the PML document end. Manual inspection of the files does not reveal anything unusual.

Re-running the first step (without changing settings or sources) sometimes helps. The error disappears but it strikes back again somewhere else some other time.

Re-running the second step without re-running the first step did not help (I let it retry 11 times, then I killed it), so the random error seems to be connected to writing rather than reading.

I looked up the name of the file that could not be read, and I tried just reading it, locally (no cluster), without anything else in the scenario. Worked. I tried the full scenario on the cluster, but just with this one file. Crashed. Re-tried the same thing a second time. Worked. Huh. Ran the same scenario for all 874 files. Crashed. Retried 11 times, always crashed (sometimes on that file that I had tried to single out).

I gunzipped all input files (but did not re-run step 1), then re-ran the scenario on the cluster. It worked. Just one experiment is too little evidence, but I now suspect that the bug may be related to reading/writing gzipped files from within Perl. (Gunzip itself did not complain about the files, though.)
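For reference, a quick way to sanity-check the compressed files themselves without decompressing them in place (this is roughly what "gunzip did not complain" verifies; `gzip -t` only checks the stream structure and CRC):

```shell
# Sketch: test gzip integrity of all inputs; report any file whose
# compressed stream is structurally broken.
for f in /net/work/people/zeman/hamledt-data/ar/treex/01/*/*.treex.gz; do
  gzip -t "$f" || echo "corrupt: $f"
done
```

Note that a file can pass this check and still trigger the PML parser error, since the "extra content" may be valid gzip data appended after the document end.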

```
$ which perl
/net/work/projects/perlbrew/Ubuntu/14.04/x86_64/perls/perl-5.18.2/bin/perl
$ whichpm PerlIO::via::gzip
/net/work/projects/perlbrew/Ubuntu/14.04/x86_64/perls/perl-5.18.2/lib/site_perl/5.18.2/PerlIO/via/gzip.pm 0.021
```
dan-zeman commented 6 years ago

Just found out that cpanm installs version 0.03 of `PerlIO::via::gzip`. Let's see whether it affects the error in any way.

dan-zeman commented 6 years ago

The error persists with the newer `PerlIO::via::gzip`.

```
TREEX-INFO:     4.099:  Parallelized execution. This process is one of the worker nodes, jobindex==20
TREEX-INFO:     4.177:  Loading block Treex::Block::Util::SetGlobal language=ar (1/6)
TREEX-INFO:     4.210:  Loading block Treex::Block::Read::Treex from=!/net/work/people/zeman/hamledt-data/ar/treex/01/{train,dev,test}/*.treex.gz (2/6)
TREEX-INFO:     4.478:  Loading block Treex::Block::A2A::CopyAtree source_selector= selector=prague (3/6)
TREEX-INFO:     4.503:  Loading block Treex::Block::HamleDT::Udep  (4/6)
TREEX-INFO:     4.657:  Loading block Treex::Block::Write::CoNLLU print_zone_id=0 substitute={treex/01}{conllu} compress=1 (5/6)
TREEX-INFO:     4.710:  Loading block Treex::Block::Write::Treex substitute={conllu}{treex/02} compress=1 (6/6)
TREEX-INFO:     4.720:  ALL BLOCKS SUCCESSFULLY LOADED.
TREEX-INFO:     4.721:  Loading the scenario took 0 seconds
TREEX-INFO:     4.733:  Applying process_start
Error occured while reading '/net/work/people/zeman/hamledt-data/ar/treex/01/train/AFP_ARB_20000815.0001.treex.gz' using backend Treex::PML::Backend::PML:
file:///net/work/people/zeman/hamledt-data/ar/treex/01/train/AFP_ARB_20000815.0001.treex.gz:11237: parser error : Extra content at the end of the document
```
martinpopel commented 6 years ago

If the error is on random files and you just need the work done without solving the problem, there is the --skip_finished option, which makes re-running the whole experiment easier.

I see the bug is non-deterministic, but it would still be nice to have a minimal test, ideally committed in a branch of this repo. One test may contain the wrong file and just do the reading (which should fail deterministically). Another test may do the reading+writing in a loop over 1..100, so there is a higher chance the error shows up. Without the test, I cannot work on this. But maybe it will be easier for you to solve the problem than to write a test :-):

If the problem is with writing, you can try switching from `open(my $gzip_fh, "| gzip -c > $filename.gz")` here to using `IO::Compress::Gzip`.

If you suspect `PerlIO::via::gzip`, you can try switching to `PerlIO::gzip` as I did in Udapi. Theoretically, it is just a matter of deleting the "via::" part and changing `"<:via(gzip)"` to `"<:gzip"`. Compare Treex and Udapi.

dan-zeman commented 6 years ago

I wanted to create a minimal test but the failing file does not fail deterministically.

Thanks for the pointers to the other gzip options. I will try them when I have time.

dan-zeman commented 6 years ago

I have not seen the error when processing the files sequentially outside the cluster, and I have not seen the error when processing uncompressed data (based on several experiments now).

I have seen it sometimes with gzipped treex data processed in parallel on several cluster machines.

martinpopel commented 6 years ago

I see. Writing tests for parallel cluster processing is a bit trickier, but still possible. Have you checked whether it is always the same machine that produced the faulty treex.gz file? Maybe a different version of gzip is installed there. Anyway, if that is the case, using `IO::Compress::Gzip` should solve it.
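Checking the gzip-version hypothesis could be as simple as having each worker job log a line like this (a sketch; where to hook it into the job wrapper is up to you):

```shell
# Hypothetical sketch: record which host and which gzip build each worker
# uses, so a faulty output file can later be correlated with a machine.
echo "host=$(uname -n) $(gzip --version | head -n 1)"
```

Grepping these lines out of the job logs would quickly show whether the faulty files cluster on particular hosts or gzip versions.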

dan-zeman commented 6 years ago

Not sure whether it's always the same (set of) machine(s). It takes some effort to dig this information out of the logs. But in one case I identified the machine from which the error message came (lucifer6), logged into it and processed the entire batch of presumably wrong files sequentially there. No error.