Closed TobiasNx closed 1 month ago
After fixing #525, some strange behavior occurs when using the combination of batch-reset AND encode-marcxml:
With encode-marcxml it creates an additional empty record with every single record.
See: https://gitlab.com/oersi/oersi-marc/-/commit/2abe46321aefc650fe85e5a88a43eba0add3b649
It must have to do with the combination of the batch-reset error AND the changes made for encode-marcxml.
With Metafix Runner 1.1.1
directory
| read-dir
| open-file
| as-records
| decode-json
| fix(FLUX_DIR + "oersiToMarc.fix", *)
| encode-marcxml
| write(FLUX_DIR + "test-output-${i}.xml")
;
I think two issues are mixed here, and one of them is no issue at all (s. a) & b)).
a) IMO it's ok that a new file is opened when the stream is explicitly reset (it's stated exactly like this in https://github.com/metafacture/metafacture-core/blob/f5cc9dc25155ea0d1664e262f199f3d5f97c1316/metafacture-io/src/test/java/org/metafacture/io/ObjectFileWriterTest.java#L103).
b) You use batch-reset in combination with write-files to determine how many records should be in one file before another file is created (which is named by incrementing the number i). If you use batchsize=1, a) always happens, i.e. the last file will be empty. If $countOfRecordsInInput modulo $batchsize > 0, you will not have an empty file at the end (s. https://github.com/metafacture/metafacture-core/blob/f5cc9dc25155ea0d1664e262f199f3d5f97c1316/metafacture-flowcontrol/src/main/java/org/metafacture/flowcontrol/StreamBatchResetter.java#L90 ).
c) It was not https://github.com/metafacture/metafacture-core/pull/532/files that changed the behaviour. It does indeed have something to do with encode-marcxml: it's a consequence of https://github.com/metafacture/metafacture-core/issues/527, where a wrapper is used that also calls resetStream(), and thus two files were created every time the stream was reset. Fixed with 04a63126ad2344c50c6959e6239d7a813d48fb85.
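To illustrate c), here is a minimal Python sketch, not Metafacture code. It assumes a writer that opens a new file on every reset and a wrapper that passes each reset on twice. All class and function names (FileWriterModel, DoubleResetWrapper, run) are hypothetical and only model the behaviour described above.

```python
class FileWriterModel:
    """Toy stand-in for a writer that starts a new output file on every reset."""

    def __init__(self):
        self.files = [[]]  # the first file is open from the start

    def process(self, record):
        self.files[-1].append(record)

    def reset_stream(self):
        self.files.append([])  # a reset always opens a new (still empty) file


class DoubleResetWrapper:
    """Hypothetical wrapper that forwards every reset twice (the faulty case)."""

    def __init__(self, receiver):
        self.receiver = receiver

    def process(self, record):
        self.receiver.process(record)

    def reset_stream(self):
        self.receiver.reset_stream()  # reset triggered by the wrapper itself
        self.receiver.reset_stream()  # the same reset passed on a second time


def run(records, batch_size, receiver):
    """Feed records to the receiver and reset after every batch_size-th record."""
    for count, record in enumerate(records, start=1):
        receiver.process(record)
        if count % batch_size == 0:
            receiver.reset_stream()
```

With batchsize=1 and three records, the plain writer in this model ends up with three filled files plus one empty file at the end, while the wrapped writer gets an extra empty file after every single record, which is the symptom reported above.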
I've also discovered that when an empty process is triggered in ObjectFileWriter a linebreak was nonetheless inserted. So there were sometimes (modulo!) files with one byte (the line break). This is fixed in 16b9349a42eb3e142a2c8c71ccee1e75cf9deff4.
Another bug surfaced: if a record was empty a footer was written. This is fixed in 0ea6d233c0c4f65071b7abb89857067ed4eba2e4.
A good way to test this is the following flux (which is useful with CLI, not for playground because of the written files which you cannot see in that web app):
"http://lobid.org/download/marcXml-8-records.xml"
| open-http(accept="application/xml")
| decode-xml
| handle-marcxml
| batch-reset(batchsize="5")
| encode-marcxml
| write("test-output-${i}.xml")
;
it's a consequence of #527 where a wrapper is used
You mean #524, right? The wrapper was introduced in #539, not in #538.
@dr0i I tested it with: https://github.com/TobiasNx/notWorkingFlux/blob/f8c3f5167e0ab254e7efd8498785ba1036cef822/batchResetEmptyFiles/batchResetEmptyFile.flux
See: https://github.com/TobiasNx/notWorkingFlux/commit/f8c3f5167e0ab254e7efd8498785ba1036cef822
The additional empty records are not processed anymore. Which is good :) But I am not sure if I understood your comment correctly.
a) -> empty record at the end is created because of the resetting, right? So the empty record at the end is intentional?
b) What is the difference between write and write-file?
Anyway, choosing batchsize > 0 is no guarantee that you won't get an empty file at the end. You always get one if the number of incoming records is divisible by the batchsize. In my example commit above you see that 10 records go in and the batchsize is 5. So batchsize is > 1. But we still get an empty record at the end because 10 is divisible by 5, so it creates 2 files with 5 records each and an empty one at the end. (Your test scenario had 8 records going in with a batchsize of 5; 8 is not divisible by 5, therefore no empty record.)
Not solved as stated.
c) is fixed which is great.
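The arithmetic discussed in b) can be checked with a small Python sketch. It is only a model under the assumption that a reset after every batchsize-th record opens a new file, whether or not more records follow. The function names are hypothetical and not part of Metafacture.

```python
def files_written(record_count, batch_size):
    """Return the number of records that end up in each output file.

    Assumption: the first file is open from the start, and a reset after
    every batch_size-th record opens a new file, even if no record follows.
    """
    files = [0]
    for count in range(1, record_count + 1):
        files[-1] += 1
        if count % batch_size == 0:
            files.append(0)  # reset: a new, still empty file is opened
    return files


def has_trailing_empty_file(record_count, batch_size):
    """True if the last file stays empty, i.e. record_count % batch_size == 0."""
    return files_written(record_count, batch_size)[-1] == 0
```

In this model, 10 records with a batchsize of 5 give two files with 5 records each plus an empty third file, while 8 records with a batchsize of 5 give files with 5 and 3 records and no empty file.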
I've also discovered that when an empty process is triggered in ObjectFileWriter a linebreak was nonetheless inserted. So there were sometimes (modulo!) files with one byte (the line break). This is fixed in 16b9349.
Looks good. There seems to be a linebreak between records too. That is not new, but I don't know if we need it. It does not create any problems, therefore no priority: https://github.com/TobiasNx/notWorkingFlux/blob/f8c3f5167e0ab254e7efd8498785ba1036cef822/batchResetEmptyFiles/output/test-output-0.xml#L215
Another bug surfaced: if a record was empty a footer was written. This is fixed in 0ea6d23.
Looks good, the footer in the empty record is now removed.
a) -> empty record at the end is created because of the resetting, right? So the empty record at the end is intentional?
No. A new file is intentionally opened. If it's filled with data or not depends on the data, the resetsize etc.
b) ...Not solved as stated....
Sure it is. What you said in your own words is what mod does. And this is intentional.
c) There seems to be a linebreak between records too. That is not new but I dont know if we need it too?
you may want to open a new issue
See: https://github.com/TobiasNx/metafacture_workflows/commit/64ba3e8fdb57c79c4888b4ec526f7aa4a427678c
Running the workflow with Metafix-Runner 1.0.0
When using the flux module batch-reset and setting the batch size to "1" (| batch-reset(batchsize="1")), MF creates an empty record after the last transformation. batch-reset should not output empty records.
e.g.:
or