keeps / dbptk-ui

DBPTK base UI for both Desktop and Enterprise
https://database-preservation.com
GNU Lesser General Public License v3.0

5GB xml with timestamps crashes indexing process #310

Open Laurira opened 2 years ago

We received a large database as a SIARD file generated with DBPTK Developer.

  1. I loaded it into DBPTK Enterprise; everything worked fine.
  2. I validated it; there were no serious errors.
  3. I clicked the "Browse" button and indexing started.
  4. At 67% it crashed with the following error:
    RESTException: Remote exeption cause by GenericException: Could not convert the database.
    caused by IllegalArgumentException: Invalid format: "2019%02-15T07:27:21.858000Z" is malformed at "%02-15T07:27:21.858000Z".
    2022-02-15 21:44:18.970 ERROR 1 --- [nio-8080-exec-4] c.d.m.siard.in.content.SAXErrorHandler   : line: 2; column: 1348776609; cvc-pattern-valid: Value '2019%02-15T07:27:21.858000Z' is not facet-valid with respect to pattern '\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d*)Z?' for type 'dateTimeType'.
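
For reference, the pattern from the log can be checked against the reported value in isolation. This is a minimal Python sketch, assuming the pattern is applied as a full match (as an XSD pattern facet is). The second value is a hypothetical corrected timestamp, not one taken from the file.

```python
import re

# Pattern copied verbatim from the SAXErrorHandler message above.
DATETIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d*)Z?")


def is_valid_datetime(value: str) -> bool:
    """Return True if the whole value matches the dateTimeType pattern."""
    return DATETIME_PATTERN.fullmatch(value) is not None


reported = "2019%02-15T07:27:21.858000Z"   # value from the error message
corrected = "2019-02-15T07:27:21.858000Z"  # hypothetical: "%" replaced by "-"

print(is_valid_datetime(reported))   # False
print(is_valid_datetime(corrected))  # True
```

So the reported value differs from a valid timestamp by a single character, which is what the validator is complaining about.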

So at first glance it seems that the database contains malformed data, with a "%" symbol in a date field. This is the column definition in header.xml:

                      <column>
                          <name>modified_date</name>
                          <type>TIMESTAMP</type>
                          <typeOriginal>timestamp</typeOriginal>
                          <nullable>true</nullable>
                      </column>

When I changed this <type>TIMESTAMP</type> to <type>CHARACTER VARYING(99)</type>, everything worked: the Solr indexing process (the Browse functionality) ran to completion and everything was searchable. However, searching for the value "2019%02-15T07:27:21.858000Z" returned nothing.

Next I tried to look up the same value directly in siard/content.../table52.xml

It turned out that this is the largest XML file in the SIARD package, at 5 GB. After I split it into multiple pieces it was searchable, but the value "2019%02-15T07:27:21.858000Z" could not be found there either.
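
A streaming scan might be a more reliable way to confirm that the file itself contains no malformed timestamp, since it reads the 5 GB file once without loading it into memory and without splitting it. The following is only a sketch: the function name find_bad_timestamps is made up, and it assumes that rows are <row> elements directly under the root and that the timestamp column is a child element such as <c2> (the real column element name for modified_date has to be looked up).

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as in the validator's error message.
DATETIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d*)Z?")


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}row' down to 'row'."""
    return tag.rsplit("}", 1)[-1]


def find_bad_timestamps(source, column_tag, row_tag="row"):
    """Return (row_index, text) for every non-empty value in column_tag
    that does not match the pattern.

    source is a file name or a binary file object. row_tag and column_tag
    are assumptions about the table XML layout, not verified names.
    """
    bad = []
    row_index = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == column_tag:
            text = (elem.text or "").strip()
            if text and not DATETIME_PATTERN.fullmatch(text):
                bad.append((row_index, text))
        elif name == row_tag:
            row_index += 1
            root.clear()  # drop finished rows so memory use stays flat
    return bad
```

It could be called as find_bad_timestamps("table52.xml", "c2"), where "c2" is a placeholder for the actual column element. If the result is an empty list, the malformed value is probably not in the file and is introduced somewhere during reading.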

So my only guess is that large XML files containing TIMESTAMP values break the Solr indexing process and produce this misleading error.