humlab-sead / sead_clearinghouse_import

Temporary import system for SEAD Clearinghouse
0 stars 0 forks source link

Upload of large XML file fails. #27

Closed roger-mahler closed 7 months ago

roger-mahler commented 8 months ago

The clearing house import system fails to process large XML. The processing generates a system exception caused by exceeding system resources (SSL SYSCALL error: EOF detected).

The current solution is as follows:

  1. Generate XML from input (Excel) file.
  2. Register submission in clearinghouse which uploads the XML file to the database (xml data type).
  3. Parse and transfer tables and column schema, and data rows/records and values to staging (parsed) tables.
  4. Explode data in staging to clearing house SEAD RDB schema (mirrors SEAD public schema, plus extra attributes)
  5. Commit exploded data in clearing house to SEAD public schema.

Step 3. uses PostgreSQL XML features to parse the XML into tabular data which is stored in generic staging tables. The data from the various target tables is stored in a generic key-value tables.

Step 3. fails for large XML data (several hundred Mb) . The parsing of the data needs to be implemented in a more streaming fashion outside of the database. The improved (alternative) solution is:

  1. Generate XML from input (Excel) file.
  2. Register submission in clearinghouse without uploading the XML file to the database (xml data type).
  3. Convert XML to four tabular CVS files (table, column, record and value data).
  4. Upload CSV to four intermediate tables "temp_submission_upload_xyz".
  5. Parse and transfer tables and column schema, and data rows/records and values to staging (parsed) tables.
  6. Explode data in staging to clearing house SEAD RDB schema (mirrors SEAD public schema, plus extra attributes)
  7. Commit exploded data in clearing house to SEAD public schema.
roger-mahler commented 8 months ago

See https://github.com/humlab-sead/sead_change_control/issues/230.