piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

S3 SinglepartWriter writes on exception when garbage collected #819

Open donsokolone opened 2 months ago

donsokolone commented 2 months ago

Problem description

When an unhandled exception is raised inside the context of a SinglepartWriter, a side effect occurs once the writer is garbage-collected: close() runs anyway, which results in an unwanted write of a partial file to S3.

2024-04-20T06:02:19.817140Z [debug    ] Parsed JSON record             aws_request_id=00000000-0000-0000-0000-000000000000 instance_id=cf450c4a-21b7-452b-ad79-291cb87b11ab records_count=24 target_uri=s3://vf-localstack-nora-pii-data-retention/anonymized/vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz trace_id=00000000-0000-0000-0000-000000000000 uri=s3://vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz
Traceback (most recent call last):
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lambda/anonymizer-s3/local.py", line 43, in <module>
    output = handler(payload, SimpleNamespace(aws_request_id=trace_id))
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lambda/anonymizer-s3/src/anonymizer_s3/app.py", line 31, in handler
    anonymize(settings, di_container)(inbound_payload)
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lib/piilib/src/piilib/s3/anonymizer/anonymize.py", line 127, in _
    _, lookups_hits = dispatch(task)
  File "/Users/tsokolowski/.pyenv/versions/3.9.18/lib/python3.9/functools.py", line 888, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lib/piilib/src/piilib/s3/anonymizer/files/json_file.py", line 247, in _
    for raw_record_in, record_delimiter in json_parse(fin):
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lib/piilib/src/piilib/s3/files/json_file.py", line 71, in json_parse
    buff = io.StringIO(old_buff.read())
KeyboardInterrupt
2024-04-20T06:02:20.791680Z [debug    ] smart_open.s3.SinglepartWriter('vf-localstack-nora-pii-data-retention', 'anonymized/vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz'): direct upload finished [smart_open.s3] target_uri=s3://vf-localstack-nora-pii-data-retention/anonymized/vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz uri=s3://vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz

The reason for this behaviour is that SinglepartWriter inherits from io.BufferedIOBase, whose __del__() finalizer invokes close().
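To illustrate the mechanism, here is a minimal sketch on CPython. It uses a toy class, not smart_open's actual code: `ToyWriter`, `uploads` and `failing_job` are hypothetical names, and the no-op `terminate()` is an assumption about how the writer behaves on error. The list `uploads` stands in for the S3 bucket.

```python
import gc
import io

uploads = []  # hypothetical stand-in for the S3 bucket


class ToyWriter(io.BufferedIOBase):
    """Toy model of a buffer-then-upload writer; not smart_open's code."""

    def __init__(self):
        self._buf = io.BytesIO()

    def write(self, b):
        return self._buf.write(b)

    def terminate(self):
        # Assumed no-op: the buffer stays alive after an error.
        pass

    def close(self):
        if self._buf is None:
            return
        uploads.append(self._buf.getvalue())  # the "upload"
        self._buf = None

    def __exit__(self, exc_type, exc_val, exc_tb):
        # On error only terminate() is called, never close().
        if exc_type is not None:
            self.terminate()
        else:
            self.close()


def failing_job():
    with ToyWriter() as fout:
        fout.write(b"partial data")
        raise RuntimeError("simulated failure")


try:
    failing_job()
except RuntimeError:
    pass

# The writer is now unreachable; the finalizer inherited from io.IOBase
# calls close(), which "uploads" the partial buffer.
gc.collect()
print(uploads)
```

Although `__exit__` never calls `close()` on the error path, the partial data still ends up in `uploads`, because the inherited finalizer calls `close()` when the object is collected.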

Steps/code to reproduce the problem

Versions

macOS-14.2.1-x86_64-i386-64bit
Python 3.9.18 (main, Nov 30 2023, 12:53:32)
[Clang 15.0.0 (clang-1500.0.40.1)]
smart_open 7.0.4

ddelange commented 2 months ago

hi @donsokolone :wave:

how about setting self._buf = None in terminate? then close is a no-op by the time the SinglepartWriter is garbage collected, analogous to MultipartWriter.
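A rough sketch of that idea on a toy class (not the actual smart_open implementation; `PatchedToyWriter`, `uploads` and `job` are hypothetical names, and the list stands in for S3): `terminate()` drops the buffer, so the `close()` triggered at garbage collection finds nothing to upload.

```python
import gc
import io

uploads = []  # hypothetical stand-in for the S3 bucket


class PatchedToyWriter(io.BufferedIOBase):
    """Toy writer whose terminate() discards the buffer; not smart_open's code."""

    def __init__(self):
        self._buf = io.BytesIO()

    def write(self, b):
        return self._buf.write(b)

    def terminate(self):
        # Drop the buffer so a later close() has nothing to upload.
        self._buf = None

    def close(self):
        if self._buf is None:
            return  # already closed or terminated: no-op
        uploads.append(self._buf.getvalue())  # the "upload"
        self._buf = None

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None:
            self.terminate()
        else:
            self.close()


def job(fail):
    with PatchedToyWriter() as fout:
        fout.write(b"partial data" if fail else b"complete data")
        if fail:
            raise RuntimeError("simulated failure")


try:
    job(fail=True)  # error path: buffer is discarded
except RuntimeError:
    pass
gc.collect()

job(fail=False)  # success path: buffer is uploaded exactly once
gc.collect()
print(uploads)
```

Only the successful write reaches `uploads`; the failed one is dropped in `terminate()`, and the second `close()` issued by the finalizer after a successful write is also a no-op.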

cc @mpenkov

donsokolone commented 2 months ago

@ddelange This is exactly what the fix should be, as I mentioned in #763. I will open a PR in a few moments.