Closed: elipe17 closed this issue 3 months ago
Still attempting to reproduce, pending office hours
During office hours on Fri 16, we did the following investigation:
ACTIONS:

scripts:
# Extract S3 creds. First, log in to the target env.
export SERVICE_INSTANCE_NAME=tdp-datafiles-{environment} # environment=prod, staging, dev
export KEY_NAME=mo-dev-df
# cf create-service-key "${SERVICE_INSTANCE_NAME}" "${KEY_NAME}"  # run once if the key does not already exist
export S3_CREDENTIALS=$(cf service-key "${SERVICE_INSTANCE_NAME}" "${KEY_NAME}" | tail -n +2)
export AWS_ACCESS_KEY_ID=$(echo "${S3_CREDENTIALS}" | jq -r '.access_key_id')
export AWS_SECRET_ACCESS_KEY=$(echo "${S3_CREDENTIALS}" | jq -r '.secret_access_key')
export BUCKET_NAME=$(echo "${S3_CREDENTIALS}" | jq -r '.bucket')
export AWS_DEFAULT_REGION=$(echo "${S3_CREDENTIALS}" | jq -r '.region')
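To sanity-check the extracted credentials before running anything else, something like the following (a quick sketch; assumes the exports above succeeded) confirms the key resolves and gives a rough object count:

# confirm the credentials are valid
aws sts get-caller-identity
# a recursive listing with --summarize ends with "Total Objects" / "Total Size"
aws s3 ls "s3://${BUCKET_NAME}" --recursive --summarize | tail -n 2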
# S3 command examples
aws s3api list-object-versions --bucket {bucket_name} > objects_versions_staging.txt
aws s3api list-objects --bucket {bucket_name} --query 'Contents[].{Key: Key}'
aws s3api get-object --bucket {bucket_name} --key "tdp-backend-prod/data_files/2023/Q1/37/Aggregate Data/Section_3_Q1FFY2023_text.txt" Section_3_Q1FFY2023_text.txt
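To probe whether a specific key still exists without downloading it, head-object is a cheaper check; it errors with a 404 when the key is gone, which matches the "invalid key" behavior seen in the DAC:

# returns object metadata if the key exists, a 404-style error otherwise
aws s3api head-object --bucket {bucket_name} --key "tdp-backend-prod/data_files/2023/Q1/37/Aggregate Data/Section_3_Q1FFY2023_text.txt"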
Tested in develop, staging, and prod:
develop: lots of files missing, no discernible pattern
staging: very few files missing, no discernible pattern
prod: 16 files missing; most look to be sequential/concurrent submissions (see the key-diff sketch below)
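A rough sketch of how the prod gaps can be enumerated; dac_keys.txt is a hypothetical export of the keys the DAC has on record (one per line, sorted):

# dump the keys S3 actually has, sorted
aws s3api list-objects-v2 --bucket {bucket_name} --query 'Contents[].Key' --output text | tr '\t' '\n' | sort > s3_keys.txt
# keys the DAC knows about but S3 no longer has
comm -23 dac_keys.txt s3_keys.txt > missing_in_s3.txt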
NOTE:
We could partially replicate the issue by killing the S3 connection locally in the middle of a file upload: the file is partially uploaded to S3, the DataFile object is created in the backend but stuck in "Pending", and parsing fails to pull the file.
We were not able to find corresponding logs in prod for the pending/failed datafile; parsing seems to have completed successfully when it ran (file id 4282, 6-26-2024 21:16:52).
The initial upload and parsing succeeded; for some reason, datafiles seem to be deleted from S3 at some point after parsing. Versioning does not seem to be related, nor does reparsing. We also have no visibility into when the datafile disappeared.
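For reference, had versioning been enabled with delete markers to inspect, a query along these lines (prefix is illustrative) would have pinpointed when each object disappeared:

# delete markers record the key and the time the delete happened
aws s3api list-object-versions --bucket {bucket_name} --prefix "tdp-backend-prod/data_files/" --query 'DeleteMarkers[].{Key: Key, LastModified: LastModified}'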
boto3 and botocore appear to handle retries/timeouts gracefully and do not allow the DataFile to be created in the backend during a failure.
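As an aside, botocore's retry behavior is configurable; the same retry machinery can be exercised from the CLI through environment variables (values below are illustrative, not what the backend uses):

# "standard" mode retries transient/throttling errors with backoff
export AWS_RETRY_MODE=standard
export AWS_MAX_ATTEMPTS=5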
For these reasons, we don't have a way to identify the root cause or a fix. We can recommend OFA introduce redundancy in file uploads (store the file separately in another location, re-enable the Titan SFTP upload, keep a copy of the S3 bucket, etc.).
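If OFA does pursue redundancy, the simplest S3-side option is probably a scheduled one-way mirror to a second bucket; a minimal sketch, assuming a hypothetical tdp-datafiles-backup bucket reachable with the same credentials:

# without --delete, objects removed from the source remain recoverable in the mirror
aws s3 sync "s3://${BUCKET_NAME}" "s3://tdp-datafiles-backup"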
Not pursuing redundancies; spinning up new tickets to pivot the work. Given the new direction, the team has deemed this ticket closable (8/23 cross-product sync). @ADPennington
Thank you for taking the time to let us know about the issue you found. The basic rule for bug reporting is that something isn't working the way one would expect it to work. Please provide us with the information requested below and we will look at it as soon as we are able.
Description
The number of files listed in the DAC does not match the number of files in S3. For large chunks of datafiles in the admin console, the download link does not work: S3 returns an "invalid key" error, indicating either that the path the DAC has saved for the datafile is invalid or that the file no longer exists in S3. Based on preliminary testing, it seems to be the latter: the files are no longer in S3.
Action Taken
Log into the DAC in develop/staging/prod and start trying to download some of the files
Eventually one will fail
Also used the aws cli to list all files in our buckets and confirmed that the count of files in the S3 bucket does not match the count of files in the DAC.

What I expected to see

What I did see
Other Helpful Information
The files below are lists of the files that S3 reports it is tracking for each environment. The last line in each file indicates how many files the DAC for that environment thinks there are.
prod_files.txt
staging_files.txt
develop_files.txt