trustification / trustify

At ingestion, which should come first: the load or the store? #801

Open jcrossley3 opened 1 month ago

jcrossley3 commented 1 month ago

Currently, we store the doc's bytes before we load its content into the database...

https://github.com/trustification/trustify/blob/c55d1598035cef3df1f979acf01c50a174129f4a/modules/ingestor/src/service/mod.rs#L171-L179

But if the load fails, I think the bytes will remain stored, with no API to delete them.

Assuming the load is a single transaction -- all writes rolled back if any fail -- does it make more sense to load first? In that case, if the subsequent store fails, at least there's an API to delete the entities we just created.
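
(For illustration only, a rough sketch of the current order -- the helper names below are hypothetical stand-ins, not the actual ingestor code linked above:)

// Hypothetical outline of today's store-then-load order; the names below are
// made-up stand-ins, not the real ingestor service.
async fn ingest_today(digest: &str, bytes: &[u8]) -> Result<(), String> {
    store_bytes(digest, bytes).await?; // bytes now live in S3/FS
    load_into_db(bytes).await?;        // if this fails, the stored bytes are
                                       // orphaned, with no API to delete them
    Ok(())
}

// Stand-ins so the sketch is self-contained.
async fn store_bytes(_digest: &str, _bytes: &[u8]) -> Result<(), String> { Ok(()) }
async fn load_into_db(_bytes: &[u8]) -> Result<(), String> { Ok(()) }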

ctron commented 1 month ago

Loading first might leave us with data ingested, but no source document.

Storing first might leave us with a dead S3 (FS) entry and no DB record of it, so we would never serve it. A second upload would overwrite that file and load the data.

I believe the latter (storing first) is the more consistent approach.

jcrossley3 commented 1 month ago

A second upload would overwrite that file, and load the data.

Not if it was something about the file that failed the load. Correcting the invalid file and reingesting would result in a different digest and the bytes from the first upload would remain indefinitely.

jcrossley3 commented 1 month ago

Not to mention, we're making it very easy for a bad actor to run up our S3 bill by posting large, invalid docs.

ctron commented 1 month ago

A second upload would overwrite that file, and load the data.

Not if it was something about the file that failed the load. Correcting the invalid file and reingesting would result in a different digest and the bytes from the first upload would remain indefinitely.

Correct. In that case it would never succeed. But in the case it was a temporary DB issue, it would.

ctron commented 1 month ago

Not to mention, we're making it very easy for a bad actor to run up our S3 bill by posting large, invalid docs.

Invalid docs would never show up. Contrast that with storing later, where we might end up with valid docs recorded in the DB but missing from S3/FS.

Of course we could delay the commit until after the store, but that leads to the same situation: a document stored on the FS (or S3) which we don't have in the database. So that wouldn't really solve the issue IMO.

If you want to clean up failed upload attempts in S3/FS, then it might make sense to create an upload marker store (db or elsewhere), recording the digest and a timestamp. Cleaning up after X hours/days.

jcrossley3 commented 1 month ago

Not to mention, we're making it very easy for a bad actor to run up our S3 bill by posting large, invalid docs.

Invalid docs would never show up.

You lost me. If the validation occurs during the load, and I store before I load, I risk storing invalid data, don't I?

ctron commented 1 month ago

You lost me. If the validation occurs during the load, and I store before I load, I risk storing invalid data, don't I?

Data which is never accessible.

jcrossley3 commented 1 month ago

But it would exist in S3, whether accessible through our app or not. Hence, vulnerable to bad actors.

ctron commented 1 month ago

But it would exist in S3, whether accessible through our app or not. Hence, vulnerable to bad actors.

I would consider having direct S3 access bad already. I am not sure what the exact issue here is.

jcrossley3 commented 1 month ago

I am not sure what the exact issue here is.

Then one or more of us sucks at communication.

If you configure trustd with your S3 creds, and continually POST invalid docs to /api/v1/advisory or /api/v1/sbom, I expect you'll eventually max out your S3 capacity, which will be expensive for you.

ctron commented 1 month ago

If you configure trustd with your S3 creds, and continually POST invalid docs to /api/v1/advisory or /api/v1/sbom, I expect you'll eventually max out your S3 capacity, which will be expensive for you.

Thus my proposal to implement something which allows us to figure out, at a later time, which documents are stale:

If you want to clean up failed upload attempts in S3/FS, then it might make sense to create an upload marker store (db or elsewhere), recording the digest and a timestamp. Cleaning up after X hours/days.

Have the following steps (only executing the next step if all previous ones have been successful):

  1. Create an upload marker (with digest)
  2. Store in FS/S3
  3. Ingest into DB
  4. Commit to DB
  5. Remove the marker

The failure modes are:

  If 1. fails, nothing happens.
  If 2. or 3. fails, there is a dangling marker and nothing is in the DB.
  If 4. or 5. fails, there is a dangling marker and the data is in the DB.

Periodically, all dangling and timed-out documents get deleted, but only if they aren't found in the DB. Using the markers takes the stress off both the DB and the FS/S3, as only affected entries get checked, reducing S3 API calls (and thus cost).
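
(A rough sketch of what that flow could look like -- everything below, from the marker helpers to the sweep, is hypothetical and only illustrates the ordering and the cleanup rule:)

use std::time::{Duration, SystemTime};

// Hypothetical stand-ins; none of these exist in trustify.
async fn create_marker(_digest: &str) -> Result<(), String> { Ok(()) }              // step 1
async fn store_bytes(_digest: &str, _bytes: &[u8]) -> Result<(), String> { Ok(()) } // step 2
async fn load_and_commit(_bytes: &[u8]) -> Result<(), String> { Ok(()) }            // steps 3 + 4
async fn remove_marker(_digest: &str) -> Result<(), String> { Ok(()) }              // step 5
async fn digest_in_db(_digest: &str) -> bool { false }
async fn delete_stored_bytes(_digest: &str) -> Result<(), String> { Ok(()) }

async fn ingest(digest: &str, bytes: &[u8]) -> Result<(), String> {
    create_marker(digest).await?;   // a dangling marker flags a possibly stale object
    store_bytes(digest, bytes).await?;
    load_and_commit(bytes).await?;
    remove_marker(digest).await     // removed only once everything else succeeded
}

// Periodic sweep: only timed-out markers are inspected; the stored object is
// deleted only if its digest is not referenced in the DB.
async fn sweep(markers: Vec<(String, SystemTime)>, max_age: Duration) {
    for (digest, created) in markers {
        let timed_out = created.elapsed().map(|age| age > max_age).unwrap_or(false);
        if !timed_out {
            continue;
        }
        if !digest_in_db(&digest).await {
            let _ = delete_stored_bytes(&digest).await; // steps 2/3 failed: object is stale
        }
        let _ = remove_marker(&digest).await; // a step 4/5 failure keeps the DB data
    }
}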

jcrossley3 commented 1 month ago

You're proposing we add more complexity and another race condition without removing the vulnerability. What's your period for checking the markers? An hour? I can easily fill up a bucket with junk that fast. A few ms? Just more stress on the db.

If nothing else, you've convinced me we need to load/validate before we store. If either fails, we can make a good-faith attempt to clean up (the delete API is already in place) and return an error. The db/storage stress is minimized, the data-inconsistency risk is negligible, and, most importantly, the DOS/cost risk is eliminated.

ctron commented 1 month ago

If you check first, you can still fail on the commit. You're just shifting the issue, not solving it. Maybe sketch it out.

Not sure S3 buckets get full. You can also fill it with reasonable data, as we've seen. Cleaning up every hour is just fine.

jcrossley3 commented 1 month ago

Really? You're questioning my ability to fill up an S3 bucket, seemingly oblivious to how expensive it'll be for me to try? I wouldn't put it past our QE engineers to try. I've got the Jiras to prove it.

Neither of us is "solving" the issue. We're each trading off different risks. I'm uncomfortable with the risks of your complexity and the window within which a bad actor can run up S3 charges. And you're uncomfortable with my risk of data inconsistency (correct me if I'm wrong).

Maybe sketch it out.

if db_load is ok {
  if store is ok {
    return Success
  } else {
    if db_delete is ok {
      return StorageError
    } else {
      return InconsistentDataError(enough_info_to_fix_it)
    }
  }
} else {
  return DbError()
}
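
(In Rust terms, roughly -- all names below are made up for illustration, not actual trustify types:)

// Rust-flavoured rendering of the sketch above; everything here is hypothetical.
enum IngestOutcome {
    Success,
    StorageError,
    InconsistentData { entity_id: i64 }, // enough info to fix it by hand
    DbError,
}

async fn ingest(bytes: &[u8]) -> IngestOutcome {
    let entity_id = match db_load(bytes).await {
        Ok(id) => id,
        Err(_) => return IngestOutcome::DbError,
    };
    if store(bytes).await.is_ok() {
        return IngestOutcome::Success;
    }
    // The store failed: best-effort cleanup of what we just inserted.
    match db_delete(entity_id).await {
        Ok(_) => IngestOutcome::StorageError,
        Err(_) => IngestOutcome::InconsistentData { entity_id },
    }
}

// Stand-ins so the sketch is self-contained.
async fn db_load(_bytes: &[u8]) -> Result<i64, ()> { Ok(1) }
async fn store(_bytes: &[u8]) -> Result<(), ()> { Ok(()) }
async fn db_delete(_id: i64) -> Result<(), ()> { Ok(()) }
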
ctron commented 1 month ago

I am worried exactly about this:

return InconsistentDataError(enough_info_to_fix_it)

At this point, we have an invalid state, and cannot guarantee that we can recover from it.

On the other hand, we face the challenge that "someone" could create lots of data in an S3 bucket, which we'd clean up later.

I'd always choose consistency of data over "too much data for a while".

jcrossley3 commented 1 month ago

You seem pretty firm in your stance that it doesn't matter how much S3 costs a customer might incur to ensure data consistency. Cool.

jcrossley3 commented 1 month ago

At this point, we have an invalid state, and cannot guarantee that we can recover from it.

We can absolutely report to the user which entity they need to delete in order to "recover" from our inability to delete the data we just inserted a microsecond ago. We're talking about an extremely rare possibility.

JimFuller-RedHat commented 1 month ago

FWIW - there are techniques I have seen that use a temp space in the S3 bucket: upon successful creation of the db record, we move the document from the temp space to its final resting place in the bucket. The temp space would have an expiration policy set, which would clean up documents (which presumably have no db record) after a period of time. Similar could be achieved with 2 S3 buckets, but at the cost of added infra complexity.
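
(For illustration, one way that flow could look -- the prefixes and helpers below are hypothetical, and the expiration itself would be a bucket lifecycle rule rather than application code:)

// Hypothetical sketch of the temp-space approach.
async fn ingest_via_temp_space(digest: &str, bytes: &[u8]) -> Result<(), String> {
    // 1. Write to the temp prefix; a lifecycle/expiration rule on "incoming/"
    //    eventually removes objects that never get promoted.
    put_object(&format!("incoming/{digest}"), bytes).await?;

    // 2. Create the DB record (single transaction).
    load_into_db(bytes).await?;

    // 3. Only after a successful commit, move the object to its final resting
    //    place, where no expiration applies.
    copy_object(&format!("incoming/{digest}"), &format!("documents/{digest}")).await?;
    delete_object(&format!("incoming/{digest}")).await
}

// Stand-ins for the storage/DB calls so the sketch is self-contained.
async fn put_object(_key: &str, _bytes: &[u8]) -> Result<(), String> { Ok(()) }
async fn copy_object(_from: &str, _to: &str) -> Result<(), String> { Ok(()) }
async fn delete_object(_key: &str) -> Result<(), String> { Ok(()) }
async fn load_into_db(_bytes: &[u8]) -> Result<(), String> { Ok(()) }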

Stepping back - this is not a 'forever decision', e.g. we can go one way and, if we find it's not good (with testing), fall back to the other way. I think testing will confirm the right way.

jcrossley3 commented 1 month ago

I honestly didn't think anyone who might bring this issue up in a status meeting was paying attention to our pedantry, but here we are. So let's be clear...

At this point, we have an invalid state, and cannot guarantee that we can recover from it.

I think the term "invalid state" is a reach. If we do nothing more than switch the order in the logic, i.e. load before store, and the store operation failed for some reason, the only risk is that our download API would fail, so the user would subsequently be unable to download the doc he just handed us. But all the other features of trustify would continue to work just fine. This is a credit to the design. Once ingested, we don't care about the original doc, other than to be able to conveniently return it upon request. This, imho, is only in the strictest definition of the term a "data inconsistency" error from which we'd need to "recover".

On the other hand, we face the challenge that "someone" could create lots of data in an S3 bucket, which we'd clean up later.

You realize there are access charges for S3, right? No matter how often you "clean up later", you're still incurring charges for every doc you POST and then later DELETE. Why expose ourselves to that vulnerability at all?

I'd always choose consistency of data over "too much data for a while".

Sure, but at what cost? Everything is a tradeoff.

ctron commented 1 month ago

Once ingested, we don't care about the original doc, other than to be able to conveniently return it upon request.

I'd strongly disagree here. To my understanding, we also guarantee that the document is archived, on the assumption that we'll need it for future processing. Today, for example, we rely on this for the v1 -> v2 migration.

On the other hand, we face the challenge that "someone" could create lots of data in an S3 bucket, which we'd clean up later.

You realize there are access charges for S3, right? No matter how often you "clean up later", you're still incurring charges for every doc you POST and then later DELETE. Why expose ourselves to that vulnerability at all?

Sure, that's the deal with cloud services. If someone repeatedly uploads a document, then that's something we need to deal with. It can be done through an API proxy with rate limiting, for example. But I'd see that as a problem independent of the problem at hand. If someone uploads the same faulty document over and over again, we'd only store one copy of it anyway.

jcrossley3 commented 1 month ago

I think a reasonable constraint for our app is to not store objects in S3 unless we can successfully load them into our db, because S3 is notoriously expensive.

As it stands now, a bad actor can easily generate large, unique, invalid files and store them in our S3 bucket within whatever rate limit we impose, within whatever window we periodically sweep those invalid files.

ctron commented 1 month ago

As it stands now, a bad actor can easily generate large, unique, invalid files and store them in our S3 bucket within whatever rate limit we impose, within whatever window we periodically sweep those invalid files.

A bad actor can do the same with valid files.

jcrossley3 commented 1 month ago

True. I was about to counter with "a good actor could delete them through our api", but then realized we never delete the stored docs from S3, valid or not. That's probably a bug.