kbase / dts

A data transfer service
https://kbase.github.io/dts/
MIT License

Devise a mechanism for determining whether a file already exists at a destination. #66

Open jeff-cohere opened 3 months ago

jeff-cohere commented 3 months ago

We've talked here and there about how to minimize unnecessary data transfers, and we've discussed the merits and drawbacks of various approaches. In particular, I'm not crazy about using a log to figure out whether a file is already at its destination. I'd rather ask the source of truth itself!

To that end, I'm considering an additional endpoint for the Database specification that looks files up directly by their MD5 checksums instead of going through search queries. This endpoint would accept an array of checksums and return the corresponding file IDs, with null for any checksum that isn't found.
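To make the request/response shape concrete, here is a minimal sketch of the lookup such an endpoint might perform. Everything in it is an assumption for illustration: the function names, the in-memory `known_files` mapping standing in for a real database, and the example file IDs are hypothetical and not part of the DTS Database specification.

```python
import hashlib
from typing import Dict, List, Optional


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of some bytes (what a client would send)."""
    return hashlib.md5(data).hexdigest()


def find_file_ids_by_md5(
    index: Dict[str, str], checksums: List[str]
) -> List[Optional[str]]:
    """Map each requested checksum to a file ID, or None if it is unknown.

    The result has one entry per request entry, in the same order, so the
    caller can pair answers with checksums by position.
    """
    # Normalize case and whitespace so "D41D..." and "d41d..." match.
    return [index.get(checksum.strip().lower()) for checksum in checksums]


# Hypothetical checksum -> file ID mapping standing in for the database.
known_files = {
    md5_hex(b""): "example:file-0001",
    md5_hex(b"hello"): "example:file-0002",
}

request = [md5_hex(b"hello"), md5_hex(b"not in the database")]
response = find_file_ids_by_md5(known_files, request)
```

Returning a positional array with `None` (serialized as `null`) keeps the response the same length as the request. It does not address files that have been transferred but don't yet have IDs.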

Obviously this is a very complicated problem to solve, and the above approach doesn't begin to handle all of the nastiness to do with files that have been transferred but don't yet have IDs, etc. But I think it would at least give us a solid point of departure. I think I can probably stand up a JDP file checksum search endpoint that uses JAMO.

jeff-cohere commented 2 months ago

JAMO does maintain an md5sum field in its records, but it's hard to know how many records have it populated. It's also not an indexed field, so queries that filter on it perform pretty terribly. I've had no luck getting results from JAMO queries that reference known MD5 checksums. So unless I'm overlooking something, it doesn't look like JAMO can provide this capability as things stand.
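The two concerns here (unpopulated fields and the lack of an index) can be illustrated with a small, generic sketch. It does not use or model JAMO's actual API or storage; the record layout, field values, and function names are all hypothetical, and only the `md5sum` field name comes from the discussion above.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def scan_for_md5(records: List[Record], checksum: str) -> Optional[str]:
    """Unindexed lookup: every query walks all records."""
    for record in records:
        if record.get("md5sum") == checksum:
            return record["id"]
    return None


def build_md5_index(records: List[Record]) -> Dict[str, str]:
    """Indexed lookup: one pass up front, then constant-time queries.

    Records with no md5sum are skipped, so they can never be found by
    checksum no matter how the lookup is implemented.
    """
    index: Dict[str, str] = {}
    for record in records:
        md5 = record.get("md5sum")
        if md5:
            # Keep the first ID seen if two records share a checksum.
            index.setdefault(md5, str(record["id"]))
    return index


def md5_coverage(records: List[Record]) -> float:
    """Fraction of records whose md5sum field is populated."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("md5sum")) / len(records)


# Hypothetical records; the second has no checksum recorded.
records: List[Record] = [
    {"id": "rec-1", "md5sum": "5d41402abc4b2a76b9719d911017c592"},
    {"id": "rec-2", "md5sum": None},
    {"id": "rec-3", "md5sum": "d41d8cd98f00b204e9800998ecf8427e"},
]
index = build_md5_index(records)
```

Even with an index in place, a coverage check like `md5_coverage` would be needed to know how much a "not found" answer can be trusted.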

The JAMO documentation says that it's possible to ask the team to add another index. That's an option to explore as this becomes more important.